
Why does ensemble averaging work for neural networks? This is the main idea behind things like dropout.

Consider an example of a cost surface illustrated by the following image (white means lowest cost). We have two networks, yellow and red; each network has 2 weights and adjusts them to end up in the white portion.

[Image: cost surface over the two weights; the yellow and red networks each converge to a different white (low-cost) region]

Clearly, if we average their weights after they were trained, we will end up in the middle of the weight space, where the error is very high.

  • Ensembling allows you to explore the bias-variance trade-off: you combine weak learners so that one source of error drops by more than the other rises (bagging and random forests reduce variance; boosting reduces bias). Sometimes the weak learners are created by randomly paring down a complex model, as in the case of a random forest. Understanding the bias-variance trade-off is key here; forget about neural networks. – Emre Apr 28 '18 at 00:09

1 Answer


I think there is a misunderstanding in your question. You imply that you take the average of the weights of the networks, but that is not what you should do. Instead, what you average are the predictions of the different networks. For this reason, if you average two correct predictions the result is still a correct prediction, and the problem you were thinking about does not arise.
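To make the distinction concrete, here is a minimal numpy sketch (my own illustration, not code from the answer; the toy one-layer predict function and the particular weight values are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(w, x):
        # Toy one-layer "network": weighted sum of the inputs squashed by a sigmoid.
        return 1.0 / (1.0 + np.exp(-(x @ w)))

    x = rng.normal(size=(5, 2))          # 5 examples, 2 input features

    # Two separately trained networks sitting in different low-cost regions
    # of weight space (here: roughly sign-flipped solutions).
    w_yellow = np.array([2.0, -3.0])
    w_red = np.array([-2.5, 3.5])

    # Ensembling: average the *predictions* of the two networks.
    p_ensemble = (predict(w_yellow, x) + predict(w_red, x)) / 2

    # What the question describes: average the *weights*, then predict once.
    w_middle = (w_yellow + w_red) / 2    # lands "in the middle" of weight space
    p_weight_avg = predict(w_middle, x)

    print("averaged predictions:", p_ensemble)
    print("predictions from averaged weights:", p_weight_avg)

The averaged predictions remain a sensible combination of two good predictors, whereas the averaged weights land near the middle of weight space, which is exactly the high-error region in the picture.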

David Masip
  • Thank you. For me to accept the answer, could you please tell me whether such ensembling provides better generalization, and why dropout, which takes an average over different smaller topologies, still works, contrary to my picture? Geoffrey Hinton talks about it here: https://youtu.be/vAVOY8frLlQ?t=4m17s

    Am I right in assuming that averaging the predictions will reduce performance during deployment?

    – Kari Apr 28 '18 at 03:01
  • Ensembling will provide better generalization, and it will reduce your runtime performance unless you parallelize it across many different computers, each with its own GPU. Dropout makes the weights more independent while still decreasing the loss. I don't think dropout corresponds to your picture because, again, you are not averaging weights; you are moving towards white zones where the network can do without some of its parameters. – David Masip Apr 28 '18 at 11:29
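On the dropout point raised in the comments, here is a minimal sketch of the usual "inverted dropout" formulation (an assumption for illustration, not code from this thread): each training pass masks out random units, so every update trains a different sub-network, and at test time the full network is run once, which approximates averaging the predictions of the many sub-networks sampled during training, in the sense discussed in the linked Hinton lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_forward(h, p_keep, training):
        # Inverted dropout on a layer of activations h.
        if training:
            mask = rng.random(h.shape) < p_keep   # each pass samples a random sub-network
            return h * mask / p_keep              # rescale so the expected activation matches
        return h                                  # test time: the full network, no mask

    h = rng.normal(size=(4, 8))                   # toy batch of hidden activations
    h_train = dropout_forward(h, p_keep=0.5, training=True)
    h_test = dropout_forward(h, p_keep=0.5, training=False)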