Why does ensemble averaging work for neural networks? This is the main idea behind techniques such as dropout.
Consider a cost surface over two weights, like the one in the image below (white means lowest cost). We have two networks, yellow and red; each has two weights and adjusts them during training, so each ends up in a white region.
Clearly, if we average their weights after training, we end up in the middle of weight space, where the cost is very high.
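A minimal numpy sketch of this failure mode, standing in for the image: the cost function and the two basin locations here are hypothetical, chosen only so that there are two separate low-cost regions with high cost between them. Each trained weight vector sits in one basin, but their average lands in the high-cost middle.

```python
import numpy as np

# Hypothetical 2-weight cost surface with two separate low-cost basins,
# mimicking the two white regions in the image.
def cost(w):
    basin_yellow = np.array([-2.0, 0.0])  # assumed minimum the yellow net finds
    basin_red = np.array([2.0, 0.0])      # assumed minimum the red net finds
    # Cost is low near either basin and high everywhere else,
    # including the midpoint between them.
    d_yellow = np.sum((w - basin_yellow) ** 2)
    d_red = np.sum((w - basin_red) ** 2)
    return min(d_yellow, d_red)

w_yellow = np.array([-2.0, 0.0])   # trained weights of the yellow network
w_red = np.array([2.0, 0.0])       # trained weights of the red network
w_avg = (w_yellow + w_red) / 2.0   # naive averaging of the weights

print(cost(w_yellow))  # 0.0 -> low cost, inside one white basin
print(cost(w_red))     # 0.0 -> low cost, inside the other white basin
print(cost(w_avg))     # 4.0 -> high cost, midpoint lies between the basins
```

The averaged weight vector lands at the origin, between the two basins, exactly as the prose describes.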
