
When RBMs are used to pre-train a deep net, as in this example, the activation function is sigmoid, which makes the math much easier.

What are the implications of switching to ReLU for the training (fine-tuning) phase after the initial weights have been learned with sigmoid activation functions?

I suppose that using tanh in one phase (pre-train or train) and sigmoid or ReLU in the other would cause great problems, but since ReLU and sigmoid are similar for small values, would switching still render the pre-training phase useless?

The question can be made more general: how much knowledge can you transfer from a neural network using sigmoid activation functions to one that is identical in structure but uses ReLU activation functions?
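For concreteness, here is a minimal NumPy sketch (my own illustration, with made-up pre-activation values rather than weights from an actual RBM) comparing what the two functions output for the same inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical pre-activation values around zero.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

print(sigmoid(z))  # ≈ [0.119 0.378 0.5   0.622 0.881]
print(relu(z))     # ≈ [0.    0.    0.    0.5   2.   ]
```

Around zero the two outputs differ by roughly 0.5, which is part of what the question is asking about.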

Radu Ionescu
  • ReLU(0) = 0, sigmoid(0) = 0.5. I don't think they are similar for small values, at least not without adjusting the bias terms. – Neil Slater Jun 27 '16 at 16:03
  • @NeilSlater You are correct. I was focused more on activation, in the sense that positive input values activate both of them, even if the resulting activation values are somewhat different. – Radu Ionescu Jun 28 '16 at 07:43
  • The question is whether RBMs pre-trained with the sigmoid function are valid and good initializers for a deep net with ReLU activations. – Radu Ionescu Jun 28 '16 at 07:57

1 Answer


Since an RBM has only one layer of weights, why bother changing sigmoid to ReLU in a one-layer net? Vanishing gradients are very unlikely to be a problem in such a shallow network.
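Mechanically, the transfer the question asks about is just reusing the learned parameters; here is a small NumPy sketch (hypothetical layer sizes, not code from the thread) of initializing a ReLU layer from sigmoid-RBM weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

# Pretend W_rbm, b_rbm were learned by contrastive divergence with
# sigmoid hidden units (hypothetical shapes: 784 visible, 500 hidden).
W_rbm = rng.normal(scale=0.01, size=(784, 500))
b_rbm = np.zeros(500)

# Use them as the initial weights of the first layer, then fine-tune
# the whole net with ReLU instead of sigmoid.
W1, b1 = W_rbm.copy(), b_rbm.copy()
x = rng.random((32, 784))      # a batch of inputs
h = relu(x @ W1 + b1)          # forward pass during the ReLU training phase
```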

You can also train a Gaussian-Bernoulli or Gaussian-Gaussian RBM (more here), which uses an identity activation function; that is closer to ReLU than sigmoid and is better justified when your data are real-valued rather than binary. However, these types of networks are somewhat less stable to train because of the unconstrained activation.
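A rough CD-1 sketch (my own, assuming unit-variance Gaussian visible units and hypothetical sizes) of a Gaussian-Bernoulli RBM, showing where the identity (linear) activation enters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_vis, n_hid, lr = 784, 256, 1e-3
W = rng.normal(scale=0.01, size=(n_vis, n_hid))
a = np.zeros(n_vis)            # visible biases
b = np.zeros(n_hid)            # hidden biases

def cd1_grad(v0):
    """One CD-1 gradient estimate for W (bias updates omitted for brevity)."""
    # Hidden units stay binary: sigmoid activation, then Bernoulli sampling.
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Visible reconstruction uses the identity activation: the conditional
    # mean is just a + h W^T, with unit-variance Gaussian noise added.
    v1 = a + h0 @ W.T + rng.normal(size=v0.shape)
    p_h1 = sigmoid(v1 @ W + b)
    return v0.T @ p_h0 - v1.T @ p_h1

batch = rng.random((32, n_vis))
W += lr * cd1_grad(batch) / batch.shape[0]
```

The unconstrained linear reconstruction is also what tends to make training less stable, usually forcing a smaller learning rate than in the binary-binary case.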

yell