Deep Learning Networks: Advantages of ReLU over Sigmoid Function
This was originally posted as a question on StackExchange. The current state of the art in non-linearities is to use rectified linear units (ReLU) instead of the sigmoid function in deep neural networks. What are the advantages? I know that training a network with ReLU is faster, and that ReLU is more biologically inspired, but what are the other advantages? (That is, what are the disadvantages of using the sigmoid?)
Below is the best answer.
Sigmoid: activations do not blow up (the output is bounded between 0 and 1).
ReLU: does not suffer from the vanishing gradient problem (for positive inputs).
ReLU: more computationally efficient than sigmoid-like functions, since ReLU just computes max(0, x) and avoids the expensive exponential operations required by sigmoids.
ReLU: in practice, networks with ReLU tend to show better convergence than sigmoid networks (Krizhevsky et al.).
Sigmoid: gradients tend to vanish, because the gradient shrinks as the magnitude of "a" grows, where "a" is the input to the sigmoid. The gradient of the sigmoid is S′(a) = S(a)(1 − S(a)). As "a" grows large, S(a) → 1, so S′(a) = S(a)(1 − S(a)) → 1 × (1 − 1) = 0; likewise, as "a" → −∞, S(a) → 0 and the gradient vanishes again.
ReLU: activations can blow up, since there is no mechanism to constrain the output of the neuron; for positive inputs, "a" itself is the output.
ReLU: the dying-ReLU problem. If too many pre-activations fall below zero, most of the units (neurons) in a ReLU network will output zero, in other words "die", thereby preventing learning. (This can be mitigated, to some extent, by using Leaky ReLU instead.)
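The two gradient claims above can be checked numerically. A minimal sketch in plain Python (not part of the original answer): the sigmoid gradient S′(a) = S(a)(1 − S(a)) shrinks toward zero as "a" grows, while the ReLU gradient stays at exactly 1 for any positive input.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    # S'(a) = S(a) * (1 - S(a)); maximal at a = 0, where it equals 0.25.
    s = sigmoid(a)
    return s * (1.0 - s)

def relu_grad(a):
    # ReLU's derivative is 1 for positive inputs, 0 for negative inputs.
    return 1.0 if a > 0 else 0.0

# The sigmoid gradient collapses quickly as the input grows:
# sigmoid_grad(0.0) = 0.25, sigmoid_grad(5.0) is under 0.01,
# and sigmoid_grad(10.0) is roughly 4.5e-5 -- effectively zero.
# The ReLU gradient is 1.0 at every positive input, no matter how large.
for a in (0.0, 5.0, 10.0):
    print(a, sigmoid_grad(a), relu_grad(a))
```

In a deep network these per-layer gradients are multiplied together by the chain rule, so sigmoid factors far below 1 shrink the overall gradient exponentially with depth, while ReLU's factor of 1 on active paths passes the gradient through unchanged.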
Read full discussion here.