The Dying ReLU Problem, Clearly Explained

Keep your neural network alive by understanding the downsides of ReLU

Mar 30, 2021

(1) What is ReLU and what are its advantages?
(2) What’s the Dying ReLU problem?
(3) What causes the Dying ReLU problem?
(4) How to solve the Dying ReLU problem?

Activation functions are mathematical functions that define how the weighted sum of a node's inputs is transformed into an output, and they are key parts of an artificial neural network (ANN) architecture.

Activation functions add non-linearity to a neural network, allowing the network to learn complex patterns in the data. The choice of activation function has a significant impact on an ANN’s performance, and one of the most popular choices is the Rectified Linear Unit (ReLU).

What is ReLU, and what are its advantages?

The Rectified Linear Unit (ReLU) activation function can be described as:

f(x) = max(0, x)

In other words: (i) for negative input values, the output is 0; (ii) for positive input values, the output equals the input value.
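As a minimal sketch, the definition above can be written in a few lines of plain Python. The helper names relu and relu_grad are illustrative choices, and treating the slope at exactly 0 as 0 is a common convention rather than something the formula itself dictates.

```python
def relu(x: float) -> float:
    """ReLU: return the input if it is positive, otherwise 0."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
slopes = [relu_grad(x) for x in inputs]

for x, y, s in zip(inputs, outputs, slopes):
    print(f"x = {x:5.1f}   relu(x) = {y:4.1f}   slope = {s:3.1f}")
```

The slope column matters later on: it is the zero slope on the negative side that makes the dying ReLU problem possible.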

Graphic representation of ReLU activation function

ReLU has gained massive popularity because of several key advantages:

Networks using ReLU tend to train faster, and ReLU is computationally less expensive than other common activation functions (e.g., tanh, sigmoid). Because it outputs 0 whenever its input is negative, fewer neurons are activated, leading to network sparsity and thus higher computational efficiency.
ReLU involves simpler mathematical operations compared to tanh and sigmoid, thereby boosting its computational performance further.
tanh and sigmoid functions are prone to the vanishing gradient problem, where gradients shrink drastically in backpropagation such that the network is no longer able to learn. ReLU avoids this by preserving the gradient since: (i) its linear portion (in positive input range) allows gradients to flow well on active paths of neurons and remain proportional to node activations (ii) it is an unbounded function (i.e., no max value).
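To make the vanishing gradient point concrete, here is a rough sketch that multiplies the local slope of an activation function across 10 layers, as the chain rule would. It is a simplification under stated assumptions: the weight factors that also appear in backpropagation are left out, the sigmoid is evaluated at 0 (where its slope is largest, 0.25), and the depth of 10 is arbitrary.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_gradient(local_grad, z: float, depth: int) -> float:
    """Multiply the same local slope `depth` times (weights deliberately ignored)."""
    product = 1.0
    for _ in range(depth):
        product *= local_grad(z)
    return product


depth = 10
sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, depth)  # best case for sigmoid
relu_chain = chained_gradient(relu_grad, 1.0, depth)        # an active ReLU path

print(f"sigmoid: {sigmoid_chain:.2e}")
print(f"ReLU:    {relu_chain:.2e}")
```

Even in its best case, the sigmoid factor shrinks to roughly one millionth after 10 layers, while the factor along an active ReLU path stays at 1.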

What’s the Dying ReLU problem?

The dying ReLU problem refers to the scenario when many ReLU neurons only output values of 0. The red outline below shows that this happens when the inputs are in the negative range.

Red outline (in the negative x range) demarcating the horizontal segment where ReLU outputs 0

While this characteristic gives ReLU its strengths (through network sparsity), it becomes a problem when most of the inputs to these ReLU neurons are in the negative range. The worst-case scenario is when the entire network dies, meaning that it becomes just a constant function.

When most of these neurons output zero, the gradients fail to flow through them during backpropagation, and their weights are not updated. Ultimately, a large part of the network becomes inactive and is unable to learn further.

Because the slope of ReLU in the negative input range is also zero, once a neuron becomes dead (i.e., stuck in the negative range and giving output 0), it is unlikely to recover.

However, the dying ReLU problem does not happen all the time since the optimizer (e.g., stochastic gradient descent) considers multiple input values each time. As long as NOT all the inputs push ReLU to the negative segment (i.e., some inputs are in the positive range), the neurons can stay active, the weights can get updated, and the network can continue learning.
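The difference between a dead neuron and one that is only partly inactive can be sketched with a single ReLU neuron y = relu(w * x + b). Everything here is an assumption made for illustration: the squared-error loss, the four data points, and the two (w, b) settings.

```python
def relu_grad(z: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def batch_gradients(w, b, xs, targets):
    """Mean gradients of 0.5 * (relu(w*x + b) - t)**2 with respect to w and b."""
    grad_w = grad_b = 0.0
    for x, t in zip(xs, targets):
        z = w * x + b
        y = max(0.0, z)
        upstream = (y - t) * relu_grad(z)  # zero whenever z <= 0
        grad_w += upstream * x
        grad_b += upstream
    n = len(xs)
    return grad_w / n, grad_b / n


def count_active(w, b, xs):
    """Number of inputs that land on the positive (active) side of ReLU."""
    return sum(1 for x in xs if w * x + b > 0)


xs = [0.5, 1.0, 2.0, 3.0]
targets = [1.0, 2.0, 4.0, 6.0]

dead = (-1.0, -5.0)          # every input gives a negative pre-activation
partly_active = (1.0, -1.5)  # two of the four inputs are still positive

dead_grads = batch_gradients(*dead, xs, targets)
live_grads = batch_gradients(*partly_active, xs, targets)

print("dead:         ", count_active(*dead, xs), dead_grads)
print("partly active:", count_active(*partly_active, xs), live_grads)
```

For the dead setting both gradients are exactly zero, so no number of update steps will move w or b. For the partly active setting, the two positive inputs still contribute a non-zero gradient, so the neuron can keep learning.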

What causes the Dying ReLU problem?

The dying ReLU problem is commonly driven by these two factors:

(i) High learning rate

Let us first look at the equation for the update step in backpropagation:

W_new = W_old − α × (gradient of the loss with respect to W_old)

Equation for update rule (Image by author)

If our learning rate (α) is set too high, there is a significant chance that our new weights will end up in a highly negative range, since a large value will be subtracted from our old weights. These negative weights result in negative inputs to ReLU, thereby causing the dying ReLU problem.

Note: Recall that the input to the activation function is (W * x) + b.
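Here is a small worked example of how a single oversized update step can kill a neuron. It is a sketch under assumptions: one ReLU neuron y = relu(w * x + b), a squared-error loss, three made-up data points, and two learning rates (0.01 and 1.0) chosen only to contrast the outcomes.

```python
def gradient_step(w, b, xs, targets, lr):
    """One gradient descent step for y = relu(w*x + b) with loss 0.5 * (y - t)**2."""
    grad_w = grad_b = 0.0
    for x, t in zip(xs, targets):
        z = w * x + b
        y = max(0.0, z)
        slope = 1.0 if z > 0 else 0.0
        grad_w += (y - t) * slope * x
        grad_b += (y - t) * slope
    n = len(xs)
    # new value = old value - learning rate * mean gradient
    return w - lr * grad_w / n, b - lr * grad_b / n


def count_active(w, b, xs):
    """Number of inputs that land on the positive (active) side of ReLU."""
    return sum(1 for x in xs if w * x + b > 0)


xs = [1.0, 2.0, 3.0]
targets = [0.5, 1.0, 1.5]
w_old, b_old = 2.0, 0.0  # starts too large, so the gradient is positive

w_small, b_small = gradient_step(w_old, b_old, xs, targets, lr=0.01)
w_large, b_large = gradient_step(w_old, b_old, xs, targets, lr=1.0)

active_small = count_active(w_small, b_small, xs)
active_large = count_active(w_large, b_large, xs)

print("lr = 0.01:", w_small, b_small, "active inputs:", active_small)
print("lr = 1.0: ", w_large, b_large, "active inputs:", active_large)
```

With the small learning rate, the weight moves slightly (from 2.0 to about 1.93) and all three inputs remain active. With the large learning rate, the weight jumps from 2.0 to -5.0 and the bias from 0.0 to -3.0, so every input now produces a negative pre-activation and the neuron outputs 0 for all of them.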

(ii) Large negative bias

Illustration of a simple neural network (Image by author)

While we have mostly talked about weights so far, we must not forget that the bias term is also passed along with the weights into the activation function.

Bias is a constant value added to the product of inputs and weights. Given its involvement, a large negative bias term can cause the ReLU activation inputs to become negative. This, as already described, causes the neurons to consistently output 0, leading to the dying ReLU problem.
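The effect of the bias can be checked with a quick count of how many inputs still activate a neuron as the bias becomes more negative. The weight of 1.0, the five inputs, and the three bias values are arbitrary choices for this illustration.

```python
def active_count(w, b, xs):
    """Number of inputs for which the ReLU input (w * x + b) is positive."""
    return sum(1 for x in xs if w * x + b > 0)


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
w = 1.0

active_by_bias = {b: active_count(w, b, xs) for b in (0.0, -1.0, -10.0)}

for b, n_active in active_by_bias.items():
    print(f"bias = {b:6.1f}   active inputs: {n_active} of {len(xs)}")
```

With a bias of -10.0, none of the five inputs reach the positive range, so this neuron would output 0 for all of them.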

How to solve the Dying ReLU problem?

There are several ways to tackle the dying ReLU problem:

(i) Use of a lower learning rate

Since a large learning rate results in a higher likelihood of negative weights (thereby increasing chances of dying ReLU), it can be a good idea to decrease the learning rate during the training process.
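One simple way to decrease the learning rate during training is a step decay schedule, sketched below. This is only one of many possible schedules. The function name step_decay and the default values (halving every 10 epochs) are assumptions made for this example, not recommendations.

```python
def step_decay(initial_lr: float, epoch: int, drop: float = 0.5, every: int = 10) -> float:
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return initial_lr * drop ** (epoch // every)


schedule = {epoch: step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)}

for epoch, lr in schedule.items():
    print(f"epoch {epoch:2d}: learning rate = {lr}")
```

Deep learning frameworks usually provide their own schedulers, so in practice a hand-written function like this is rarely needed.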

(ii) Variations of ReLU

Since the flat section in the negative input range causes the dying ReLU problem, a natural instinct would be to consider ReLU variations that adjust this flat segment.

Leaky ReLU is a common and effective way to address the dying ReLU problem. It adds a slight slope in the negative range, so the function produces small negative outputs when the input is less than 0.

Graphical comparison of ReLU and Leaky ReLU (Image by author)
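A minimal sketch of Leaky ReLU and its slope in plain Python follows. The negative-side slope of 0.01 is an assumed value (a commonly used default), and the function names are illustrative.

```python
def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky ReLU: identity for positive inputs, a small linear slope otherwise."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Slope of Leaky ReLU: 1 for positive inputs, `negative_slope` otherwise."""
    return 1.0 if x > 0 else negative_slope


inputs = [-2.0, -0.5, 0.5, 2.0]
outputs = [leaky_relu(x) for x in inputs]
slopes = [leaky_relu_grad(x) for x in inputs]

for x, y, s in zip(inputs, outputs, slopes):
    print(f"x = {x:5.1f}   leaky_relu(x) = {y:7.3f}   slope = {s}")
```

Because the slope on the negative side is small but not zero, a neuron whose inputs are all negative still receives a gradient and can move back toward the active range.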

Other variations include parametric ReLU (PReLU), exponential linear unit (ELU), and Gaussian error linear unit (GELU). Their details are beyond the scope of this article, but they share a common objective: preventing the dying ReLU problem by avoiding zero-slope segments.

(iii) Modification of initialization procedure

A common way of initializing the weights and biases of a neural network is to draw them from symmetric probability distributions (e.g., He initialization). However, such a method is prone to the dying ReLU problem due to bad local minima.

It has been demonstrated that using a randomized asymmetric initialization can help prevent the dying ReLU problem. Do check out the arXiv paper for the mathematical details.

Conclusion

With ReLU widely used in popular ANNs like multilayer perceptrons and convolutional neural networks, this article addresses the theoretical concept, practical significance, and potential solutions to the dying ReLU problem.

Before You Go

I welcome you to join me on a data science learning journey! Follow this Medium page or check out my GitHub to stay in the loop of more exciting data science content. Meanwhile, have fun applying ReLU in your networks!

Written By

Kenneth Leung

URL: https://towardsdatascience.com/the-dying-relu-problem-clearly-explained-42d0c54e0d24/