1. Why does the vanishing gradient problem happen?
As the back-propagation algorithm runs, the weights in the earliest layers of the network are updated with the gradient of the loss, $w \leftarrow w - \eta \,\partial L / \partial w$. By the chain rule, the deeper the network is, the more multiplication operations are stacked together, which means more local gradients must be multiplied to calculate the update step $\eta \,\partial L / \partial w$ for an early layer. If the activation function is a saturating one such as the sigmoid, its gradient approaches 0 whenever the absolute value of the output of the linear transformation is large, so the update steps of the earlier layers also approach 0 and those layers cannot be updated effectively.
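To make the stacked multiplication concrete, here is a minimal NumPy sketch (the depth of 20 layers and the pre-activation value of 2.5 are arbitrary choices for illustration) that multiplies one sigmoid derivative per layer and shows the accumulated factor collapsing toward zero:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0, decays toward 0 elsewhere

# Pretend each layer's linear transformation outputs a moderately large value;
# the chain rule then multiplies one sigmoid derivative per layer.
pre_activation = 2.5
factor = 1.0
for layer in range(1, 21):
    factor *= sigmoid_grad(pre_activation)
    print(f"after {layer:2d} layers, accumulated gradient factor = {factor:.3e}")
# The factor shrinks exponentially, so the update steps of the earliest layers vanish.
```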
2. What is the vanishing gradient problem?
Just as described above, the gradients of the earlier layers shrink toward 0 exponentially because of the saturating activation functions chosen in the later layers and the large number of such layers stacked on top of them. As a result, the weights in the initial layers cannot be updated effectively during training, so the whole network ends up inaccurate.
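One way to observe this symptom directly is to compare per-layer gradient norms in a deep sigmoid network. The sketch below is a hypothetical setup (the depth, widths, random data, and MSE loss are all assumptions made for illustration) using PyTorch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A 20-block MLP where every hidden layer is followed by a sigmoid.
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers += [nn.Linear(32, 1)]
model = nn.Sequential(*layers)

x = torch.randn(64, 32)
y = torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

print("first layer grad norm:", model[0].weight.grad.norm().item())
print("last  layer grad norm:", model[-1].weight.grad.norm().item())
# The first layer's gradient is typically orders of magnitude smaller,
# which is exactly the vanishing gradient problem described above.
```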
3. How can we solve this problem?
3.1. Avoid Saturating Activations
As we all know, activations such as ReLU or Leaky ReLU do not tend to cause vanishing gradients, because their derivative stays at 1 over the positive range instead of saturating. So feel free to switch to a more suitable activation.
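A rough comparison of the two derivatives (the sample pre-activation values below are made up for illustration) shows why: the sigmoid derivative collapses as the input grows, while the ReLU derivative stays at 1 for any positive input.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigmoid' = {sigmoid_grad(x):.5f}   relu' = {relu_grad(x):.1f}")
```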
3.2. Residual Connections
If we use a residual block $y = F(x) + x$, the residual connection leads to $\partial y / \partial x = \partial F(x) / \partial x + 1$. This addition operation lets the gradient skip the block that squashes the derivatives: even if $\partial F(x) / \partial x$ is close to 0, the $+1$ term keeps the gradient flowing to earlier layers.
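Below is a minimal sketch of such a block in PyTorch (the two-layer body with a sigmoid in between is just an assumed example of a sub-network that squashes derivatives), confirming with autograd that the gradient through the skip connection stays close to 1:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = F(x) + x, where F squashes signals through a sigmoid."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.Sigmoid(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.body(x) + x  # the skip connection adds the identity path

# The Jacobian of y w.r.t. x is dF/dx + I, so even when dF/dx is tiny,
# the gradient arriving at x stays close to 1 instead of being squashed.
x = torch.randn(1, 8, requires_grad=True)
block = ResidualBlock(8)
block(x).sum().backward()
print(x.grad)  # entries near 1 rather than near 0
```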
3.3. Normalize Outputs into a 'Sweet' Range
We can apply batch normalization to pre-activations whose values are too large, mapping them into a small range around zero. The normalized inputs then cannot reach the region where the corresponding activation gradients approach zero.
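The sketch below is a hypothetical illustration (the scale of 50 and the feature size of 16 are arbitrary assumptions): it applies PyTorch's BatchNorm1d to large pre-activations and compares the average sigmoid derivative before and after normalization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Pre-activations with a large scale would push a sigmoid deep into saturation.
pre_activations = 50.0 * torch.randn(256, 16)

bn = nn.BatchNorm1d(16)           # normalizes each feature over the batch
normalized = bn(pre_activations)  # roughly zero mean, unit variance

def sigmoid_grad(z):
    s = torch.sigmoid(z)
    return s * (1 - s)

print("mean sigmoid' without BN:", sigmoid_grad(pre_activations).mean().item())
print("mean sigmoid' with    BN:", sigmoid_grad(normalized).mean().item())
# After normalization the pre-activations sit near zero, where the sigmoid
# derivative is near its maximum of 0.25 rather than in the saturated region.
```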