The vanishing gradient problem
In the backpropagation algorithm, the weights are adjusted in proportion to the gradient of the error, and because of the way these gradients are computed, two problematic situations can arise:
- If the weights are small, the gradient signal shrinks layer after layer until it becomes so small that learning either slows down dramatically or stops altogether. This is often referred to as vanishing gradients.
- If the weights are large, the gradient signal grows layer after layer until it becomes so large that learning diverges. This is often referred to as exploding gradients (see the short sketch after this list).
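A minimal NumPy sketch of the mechanism (my own illustration, with arbitrarily chosen sizes and scales, not code from the text): back-propagating the same gradient signal through many layers means repeatedly multiplying it by weight matrices, and its norm either collapses or blows up depending on the magnitude of those weights.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=8)                 # initial gradient signal

# Illustrative "small" and "large" weight matrices (scales chosen by hand).
W_small = 0.1 * rng.normal(size=(8, 8))
W_large = 1.0 * rng.normal(size=(8, 8))

g_small, g_large = grad.copy(), grad.copy()
for _ in range(50):                       # 50 layers of backpropagation
    g_small = W_small.T @ g_small         # norm shrinks each step
    g_large = W_large.T @ g_large         # norm grows each step

print(np.linalg.norm(g_small))            # tiny   -> vanishing gradient
print(np.linalg.norm(g_large))            # huge   -> exploding gradient
```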
The vanishing and exploding gradient problem also afflicts RNNs. In fact, BPTT unrolls the RNN into a very deep feed-forward neural network, with one layer per time step. The inability of a plain RNN to retain long-term context is due precisely to this phenomenon: if the gradient vanishes or explodes within a few layers, the network cannot learn relationships between inputs that are far apart in time.
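To make the unrolling argument concrete, here is a hedged sketch (again my own illustration, with an assumed tanh recurrence and a hand-picked recurrent matrix, not the text's code): in BPTT the gradient that links time step t back to time step 0 is a product of per-step Jacobians, and with a tanh activation and a modestly scaled recurrent matrix that product decays quickly, so distant time steps contribute almost nothing to the update.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16
# Illustrative recurrent weight matrix, scaled so its spectral radius is < 1.
W_hh = 0.8 * rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)

h = rng.normal(size=hidden)
J = np.eye(hidden)                        # accumulated Jacobian d h_t / d h_0
for t in range(30):
    h = np.tanh(W_hh @ h)                 # simple recurrence h_t = tanh(W_hh h_{t-1})
    # Jacobian of this step: diag(1 - tanh(.)^2) @ W_hh
    J = (np.diag(1.0 - h**2) @ W_hh) @ J
    if (t + 1) % 10 == 0:
        print(f"t={t+1:2d}  ||d h_t / d h_0|| = {np.linalg.norm(J):.2e}")
```

The printed norms shrink by orders of magnitude as t grows, which is exactly why a plain RNN struggles to learn dependencies separated by many time steps.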