
Commit 5d78e37

Merge pull request #4 from sandra-haerin-ha/rnn
Rnn
2 parents f285a6c + 7d3fef2 commit 5d78e37

File tree

1 file changed (+14, -9 lines)


rnn.md

Lines changed: 14 additions & 9 deletions
@@ -207,7 +207,7 @@ So far we have seen only a simple recurrence formula for the Vanilla RNN. In pra
 rarely ever use the Vanilla RNN formula. Instead, we will use what we call a Long Short-Term Memory (LSTM)
 RNN.
 
-### Vanilla RNN Gradient Flow
+### Vanilla RNN Gradient Flow & Vanishing Gradient Problem
 An RNN block takes in the input $$x_t$$ and the previous hidden representation $$h_{t-1}$$ and learns a transformation, which is then passed through tanh to produce the hidden representation $$h_{t}$$ for the next time step and the output $$y_{t}$$, as shown in the equation below.
 
 $$ h_t = tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
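To make the recurrence above concrete, here is a minimal numpy sketch of a single vanilla RNN step. It is not part of the committed rnn.md; the toy dimensions and random initialization are assumptions made only for illustration:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh):
    """One vanilla RNN step: h_t = tanh(W_hh @ h_prev + W_xh @ x_t)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t)

# Toy sizes, chosen only for the example.
hidden_dim, input_dim, T = 8, 4, 20
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))

h = np.zeros(hidden_dim)
for t in range(T):
    x_t = rng.normal(size=input_dim)
    h = rnn_step(x_t, h, W_hh, W_xh)   # h carries information forward through time
```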
@@ -223,15 +223,20 @@ We update the weights $$W$$ by getting the derivative of the loss at the very la
 = \frac{\partial L_{t}}{\partial h_{t}}(\prod_{t=2}^{T} tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)W_{hh}^{T-1})\frac{\partial h_{1}}{\partial W}
 \end{aligned}
 $$
-* Vanishing gradient: We see that $$tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)$$ will almost always be less than 1 because tanh is always between negative one and one.
-Thus, as $$t$$ gets larger (i.e. longer timesteps), the gradient ($$\frac{\partial L_{t}}{\partial W} $$) will descrease in value and get close to zero.
+* **Vanishing gradient:** We see that $$tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)$$ will almost always be less than 1 because tanh is always between negative one and one. Thus, as $$t$$ gets larger (i.e. longer timesteps), the gradient ($$\frac{\partial L_{t}}{\partial W}$$) will decrease in value and get close to zero.
 This will lead to the vanishing gradient problem, where gradients at future time steps rarely impact gradients at the very first time step. This is problematic when we model long sequences of inputs because the updates will be extremely slow.
 
+* **Removing the non-linearity (tanh):** If we remove the non-linearity (tanh) to solve the vanishing gradient problem, then we will be left with
+$$
+\begin{aligned}
+\frac{\partial L_{t}}{\partial W} = \frac{\partial L_{t}}{\partial h_{t}}(\prod_{t=2}^{T} W_{hh}^{T-1})\frac{\partial h_{1}}{\partial W}
+\end{aligned}
+$$
+* **Exploding gradients:** If the largest singular value of $$W_{hh}$$ is greater than 1, then the gradients will blow up and the model will receive very large gradients coming back from future time steps. Exploding gradients often lead to gradients that are NaN.
+* **Vanishing gradients:** If the largest singular value of $$W_{hh}$$ is smaller than 1, then we will have the vanishing gradient problem mentioned above, which will significantly slow down learning.
+
+In practice, we can treat the exploding gradient problem with gradient clipping, which clips large gradient values to a maximum threshold. However, since the vanishing gradient problem still exists whenever the largest singular value of the $$W_{hh}$$ matrix is less than one, the LSTM was designed to avoid this problem.
 
-Now, of course, you might
-ask, what if we just got rid of this nonlinearity?
-
-
-### Gradient Flow
 
-### Vanishing gradient problem
+
+
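The singular-value argument in the added bullets can be checked numerically for the no-tanh (purely linear) case. The sketch below is not from the commit; the symmetric toy matrix, the 0.9/1.1 scales, and T = 50 are assumptions chosen only to make the effect visible:

```python
import numpy as np

def grad_norm_after_T_steps(W_hh, T):
    """Norm of a gradient after T multiplications by W_hh^T, i.e. the backward
    pass of a purely linear RNN (the no-tanh case discussed above)."""
    g = np.ones(W_hh.shape[0])
    for _ in range(T):
        g = W_hh.T @ g
    return np.linalg.norm(g)

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 16))
W = (A + A.T) / 2                        # symmetric, so largest singular value = spectral radius
sigma_max = np.linalg.svd(W, compute_uv=False)[0]

for scale in (0.9, 1.1):                 # largest singular value just below vs. just above 1
    W_hh = scale * W / sigma_max
    print(scale, grad_norm_after_T_steps(W_hh, T=50))
# scale 0.9 -> the norm collapses toward zero (vanishing gradients);
# scale 1.1 -> the norm grows rapidly (exploding gradients).
# With tanh in place, each step also multiplies by tanh'(a_t) <= 1, which only
# makes the vanishing case worse.
```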

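The added paragraph describes gradient clipping as capping large gradient values at a maximum threshold. A common norm-based variant is sketched below; it is not from the committed file, and the threshold of 5.0 is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient so its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Example: a blown-up gradient of norm 100 gets rescaled to norm 5.
g = np.full(100, 10.0)                      # ||g|| = 100
print(np.linalg.norm(clip_gradient(g)))     # -> 5.0
```

Clipping bounds the exploding case but does nothing for vanishing gradients, which is why the text points to the LSTM.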