
Commit 3369451

Create rnn.md
1 parent 67598e0 commit 3369451

File tree: 1 file changed (+26, -1 lines)

rnn.md

Lines changed: 26 additions & 1 deletion
@@ -59,7 +59,7 @@ RNN will show different behaviors in terms of how its state evolves as it receiv
We are also interested in producing an output based on the RNN state, so we generate these output vectors on top of the RNN (as depicted in Figure 2).

If we unroll an RNN model (Right of Figure 2), then there are inputs (e.g. video frames) at different timesteps, shown as $$x_1, x_2, x_3, \dots, x_t$$.
The RNN at each timestep takes in two inputs -- an input frame ($$x_i$$) and the previous representation of what it has seen so far (i.e. its history) -- to generate an output $$y_i$$ and update its history, which gets propagated forward through time. All the RNN blocks in Figure 2 (Right) are the same block sharing the same parameters, but they receive different inputs and history at each timestep.

<div class="fig figcenter fighighlight">
<img src="/assets/rnn/rnn_blackbox.png" width="16%" >
@@ -207,6 +207,31 @@ So far we have seen only a simple recurrence formula for the Vanilla RNN. In pra
rarely ever use the Vanilla RNN formula. Instead, we will use what we call a Long Short-Term Memory (LSTM)
RNN.

### Vanilla RNN Gradient Flow
An RNN block takes in the input $$x_t$$ and the previous hidden representation $$h_{t-1}$$, applies a learned transformation, and passes the result through a tanh nonlinearity to produce the hidden representation $$h_{t}$$ for the next timestep and the output $$y_{t}$$, as shown in the equation below.
$$ h_t = tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
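
As a minimal sketch of this recurrence (assuming NumPy; the names `Wxh`, `Whh`, `Why`, the output projection $$y_t = W_{hy}h_t$$, and all shapes are illustrative choices, with biases omitted to match the equation):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, Why):
    """One vanilla RNN step: h_t = tanh(Whh h_prev + Wxh x_t), y_t = Why h_t."""
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)  # new hidden representation (the history)
    y_t = Why @ h_t                          # output read off the hidden state
    return h_t, y_t

# The same parameters are reused at every timestep; only x_t and the carried history change.
input_dim, hidden_dim, output_dim, T = 4, 8, 3, 10
rng = np.random.default_rng(0)
Wxh = 0.1 * rng.standard_normal((hidden_dim, input_dim))
Whh = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
Why = 0.1 * rng.standard_normal((output_dim, hidden_dim))

h = np.zeros(hidden_dim)                          # h_0: empty history
for x_t in rng.standard_normal((T, input_dim)):   # e.g. a sequence of frame features
    h, y_t = rnn_step(x_t, h, Wxh, Whh, Why)
```

This makes the weight sharing from Figure 2 explicit: every timestep calls the same block with the same matrices, and only the input and the carried history change.
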
For backpropagation, let's examine how the loss at the very last timestep affects the weights at the very first timestep.
The partial derivative of $$h_t$$ with respect to $$h_{t-1}$$ is written as:
$$ \frac{\partial h_t}{\partial h_{t-1}} = tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)W_{hh} $$
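
For reference, the tanh derivative that appears in this Jacobian is bounded between 0 and 1 (a standard identity, stated here because the argument below relies on it):

$$ tanh^{'}(z) = 1 - tanh^{2}(z) \in (0, 1] $$
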
We update the weights $$W$$ by taking the derivative of the loss at the very last timestep, $$L_{t}$$, with respect to $$W$$:
$$
\begin{aligned}
\frac{\partial L_{t}}{\partial W} &= \frac{\partial L_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \dots \frac{\partial h_{1}}{\partial W} \\
&= \frac{\partial L_{t}}{\partial h_{t}} \left( \prod_{k=2}^{t} \frac{\partial h_{k}}{\partial h_{k-1}} \right) \frac{\partial h_{1}}{\partial W} \\
&= \frac{\partial L_{t}}{\partial h_{t}} \left( \prod_{k=2}^{t} tanh^{'}(W_{hh}h_{k-1} + W_{xh}x_k) \right) W_{hh}^{t-1} \frac{\partial h_{1}}{\partial W}
\end{aligned}
$$
* Vanishing gradient: We see that $$tanh^{'}(W_{hh}h_{t-1} + W_{xh}x_t)$$ will almost always be less than 1, because the derivative of tanh lies in $$(0, 1]$$ and equals 1 only when its input is exactly 0.
Thus, as $$t$$ gets larger (i.e. over longer timesteps), the gradient $$\frac{\partial L_{t}}{\partial W}$$ will decrease in value and approach zero.
This leads to the vanishing gradient problem, where gradients from later timesteps barely reach the very first timesteps. This is problematic when we model long sequences of inputs, because the corresponding weight updates become extremely slow.
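
As a rough numerical sketch of this effect (assuming NumPy, with random weights and inputs chosen purely for illustration), we can multiply the per-step Jacobians $$\frac{\partial h_{k}}{\partial h_{k-1}}$$ together and watch the norm of the product collapse as the number of timesteps grows. Here $$W_{hh}$$ is rescaled to spectral norm 1 so that any decay comes from the $$tanh^{'}$$ factors alone:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim, T = 8, 4, 50

# Illustrative random parameters and inputs (assumptions, not values from these notes).
Whh = rng.standard_normal((hidden_dim, hidden_dim))
Whh /= np.linalg.norm(Whh, 2)            # rescale so ||Whh||_2 = 1
Wxh = 0.5 * rng.standard_normal((hidden_dim, input_dim))
xs = rng.standard_normal((T + 1, input_dim))

h = np.tanh(Wxh @ xs[1])                 # h_1, starting from h_0 = 0
J = np.eye(hidden_dim)                   # running product: dh_t / dh_1
for t in range(2, T + 1):
    pre = Whh @ h + Wxh @ xs[t]
    h = np.tanh(pre)
    # This step's Jacobian: diag(tanh'(pre)) @ Whh, using tanh'(z) = 1 - tanh(z)^2
    J = np.diag(1.0 - np.tanh(pre) ** 2) @ Whh @ J
    if t % 10 == 0:
        print(f"t = {t:3d}   ||dh_t/dh_1||_2 = {np.linalg.norm(J, 2):.3e}")
```

In a typical run the printed norm falls by several orders of magnitude over 50 steps, which is why the loss at the last timestep contributes almost nothing to the update through the earliest timesteps.
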
Now, of course, you might ask: what if we just got rid of this nonlinearity?

### Gradient Flow
### Vanishing gradient problem
