Commit cbb3548: Update rnn.md
1 parent 58f1405

1 file changed: rnn.md (11 additions, 4 deletions)

@@ -292,20 +292,27 @@ calculation cache that's later used with $$o$$ gate in the above formulas.
Since all the $$f, i, o$$ gate vector values range from 0 to 1 (they are squashed by the sigmoid function $$\sigma$$) and are applied by element-wise multiplication, we can see that:

* **Forget Gate:** The forget gate $$f_t$$ at time step $$t$$ controls how much information needs to be "removed" from the previous cell state $$c_{t-1}$$.
The forget gate learns to erase hidden representations from the previous time steps, which is why the LSTM keeps two
representations: the hidden state $$h_t$$ and the cell state $$c_t$$. The cell state $$c_t$$ gets propagated over time and learns whether to forget
the previous cell state or not.
* **Input Gate:** The input gate $$i_t$$ at time step $$t$$ controls how much information needs to be "added" to the next cell state $$c_t$$ from the previous hidden state $$h_{t-1}$$ and the input $$x_t$$. Unlike the "gate" gate $$g$$, which uses tanh, the input gate $$i$$ uses a sigmoid function, which converts its inputs to values between zero and one.
This serves as a switch whose values are almost always close to either zero or one. The input gate decides how much of the candidate update produced by the "gate" gate $$g$$ enters the cell state, by multiplying $$g$$ element-wise with $$i$$.
* **Output Gate:** The output gate $$o_t$$ at time step $$t$$ controls how much information needs to be "shown" as output in the current hidden state $$h_t$$. All three gates come together in the short sketch after this list.

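To make these gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM step. The function name `lstm_step`, the weight shapes, and the $$i, f, o, g$$ ordering of the pre-activation blocks are illustrative assumptions rather than the notes' reference code; the gate equations themselves follow the formulas above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step (illustrative sketch, not reference code).

    x_t:    input at time t, shape (D,)
    h_prev: previous hidden state h_{t-1}, shape (H,)
    c_prev: previous cell state c_{t-1}, shape (H,)
    Wx, Wh: weights of shape (4H, D) and (4H, H); b: bias of shape (4H,)
    """
    H = h_prev.shape[0]
    a = Wx @ x_t + Wh @ h_prev + b   # pre-activations for all four gates
    i = sigmoid(a[0:H])              # input gate: how much of g to add
    f = sigmoid(a[H:2*H])            # forget gate: how much of c_{t-1} to keep
    o = sigmoid(a[2*H:3*H])          # output gate: how much of c_t to reveal
    g = np.tanh(a[3*H:4*H])          # "gate" gate: candidate cell update
    c_t = f * c_prev + i * g         # cell state update
    h_t = o * np.tanh(c_t)           # hidden state shown as output
    return h_t, c_t
```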
The key idea of LSTM is the cell state, the horizontal line running between recurrent time steps. You can imagine the cell
state as a kind of highway of information passing straight down the entire chain, with
only some minor linear interactions. With the formulation above, it's easy for information to just flow
along this highway (Figure 5). Thus, even when many LSTM cells are chained together, we get an uninterrupted gradient flow in which the gradients flow back through the cell states instead of the hidden states $$h$$, without vanishing at every time step.

This greatly alleviates the gradient vanishing/exploding problem we outlined above. Figure 5 also shows that the gradient contains a vector of activations of the "forget" gate, which allows better control of the gradient values through suitable parameter updates of the "forget" gate.

<div class="fig figcenter fighighlight">
<img src="/assets/rnn/lstm_highway.png" width="70%" >
<div class="figcaption">Figure 5. LSTM cell state highway.</div>
</div>

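As a rough numerical illustration of this highway (a toy sketch, not code from these notes), the snippet below backpropagates a gradient through 50 time steps. Along the cell-state path the gradient is only multiplied element-wise by forget-gate activations, while the vanilla RNN path multiplies by the recurrent weight matrix and a tanh derivative at every step, so its scale is tied to the weights.

```python
import numpy as np

np.random.seed(0)
T, H = 50, 5

# Gradient of the loss with respect to the last cell state c_T.
dc = np.ones(H)

# LSTM cell-state highway: the direct cell-to-cell path contributes
# dc_{t-1} = f_t * dc_t (element-wise); paths through h_t are ignored here.
f_t = np.full(H, 0.95)   # forget-gate activations near 1 keep the path open
for _ in range(T):
    dc = f_t * dc
print("LSTM cell-state gradient after 50 steps:", dc[0])   # 0.95**50 ~ 0.077

# Vanilla RNN comparison: dh_{t-1} = Wh^T (dh_t * tanh'), a matrix multiply
# at every step, so the gradient scale depends on the spectrum of Wh.
Wh = 0.3 * np.random.randn(H, H)
dh = np.ones(H)
for _ in range(T):
    dh = Wh.T @ (dh * 0.5)   # 0.5 stands in for a typical tanh derivative
print("Vanilla RNN gradient norm after 50 steps:", np.linalg.norm(dh))
```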
### Does LSTM solve the vanishing gradient problem?
The LSTM architecture makes it easier for the RNN to preserve information over many recurrent time steps. For example,
if the forget gate is set to 1 and the input gate is set to 0, then the information of the cell state
will always be preserved over many recurrent time steps. For a Vanilla RNN, in contrast, it's much harder to preserve information
