Since all $$f, i, o$$ gate vector values range from 0 to 1 (they are squashed by the sigmoid function $$\sigma$$), when they are multiplied element-wise we can see the following (a short code sketch of a single LSTM step follows this list):

* **Forget Gate:** The forget gate $$f_t$$ at time step $$t$$ controls how much information needs to be "removed" from the previous cell state $$c_{t-1}$$. This gate learns to erase hidden representations from previous time steps, which is why the LSTM keeps two hidden representations: the hidden state $$h_t$$ and the cell state $$c_t$$. The cell state $$c_t$$ is propagated over time and learns whether or not to forget the previous cell state.
* **Input Gate:** The input gate $$i_t$$ at time step $$t$$ controls how much information needs to be "added" to the next cell state $$c_t$$ from the previous hidden state $$h_{t-1}$$ and the input $$x_t$$. Instead of tanh, the "input" gate $$i$$ uses a sigmoid function, which converts its inputs to values between zero and one. This serves as a switch, where values are almost always either close to zero or close to one. The "input" gate decides whether to take the candidate content produced by the "gate" gate $$g$$: that candidate is multiplied element-wise by $$i$$ before being written into the cell state.
* **Output Gate:** The output gate $$o_t$$ at time step $$t$$ controls how much information needs to be "shown" as output in the current hidden state $$h_t$$.
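
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM step that follows the update rules above. Treat it as an illustration rather than a reference implementation: the function name `lstm_step` and the layout that stacks all four gates into one $$4H$$-wide weight matrix are assumptions made here for compactness.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step for a batch of N inputs.

    x_t:    (N, D)  input at time t
    h_prev: (N, H)  previous hidden state
    c_prev: (N, H)  previous cell state
    Wx:     (D, 4H) input-to-hidden weights (i, f, o, g stacked)
    Wh:     (H, 4H) hidden-to-hidden weights
    b:      (4H,)   biases
    """
    H = h_prev.shape[1]
    # A single affine transform produces the pre-activations of all four gates.
    a = x_t @ Wx + h_prev @ Wh + b
    i = sigmoid(a[:, 0*H:1*H])   # input gate: how much new content to write
    f = sigmoid(a[:, 1*H:2*H])   # forget gate: how much of c_prev to keep
    o = sigmoid(a[:, 2*H:3*H])   # output gate: how much of the cell to reveal
    g = np.tanh(a[:, 3*H:4*H])   # "gate" gate: candidate new cell content
    c_t = f * c_prev + i * g     # erase with f, write new content scaled by i
    h_t = o * np.tanh(c_t)       # expose a gated view of the cell state
    return h_t, c_t
```

Because $$f$$, $$i$$, and $$o$$ are element-wise multipliers in $$[0, 1]$$, each one acts as a soft switch on its respective term, which is exactly the "remove", "add", and "show" behavior described in the list above.
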
The key idea of the LSTM is the cell state, the horizontal line running between recurrent timesteps. You can imagine the cell state as a kind of highway of information passing straight down the entire chain, with only some minor linear interactions. With the formulation above, it is easy for information to just flow along this highway (Figure 5). Thus, even when many LSTM cells are stacked together, we get an uninterrupted gradient flow: the gradients flow back through the cell states rather than the hidden states $$h$$, without vanishing at every time step.

This greatly mitigates the gradient vanishing/exploding problem we outlined above. Figure 5 also shows that the gradient contains a vector of activations of the "forget" gate, which allows finer control of the gradient values through suitable parameter updates of the "forget" gate.
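
To see why the cell-state path behaves this way, differentiate the cell update $$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$ with respect to the previous cell state. The sketch below treats the gate activations as constants with respect to $$c_{t-1}$$, which ignores their indirect dependence through $$h_{t-1}$$ but captures the dominant path:

$$
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\qquad\Longrightarrow\qquad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
$$

Unlike the vanilla RNN, where backpropagation through time repeatedly multiplies by the same weight matrix and the derivative of tanh, this product is just a chain of forget-gate activations, so as long as the network learns to keep $$f_t$$ close to 1 the gradient can travel across many time steps without vanishing.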