Since all $$f, i, o$$ gate vector values range from 0 to 1 (they are squashed by the sigmoid function $$\sigma$$), when they are multiplied element-wise we can see the following (a short code sketch of a single LSTM step follows this list):

* **Forget Gate:** The forget gate $$f_t$$ at time step $$t$$ controls how much information needs to be "removed" from the previous cell state $$c_{t-1}$$. This gate learns to erase hidden representations from previous time steps, which is why the LSTM keeps two hidden representations: the hidden state $$h_t$$ and the cell state $$c_t$$. The cell state $$c_t$$ is propagated over time and learns whether or not to forget the previous cell state.
* **Input Gate:** The input gate $$i_t$$ at time step $$t$$ controls how much information needs to be "added" to the next cell state $$c_t$$ from the previous hidden state $$h_{t-1}$$ and the input $$x_t$$. Instead of tanh, the "input" gate $$i$$ uses a sigmoid function, which converts its inputs to values between zero and one. This serves as a switch, where values are almost always either close to zero or close to one. The "input" gate decides whether to take the candidate content produced by the "gate" gate $$g$$: that candidate is multiplied element-wise by $$i$$ before being written into the cell state.
* **Output Gate:** The output gate $$o_t$$ at time step $$t$$ controls how much information needs to be "shown" as output in the current hidden state $$h_t$$.
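
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM step that follows the update rules above. Treat it as an illustration rather than a reference implementation: the function name `lstm_step` and the layout that stacks all four gates into one $$4H$$-wide weight matrix are assumptions made here for compactness.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step for a batch of N inputs.

    x_t:    (N, D)  input at time t
    h_prev: (N, H)  previous hidden state
    c_prev: (N, H)  previous cell state
    Wx:     (D, 4H) input-to-hidden weights (i, f, o, g stacked)
    Wh:     (H, 4H) hidden-to-hidden weights
    b:      (4H,)   biases
    """
    H = h_prev.shape[1]
    # A single affine transform produces the pre-activations of all four gates.
    a = x_t @ Wx + h_prev @ Wh + b
    i = sigmoid(a[:, 0*H:1*H])   # input gate: how much new content to write
    f = sigmoid(a[:, 1*H:2*H])   # forget gate: how much of c_prev to keep
    o = sigmoid(a[:, 2*H:3*H])   # output gate: how much of the cell to reveal
    g = np.tanh(a[:, 3*H:4*H])   # "gate" gate: candidate new cell content
    c_t = f * c_prev + i * g     # erase with f, write new content scaled by i
    h_t = o * np.tanh(c_t)       # expose a gated view of the cell state
    return h_t, c_t
```

Because $$f$$, $$i$$, and $$o$$ are element-wise multipliers in $$[0, 1]$$, each one acts as a soft switch on its respective term, which is exactly the "remove", "add", and "show" behavior described in the list above.
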
The key idea of the LSTM is the cell state, the horizontal line running between recurrent timesteps. You can imagine the cell state as a kind of highway of information passing straight down the entire chain, with only some minor linear interactions. With the formulation above, it is easy for information to just flow along this highway (Figure 5). Thus, even when many LSTM cells are stacked together, we get an uninterrupted gradient flow: the gradients flow back through the cell states rather than the hidden states $$h$$, without vanishing at every time step.

This greatly mitigates the gradient vanishing/exploding problem we outlined above. Figure 5 also shows that the gradient contains a vector of activations of the "forget" gate, which allows finer control of the gradient values through suitable parameter updates of the "forget" gate.
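
To see why the cell-state path behaves this way, differentiate the cell update $$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$ with respect to the previous cell state. The sketch below treats the gate activations as constants with respect to $$c_{t-1}$$, which ignores their indirect dependence through $$h_{t-1}$$ but captures the dominant path:

$$
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
\qquad\Longrightarrow\qquad
\frac{\partial c_T}{\partial c_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
$$

Unlike the vanilla RNN, where backpropagation through time repeatedly multiplies by the same weight matrix and the derivative of tanh, this product is just a chain of forget-gate activations, so as long as the network learns to keep $$f_t$$ close to 1 the gradient can travel across many time steps without vanishing.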