* **Vanishing gradient:** We see that $$\tanh'(W_{hh}h_{t-1} + W_{xh}x_t)$$ is almost always less than 1: since tanh is always between negative one and one, its derivative $$1 - \tanh^2(\cdot)$$ lies in $$(0, 1]$$. Thus, as $$t$$ gets larger (i.e. over longer timesteps), the gradient $$\frac{\partial L_{t}}{\partial W}$$ decreases in value and gets close to zero.
This leads to the vanishing gradient problem, where gradients from later time steps barely impact the gradient at the very first time step. This is problematic when we model long sequences of inputs, because the corresponding updates become extremely slow.
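
As a rough illustration (not code from this post), the NumPy sketch below propagates a gradient backwards through many tanh RNN steps and prints its shrinking norm. The weight scale, sequence length, and variable names (`W_hh`, `W_xh`) are arbitrary choices made only so the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, seq_len = 8, 50
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))  # recurrent weights
W_xh = rng.normal(scale=0.1, size=(hidden, hidden))  # input weights
xs = rng.normal(size=(seq_len, hidden))

# Forward pass: keep the pre-activations so tanh'(.) can be evaluated later.
h, pre_acts = np.zeros(hidden), []
for x in xs:
    z = W_hh @ h + W_xh @ x
    pre_acts.append(z)
    h = np.tanh(z)

# Backward pass: dL/dh_{t-1} = W_hh^T (tanh'(z_t) * dL/dh_t), applied repeatedly.
grad = np.ones(hidden) / np.sqrt(hidden)  # unit-norm gradient at the last step
for z in reversed(pre_acts):
    grad = W_hh.T @ ((1.0 - np.tanh(z) ** 2) * grad)

print(f"gradient norm after {seq_len} steps: {np.linalg.norm(grad):.2e}")
```

With small recurrent weights like these, the printed norm is a tiny number: each step multiplies the gradient by another factor smaller than 1.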
where $$\odot$$ is the element-wise (Hadamard) product. $$g_t$$ in the formulas above is an intermediate calculation that is cached and later used with the $$o$$ gate.
Since all $$f, i, o$$ gate vector values range from 0 to 1, because they are squashed by the sigmoid function $$\sigma$$, when multiplied element-wise we can see that (a short code sketch of one full update follows this list):
* Forget gate $$f_t$$ at time step $$t$$ controls how much information needs to be "removed" from the previous cell state $$c_{t-1}$$.
* Input gate $$i_t$$ at time step $$t$$ controls how much information needs to be "added" to the next cell state $$c_t$$ from the previous hidden state $$h_{t-1}$$ and the input $$x_t$$.
* Output gate $$o_t$$ at time step $$t$$ controls how much information needs to be "shown" as output in the current hidden state $$h_t$$.
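
To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step in this standard formulation. The parameter names `W`, `U`, `b`, the stacked-gate layout, and the helper `lstm_step` are illustrative assumptions, not code from this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step mirroring the gate equations above.

    W maps the input, U maps the previous hidden state, and b is a bias; all
    three stack the four gates (f, i, o, g) along their first axis. These
    names and shapes are assumptions made for this sketch.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b      # stacked pre-activations
    f_t = sigmoid(z[0 * n:1 * n])     # forget gate: how much of c_{t-1} to keep
    i_t = sigmoid(z[1 * n:2 * n])     # input gate: how much new information to add
    o_t = sigmoid(z[2 * n:3 * n])     # output gate: how much of the cell state to show
    g_t = np.tanh(z[3 * n:4 * n])     # intermediate candidate values
    c_t = f_t * c_prev + i_t * g_t    # mostly additive cell-state update
    h_t = o_t * np.tanh(c_t)          # hidden state exposed as output
    return h_t, c_t

# Tiny usage example with random parameters.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):  # run five timesteps
    h, c = lstm_step(x, h, c, W, U, b)
```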
The key idea of the LSTM is the cell state, the horizontal line running between recurrent timesteps. You can think of the cell state as a kind of highway of information passing straight down the entire chain, with only some minor linear interactions. With the formulation above, it is easy for information to simply flow along this highway (Figure 5). This largely mitigates the vanishing/exploding gradient problem outlined above.
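
To see why the highway helps, consider the gradient of $$c_t$$ with respect to $$c_{t-1}$$. Assuming the standard additive update $$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$ from the formulas above, and treating the gate activations as approximately constant (they also depend on $$h_{t-1}$$, so this is a simplification):

$$
\frac{\partial c_t}{\partial c_{t-1}} \approx \operatorname{diag}(f_t)
$$

So when the forget gate stays close to 1, the gradient flows backwards through the cell state almost unchanged, instead of being multiplied by a fresh $$\tanh'(\cdot)$$ factor at every step as in the vanilla RNN.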