
Commit 163c727

RNN update
1 parent 4268b1c

2 files changed: +71 lines, -7 lines

rnn.md

Lines changed: 71 additions & 7 deletions
@@ -62,19 +62,19 @@ $$
h_t = f_W(h_{t-1}, x_t)
$$

where at every timestep it receives the previous state as a vector $$h_{t-1}$$ from the previous
timestep $$t-1$$ and the current input vector $$x_t$$, and produces the current state as a vector
$$h_t$$. A fixed function $$f_W$$ with weights $$W$$ is applied at every single timestep, and that
allows us to use the Recurrent Neural Network on sequences without having to commit to the size of
the sequence, because the exact same function is applied at every single timestep, no matter how
long the input or output sequences are.
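
To make the weight sharing concrete, here is a minimal sketch (my own, not from the original notes) of driving one fixed step function over an arbitrarily long sequence; `run_rnn` and the toy `f_W` are illustrative names, not part of any library.

```python
import numpy as np

def run_rnn(f_W, h0, xs):
    """Apply the same fixed function f_W at every timestep; works for any sequence length."""
    h = h0
    history = []
    for x_t in xs:
        h = f_W(h, x_t)   # h_t = f_W(h_{t-1}, x_t): the same function, hence the same weights, at every step
        history.append(h)
    return history

# Toy usage with a stand-in f_W just to show the interface; the concrete vanilla RNN
# form of f_W is given in the text below.
states = run_rnn(lambda h, x: np.tanh(h + x), np.zeros(3), [np.ones(3)] * 5)
```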

In the simplest form of an RNN, which we call a Vanilla RNN, the network is just a single hidden
state $$h$$, and we use a recurrence formula that tells us how we should update the hidden
state $$h$$ as a function of the previous hidden state $$h_{t-1}$$ and the current input $$x_t$$. In
particular, we have weight matrices $$W_{hh}$$ and $$W_{xh}$$ that project both the hidden
state $$h_{t-1}$$ from the previous timestep and the current input $$x_t$$; those projections are then
summed and squashed with a $$tanh$$ function to produce the updated hidden state $$h_t$$ at timestep
$$t$$. This recurrence tells us how $$h$$ changes as a function of its history and the current input
at this timestep:
@@ -83,7 +83,7 @@ $$
h_t = tanh(W_{hh}h_{t-1} + W_{xh}x_t)
$$
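
In numpy, the update above is a single line. This is just a sketch with placeholder sizes and random weights, not code from the original post.

```python
import numpy as np

# Example sizes chosen only for illustration: a 3-dimensional hidden state, 4-dimensional input.
hidden_size, input_size = 3, 4
rng = np.random.default_rng(0)
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))  # projects the previous hidden state
W_xh = 0.1 * rng.standard_normal((hidden_size, input_size))   # projects the current input

h_prev = np.zeros(hidden_size)   # h_{t-1}
x_t = np.zeros(input_size)
x_t[0] = 1.0                     # some current input x_t

# The recurrence above, line for line: sum the two projections and squash with tanh.
h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
```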

We can base predictions on top of $$h_t$$ by using just another matrix projection on top
of the hidden state. This is the simplest complete case in which you can wire up a neural network:

$$
@@ -100,6 +100,70 @@ with semantics in the following section.

## RNN example as Character-level language model

One of the simplest ways in which we can use an RNN is as a character-level language model,
since it's intuitive to understand. The way this RNN will work is that we will feed a sequence of
characters into the RNN and at every single timestep we will ask the RNN to predict the next
character in the sequence. The prediction of the RNN will be a distribution of scores over the
characters in the vocabulary for what the RNN thinks should come next in the sequence it has seen
so far.

So suppose, in a very simple example (Figure 3), we have a training sequence of just one string
$$\text{"hello"}$$, and a vocabulary $$V = \{\text{"h"}, \text{"e"}, \text{"l"}, \text{"o"}\}$$ of 4
characters in the entire dataset. We are going to try to get an RNN to learn to predict the next
character in the sequence on this training data.

<div class="fig figcenter fighighlight">
<img src="/assets/rnn/char_level_language_model.png" width="60%" >
<div class="figcaption">Figure 3. Simplified Character-level Language Model RNN.</div>
</div>

As shown in Figure 3, we'll feed one character at a time into the RNN: first $$\text{"h"}$$, then
$$\text{"e"}$$, then $$\text{"l"}$$, and finally $$\text{"l"}$$. Each character is encoded as what's
called a one-hot vector, where only the single bit corresponding to that character's position in
the vocabulary is turned on. For example:

$$
\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \text{"h"}\ \
\begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \text{"e"}\ \
\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix} = \text{"l"}\ \
\begin{bmatrix}0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \text{"o"}
$$
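
A small sketch (assuming the vocabulary ordering used in the vectors above) of building these one-hot encodings:

```python
import numpy as np

# Assumed ordering, matching the vectors above: index 0 -> "h", 1 -> "e", 2 -> "l", 3 -> "o".
vocab = ["h", "e", "l", "o"]
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return the one-hot vector for a character: all zeros except a 1 at its vocabulary index."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

inputs = [one_hot(ch) for ch in "hell"]   # the four inputs fed into the RNN in Figure 3
print(one_hot("h"))                        # [1. 0. 0. 0.]
```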

Then we're going to use the recurrence formula from the previous section at every single timestep.
Suppose we start off with $$h$$ as a vector of size 3 with all zeros. By applying this fixed
recurrence formula, we end up with a 3-dimensional representation of the next hidden state $$h$$
that at any point in time summarizes all the characters that have come before it:

$$
\begin{aligned}
\begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0 \\ 0 \\ 0 \end{bmatrix} + W_{xh}\begin{bmatrix}1 \\ 0 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(1) \\
\begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0.3 \\ -0.1 \\ 0.9 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 1 \\ 0 \\ 0 \end{bmatrix}) \ \ \ \ &(2) \\
\begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}1.0 \\ 0.3 \\ 0.1 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(3) \\
\begin{bmatrix}-0.3 \\ 0.9 \\ 0.7 \end{bmatrix} &= tanh(W_{hh}\begin{bmatrix}0.1 \\ -0.5 \\ -0.3 \end{bmatrix} + W_{xh}\begin{bmatrix}0 \\ 0 \\ 1 \\ 0 \end{bmatrix}) \ \ \ \ &(4)
\end{aligned}
$$
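
Here is a sketch of the same four-step forward pass in numpy; the weights are random placeholders, so the hidden states will not reproduce the exact numbers in equations (1)-(4), only the shapes and the flow of information.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, vocab_size = 3, 4
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))   # hidden -> hidden
W_xh = 0.1 * rng.standard_normal((hidden_size, vocab_size))    # input -> hidden

vocab = ["h", "e", "l", "o"]
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(vocab_size)
    v[char_to_ix[ch]] = 1.0
    return v

h = np.zeros(hidden_size)            # h starts as a vector of size 3 with all zeros
for ch in "hell":
    h = np.tanh(W_hh @ h + W_xh @ one_hot(ch))   # same update as equations (1)-(4)
    print(ch, h)                     # each h summarizes all the characters seen so far
```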

As we apply this recurrence, we're going to predict what the next character in the sequence should
be at every timestep. Since we have four characters in the vocabulary $$V$$, we're going to predict
a 4-dimensional vector of logits at every single timestep.

As shown in Figure 3, in the very first timestep we fed in $$\text{"h"}$$, and the RNN with its
current setting of weights computed a vector of logits:

$$
\begin{bmatrix}1.0 \\ 2.2 \\ -3.0 \\ 4.1 \end{bmatrix} \rightarrow \begin{bmatrix}\text{"h"} \\ \text{"e"} \\ \text{"l"}\\ \text{"o"} \end{bmatrix}
$$
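
The logits are produced by projecting the hidden state back to vocabulary size with one more matrix. In the sketch below `W_hy` is a hypothetical name for that output matrix (the "another matrix projection" mentioned earlier), and its random values are placeholders rather than the weights behind Figure 3.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, vocab_size = 3, 4
W_hy = rng.standard_normal((vocab_size, hidden_size))  # hidden state -> vocabulary scores

h_1 = np.array([0.3, -0.1, 0.9])   # hidden state after feeding "h" (from equation (1))
logits = W_hy @ h_1                 # 4 scores, one per character in the vocabulary
print(logits)                       # the RNN's scores for "h", "e", "l", "o" coming next
```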

where the RNN thinks that $$\text{"h"}$$ is $$1.0$$ likely to come next, $$\text{"e"}$$ is $$2.2$$
likely, $$\text{"l"}$$ is $$-3.0$$ likely, and $$\text{"o"}$$ is $$4.1$$ likely. In this case, the
RNN incorrectly suggests that $$\text{"o"}$$ should come next, as its score of $$4.1$$ is the
highest. However, we know that in this training sequence $$\text{"e"}$$ should follow $$\text{"h"}$$,
so in fact the score of $$2.2$$ is the correct answer (highlighted in green in Figure 3), and we want
that score to be high and all the other scores to be low. At every single timestep we have a target
for which character should come next in the sequence, so the error signal is backpropagated as a
gradient of the loss function through the connections. As a loss function we could choose a softmax
classifier, for example, so all those losses flow down from the top backwards to compute the
gradients on all the weight matrices, which tell us how to shift the matrices so that the correct
probabilities come out of the RNN. Similarly, we can imagine how to scale up training of the model
over a larger training dataset.
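
As a concrete illustration of that loss at the first timestep, here is a sketch of the softmax classifier loss and its gradient with respect to the logits, using the scores from Figure 3 and the target $$\text{"e"}$$ (my example, not code from the original):

```python
import numpy as np

logits = np.array([1.0, 2.2, -3.0, 4.1])   # scores for "h", "e", "l", "o" at the first timestep
target = 1                                  # index of "e", the correct next character

# Softmax turns scores into probabilities; the loss is the negative log-probability of the target.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target])

# Gradient of the loss with respect to the logits: push the correct score up, the others down.
dlogits = probs.copy()
dlogits[target] -= 1.0
print(loss, dlogits)
```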