
Commit f285a6c

Merge pull request #3 from sandra-haerin-ha/rnn
Rnn
2 parents: af5b0b6 + 3369451 · commit f285a6c


5 files changed: +139 −30 lines


Gemfile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+source 'https://rubygems.org'
+
+gem 'jekyll'

Gemfile.lock

Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+GEM
+  remote: https://rubygems.org/
+  specs:
+    addressable (2.7.0)
+      public_suffix (>= 2.0.2, < 5.0)
+    colorator (1.1.0)
+    concurrent-ruby (1.1.9)
+    em-websocket (0.5.2)
+      eventmachine (>= 0.12.9)
+      http_parser.rb (~> 0.6.0)
+    eventmachine (1.2.7)
+    ffi (1.15.1)
+    forwardable-extended (2.6.0)
+    http_parser.rb (0.6.0)
+    i18n (1.8.10)
+      concurrent-ruby (~> 1.0)
+    jekyll (4.2.0)
+      addressable (~> 2.4)
+      colorator (~> 1.0)
+      em-websocket (~> 0.5)
+      i18n (~> 1.0)
+      jekyll-sass-converter (~> 2.0)
+      jekyll-watch (~> 2.0)
+      kramdown (~> 2.3)
+      kramdown-parser-gfm (~> 1.0)
+      liquid (~> 4.0)
+      mercenary (~> 0.4.0)
+      pathutil (~> 0.9)
+      rouge (~> 3.0)
+      safe_yaml (~> 1.0)
+      terminal-table (~> 2.0)
+    jekyll-sass-converter (2.1.0)
+      sassc (> 2.0.1, < 3.0)
+    jekyll-watch (2.2.1)
+      listen (~> 3.0)
+    kramdown (2.3.1)
+      rexml
+    kramdown-parser-gfm (1.1.0)
+      kramdown (~> 2.0)
+    liquid (4.0.3)
+    listen (3.5.1)
+      rb-fsevent (~> 0.10, >= 0.10.3)
+      rb-inotify (~> 0.9, >= 0.9.10)
+    mercenary (0.4.0)
+    pathutil (0.16.2)
+      forwardable-extended (~> 2.6)
+    public_suffix (4.0.6)
+    rb-fsevent (0.11.0)
+    rb-inotify (0.10.1)
+      ffi (~> 1.0)
+    rexml (3.2.3.1)
+    rouge (3.26.0)
+    safe_yaml (1.0.5)
+    sassc (2.4.0)
+      ffi (~> 1.9)
+    terminal-table (2.0.0)
+      unicode-display_width (~> 1.1, >= 1.1.1)
+    unicode-display_width (1.7.0)
+
+PLATFORMS
+  ruby
+
+DEPENDENCIES
+  jekyll
+
+BUNDLED WITH
+   2.1.4

assets/rnn/UnrolledRNN.png

138 KB

assets/rnn/unrolledRNN.png

110 KB

rnn.md

Lines changed: 69 additions & 30 deletions
@@ -6,7 +6,7 @@ permalink: /rnn/
 
 Table of Contents:
 
-- [Intro to RNN](#intro)
+- [Introduction to RNN](#intro)
 - [RNN example as Character-level language model](#char)
 - [Multilayer RNNs](#multi)
 - [Long-Short Term Memory (LSTM)](#lstm)
@@ -16,44 +16,55 @@ Table of Contents:
 
 <a name='intro'></a>
 
-## Intro to RNN
+## Introduction to RNN
 
 In this lecture note, we're going to be talking about the Recurrent Neural Networks (RNNs). One
 great thing about the RNNs is that they offer a lot of flexibility on how we wire up the neural
 network architecture. Normally when we're working with neural networks (Figure 1), we are given a fixed sized
-input vector (red), then we process it with some hidden layers (green), and then we produce a
-fixed sized output vector (blue) as depicted in the leftmost model in Figure 1. Recurrent Neural
-Networks allow us to operate over sequences of input, output, or both at the same time. For
-example, in the case of image captioning, we are given a fixed sized image and then through an RNN
-we produce a sequence of words that describe the content of that image (second model in Figure 1).
-Or for example, in the case of sentiment classification in the NLP, we are given a sequence of words
-of the sentence and then we are trying to classify whether the sentiment of that sentence is
-positive or negative (third model in Figure 1). In the case of machine translation, we can have an
-RNN that takes a sequence of words of a sentence in English, and then this RNN is asked to produce
-a sequence of words of a sentence in French, for example (forth model in Figure 1). As a last case,
-we can have a video classification RNN where we might imagine classifying every single frame of
-video with some number of classes, and most importantly we don't want the prediction to be only a
-function of the current timestep (current frame of the video), but also all the timesteps (frames)
-that have come before it in the video (rightmost model in Figure 1). In general Recurrent Neural
-Networks allow us to wire up an architecture, where the prediction at every single timestep is a
-function of all the timesteps that have come up to that point.
+input vector (red), then we process it with some hidden layers (green), and we produce a
+fixed sized output vector (blue) as depicted in the leftmost model ("Vanilla" Neural Networks) in Figure 1.
+While **"Vanilla" Neural Networks** receive a single input and produce one label for that image, there are tasks where
+the model produces a sequence of outputs as shown in the one-to-many model in Figure 1. **Recurrent Neural Networks** allow
+us to operate over sequences of input, output, or both at the same time.
+* An example of a **one-to-many** model is image captioning, where we are given a fixed sized image and produce a sequence of words that describe the content of that image through an RNN (second model in Figure 1).
+* An example of a **many-to-one** task is action prediction, where we look at a sequence of video frames instead of a single image and produce
+a label of what action was happening in the video, as shown in the third model in Figure 1. Another example of a many-to-one task is
+sentiment classification in NLP, where we are given a sequence of words of a sentence and then classify what sentiment (e.g. positive or negative) that sentence carries.
+* An example of a **many-to-many** task is video captioning, where the input is a sequence of video frames and the output is a caption that describes
+what was in the video, as shown in the fourth model in Figure 1. Another example of a many-to-many task is machine translation in NLP, where we can have an
+RNN that takes a sequence of words of a sentence in English, and then this RNN is asked to produce a sequence of words of a sentence in French.
+* There is also a **variation of the many-to-many** task, as shown in the last model in Figure 1,
+where the model generates an output at every timestep. An example of this many-to-many task is video classification on a frame level,
+where the model classifies every single frame of video with some number of classes. We should note that we don't want
+this prediction to be only a function of the current timestep (current frame of the video), but also of all the timesteps (frames)
+that have come before it in the video.
+
+In general, RNNs allow us to wire up an architecture where the prediction at every single timestep is a
+function of all the timesteps that have come before.
 
 <div class="fig figcenter fighighlight">
 <img src="/assets/rnn/types.png" width="100%">
-<div class="figcaption">Figure 1. Different (non-exhaustive) types of Recurrent Neural Network architectures. Red boxes are input vectors. Green boxes are hidden layers. Blue boxes are output vectors.</div>
+<div class="figcaption"><b>Figure 1.</b> Different (non-exhaustive) types of Recurrent Neural Network architectures. Red boxes are input vectors. Green boxes are hidden layers. Blue boxes are output vectors.</div>
 </div>
 
-A Recurrent Neural Network is basically a blackbox (Figure 2), where it has a state and it receives through
-timesteps input vectors. At every single timestep we feed in an input vectors into the RNN and it
-can modify that state as a function of what it receives at every single timestep. There are weights
-inside the RNN and when we tune those weights, the RNN will have a different behavior in terms of
-how its state evolves, as it receives these inputs. Usually we are also interested in producing an
-output based on the RNN state, so we can produce these output vectors on top of the RNN (as depicted
-in Figure 2).
+### Why are existing convnets insufficient?
+The existing convnets are insufficient to deal with tasks that have inputs and outputs with variable sequence lengths.
+In the example of video captioning, inputs have a variable number of frames (e.g. a 10-minute vs. a 10-hour-long video) and outputs are captions
+of variable length. Convnets can only take in inputs with a fixed width and height and cannot generalize over
+inputs of different sizes. In order to tackle this problem, we introduce Recurrent Neural Networks (RNNs).
+
+### Recurrent Neural Network
+An RNN is basically a blackbox (Left of Figure 2) that has an “internal state” which is updated as a sequence is processed. At every single timestep, we feed an input vector into the RNN, and it modifies that state as a function of what it receives. When we tune the RNN weights,
+the RNN will show different behaviors in terms of how its state evolves as it receives these inputs.
+We are also interested in producing an output based on the RNN state, so we can produce these output vectors on top of the RNN (as depicted in Figure 2).
+
+If we unroll an RNN model (Right of Figure 2), then there are inputs (e.g. video frames) at different timesteps shown as $$x_1, x_2, x_3$$ ... $$x_t$$.
+At each timestep the RNN takes in two inputs -- an input frame ($$x_i$$) and a representation of what it has seen so far (i.e. its history) -- to generate an output $$y_i$$ and update its history, which gets propagated forward over time. All the RNN blocks in Figure 2 (Right) are the same block sharing the same parameters, but with different inputs and history at each timestep.
 
 <div class="fig figcenter fighighlight">
-<img src="/assets/rnn/rnn_blackbox.png" width="20%" >
-<div class="figcaption">Figure 2. Simplified RNN box.</div>
+<img src="/assets/rnn/rnn_blackbox.png" width="16%" >
+<img src="/assets/rnn/unrolledRNN.png" width="60%" >
+<div class="figcaption"><b>Figure 2.</b> Simplified RNN box (Left) and Unrolled RNN (Right).</div>
 </div>
 
 More precisely, RNN can be represented as a recurrence formula of some function $$f_W$$ with
@@ -65,8 +76,7 @@ $$
 
 where at every timestep it receives some previous state as a vector $$h_{t-1}$$ of previous
 iteration timestep $$t-1$$ and current input vector $$x_t$$ to produce the current state as a vector
-$$h_t$$. A fixed function $$f_W$$ with
-weights $$W$$ is applied at every single timestep and that allows us to use
+$$h_t$$. A fixed function $$f_W$$ with weights $$W$$ is applied at every single timestep and that allows us to use
 the Recurrent Neural Network on sequences without having to commit to the size of the sequence because
 we apply the exact same function at every single timestep, no matter how long the input or output
 sequences are.
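
To make the weight-sharing idea concrete, here is a minimal NumPy sketch (not from the committed note) of applying this recurrence over a sequence. The tanh update used for $$f_W$$ is the vanilla RNN choice described later in the note; the sizes, the initialization, and the helper names `step` and `rnn_forward` are assumptions made only for illustration.

```python
import numpy as np

# Minimal sketch of the recurrence h_t = f_W(h_{t-1}, x_t) with shared weights W.
# The tanh update below is the vanilla RNN choice of f_W; all names and sizes here
# are illustrative assumptions, not part of the original note.
np.random.seed(0)
D, H = 10, 4                           # input size, hidden size (arbitrary)
W_xh = np.random.randn(H, D) * 0.01
W_hh = np.random.randn(H, H) * 0.01
W_hy = np.random.randn(3, H) * 0.01    # maps the hidden state to a 3-dim output

def step(h_prev, x_t):
    """One application of the fixed function f_W with the shared weights."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t                   # output read off the hidden state
    return h_t, y_t

def rnn_forward(xs):
    """Apply the exact same step at every timestep, for any sequence length."""
    h = np.zeros(H)
    ys = []
    for x_t in xs:                     # xs can hold 3 frames or 3000
        h, y_t = step(h, x_t)
        ys.append(y_t)
    return h, ys

h_final, ys = rnn_forward([np.random.randn(D) for _ in range(7)])
print(len(ys), h_final.shape)          # 7 outputs, hidden state of shape (4,)
```

Because the very same `step` (and the same weights) is reused at every timestep, the loop runs unchanged no matter how long the sequence is.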
@@ -196,3 +206,32 @@ are trained jointly, and the diagram in Figure 4 represents one computational gr
 So far we have seen only a simple recurrence formula for the Vanilla RNN. In practice, we actually will
 rarely ever use Vanilla RNN formula. Instead, we will use what we call a Long-Short Term Memory (LSTM)
 RNN.
+
+### Vanilla RNN Gradient Flow
+An RNN block takes in an input $$x_t$$ and the previous hidden representation $$h_{t-1}$$ and learns a transformation, which is then passed through a tanh to produce the hidden representation $$h_{t}$$ for the next time step and the output $$y_{t}$$, as shown in the equation below.
+
+$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
+
+For backpropagation, let's examine how the output at the very last timestep affects the weights at the very first timestep.
+The partial derivative of $$h_t$$ with respect to $$h_{t-1}$$ is written as:
+
+$$ \frac{\partial h_t}{\partial h_{t-1}} = \tanh'(W_{hh}h_{t-1} + W_{xh}x_t)W_{hh} $$
+
+We update the weights $$W$$ by taking the derivative of the loss at the very last time step, $$L_{t}$$, with respect to $$W$$:
+
+$$
+\begin{aligned}
+\frac{\partial L_{t}}{\partial W} &= \frac{\partial L_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{t-1}} \dots \frac{\partial h_{1}}{\partial W} \\
+&= \frac{\partial L_{t}}{\partial h_{t}}\left(\prod_{t=2}^{T} \frac{\partial h_{t}}{\partial h_{t-1}}\right)\frac{\partial h_{1}}{\partial W} \\
+&= \frac{\partial L_{t}}{\partial h_{t}}\left(\prod_{t=2}^{T} \tanh'(W_{hh}h_{t-1} + W_{xh}x_t)\right)W_{hh}^{T-1}\frac{\partial h_{1}}{\partial W}
+\end{aligned}
+$$
+
+* Vanishing gradient: We see that $$\tanh'(W_{hh}h_{t-1} + W_{xh}x_t)$$ will almost always be less than 1, because $$\tanh'(x) = 1 - \tanh^2(x)$$ lies between 0 and 1 and equals 1 only at $$x = 0$$.
+Thus, as $$t$$ gets larger (i.e. over longer timesteps), the gradient $$\frac{\partial L_{t}}{\partial W}$$ will decrease in value and get close to zero.
+This leads to the vanishing gradient problem, where the loss at later time steps barely affects the gradients at the very first time step. This is problematic when we model long sequences of inputs, because the updates become extremely slow.
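
As a rough numerical sketch of this effect (not from the original note; the hidden size, the small weight scale, and the 50-step horizon are arbitrary assumptions), we can accumulate the per-step Jacobians $$\frac{\partial h_t}{\partial h_{t-1}}$$ and watch the norm of their product shrink toward zero:

```python
import numpy as np

# Illustrative sketch only: multiply the per-step Jacobians
# dh_t/dh_{t-1} = diag(tanh'(a_t)) W_hh over many timesteps and
# watch the accumulated gradient norm decay. Sizes and scales are arbitrary.
np.random.seed(1)
H, D = 8, 8
W_hh = np.random.randn(H, H) * 0.1     # small recurrent weights
W_xh = np.random.randn(H, D) * 0.1

h = np.zeros(H)
grad = np.eye(H)                        # accumulates prod_t dh_t/dh_{t-1}
for t in range(1, 51):
    a = W_hh @ h + W_xh @ np.random.randn(D)
    h = np.tanh(a)
    jac = np.diag(1.0 - h**2) @ W_hh    # tanh'(a) = 1 - tanh(a)^2, always <= 1
    grad = jac @ grad
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))  # norm shrinks toward zero as t grows
```

The printed norm decays rapidly with $$t$$, which is exactly the vanishing-gradient behavior described above.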
+
+
+Now, of course, you might ask: what if we just got rid of this nonlinearity?
+
+
+### Gradient Flow
+
+### Vanishing gradient problem
