PixelRNN
========

We now give a brief overview of PixelRNN. PixelRNNs belong to a family
of explicit density models called **fully visible belief networks
(FVBN)**. We can represent our model with the following equation:
$$p(x) = p(x_1, x_2, \dots, x_n),$$ where the left-hand side $p(x)$
represents the likelihood of an entire image $x$, and the right-hand
side represents the joint likelihood of each pixel in the image. Using
the chain rule, we can decompose this likelihood into a product of
1-dimensional distributions:
$$p(x) = \prod_{i = 1}^n p(x_i \mid x_1, \dots, x_{i - 1}).$$ By
maximizing this likelihood over the training data, we obtain the
PixelRNN model.
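
To make the factorization concrete, here is a minimal Python sketch of
how the log-likelihood of an image decomposes into a sum of per-pixel
conditional log-probabilities. `model_conditional` is a hypothetical
stand-in for whatever network computes
$p(x_i \mid x_1, \dots, x_{i - 1})$; it is not part of the original
formulation:

```python
import numpy as np

def image_log_likelihood(pixels, model_conditional):
    """Sum the log-probability of each pixel given all pixels before it.

    pixels:            1D array of pixel values x_1 ... x_n in raster order.
    model_conditional: hypothetical function (previous_pixels, value) ->
                       p(x_i | x_1, ..., x_{i-1}).
    """
    log_p = 0.0
    for i in range(len(pixels)):
        # The product over i in the equation becomes a sum of logs.
        log_p += np.log(model_conditional(pixels[:i], pixels[i]))
    return log_p
```

Training then amounts to maximizing this quantity (equivalently,
minimizing the negative log-likelihood) over the training set.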

Introduction
============

PixelRNN, first introduced in van den Oord et al. 2016, uses an
RNN-like structure, modeling the pixels one by one, to maximize the
likelihood function given above. One of the more difficult tasks in
generative modeling is to create a model that is tractable, and
PixelRNN seeks to address that: it tractably models the joint
distribution of the pixels in the image by casting it as a product of
conditional distributions. The factorization turns the joint modeling
problem into a sequence problem, in which we predict the next pixel
given all the previously generated pixels. Recurrent Neural Networks
are a natural fit for this task, since they learn sequentially. More
precisely, we generate image pixels starting from the top left corner,
and we model each pixel’s dependency on previous pixels using an RNN
(LSTM).

<div class="fig figcenter fighighlight">
<img src="/assets/pixelrnn.png">
</div>

Specifically, the PixelRNN framework is made up of twelve
two-dimensional LSTM layers, with convolutions applied to each
dimension of the data. There are two types of layers here. One is the
Row LSTM layer, where the convolution is applied along each row. The
second type is the Diagonal BiLSTM layer, where the convolution is
applied to the diagonals of the image. In addition, the pixel values
are modeled as discrete values using a multinomial distribution
implemented with a softmax layer. This is in contrast to many previous
approaches, which model pixels as continuous values.
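
As a rough illustration of this discrete output layer (a sketch, not
the paper's implementation): the network emits 256 logits for a pixel,
and a softmax turns them into a multinomial distribution over
intensity values that can be sampled from or scored:

```python
import numpy as np

def pixel_distribution(logits):
    """Softmax over 256 intensity levels: a multinomial over pixel values."""
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()               # probabilities over values 0..255

rng = np.random.default_rng(0)
probs = pixel_distribution(rng.normal(size=256))  # stand-in logits
sampled_value = rng.choice(256, p=probs)          # draw a pixel value
log_prob_of_42 = np.log(probs[42])                # score an observed value
```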

Model
=====

The approach of the PixelRNN is as follows. The RNN scans each
individual pixel, going row-wise, predicting the conditional
distribution over the possible pixel values given the context the
network has seen so far. As mentioned before, PixelRNN uses a
two-dimensional LSTM network which begins scanning at the top left of
the image and makes its way to the bottom right. One of the reasons an
LSTM is used is that it can better capture the longer-range
dependencies between pixels - this is essential for understanding
image composition. The reason a two-dimensional structure is used is
to ensure that the signals propagate well in both the left-to-right
and top-to-bottom directions.

The input image to the network is represented by a 1D vector of pixel
values $\{x_1, \dots, x_{n^2}\}$ for an $n$-by-$n$ sized image, where
$\{x_1, \dots, x_{n}\}$ represents the pixels from the first row. Our
goal is to use these pixel values to find a probability distribution
$p(x)$ for each image $x$. We define this probability as:
$$p(x) = \prod_{i = 1}^{n^2} p(x_i \mid x_1, \dots, x_{i - 1}).$$

This is the product of the conditional distributions across all the
pixels in the image - for pixel $x_i$, we have
$p(x_i \mid x_1, \dots, x_{i - 1})$. In turn, each of these
conditional distributions is determined by three values, associated
with each of the color channels present in the image (red, green and
blue). In other words:

$$p(x_i \mid x_1, \dots, x_{i - 1}) = p(x_{i,R} \mid \textbf{x}_{<i}) \cdot p(x_{i,G} \mid \textbf{x}_{<i}, x_{i,R}) \cdot p(x_{i,B} \mid \textbf{x}_{<i}, x_{i,R}, x_{i,G}).$$
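
To make the channel ordering concrete, the sketch below shows what
generation looks like under this factorization: pixels are visited in
raster order, and within each pixel the red, green and blue values are
sampled one after another, each conditioned on everything generated so
far. `predict_channel` is a hypothetical stand-in for the network's
conditional distribution:

```python
import numpy as np

def sample_image(predict_channel, n, rng):
    """Generate an n-by-n RGB image one pixel, one channel at a time.

    predict_channel: hypothetical function (image_so_far, row, col, channel)
                     -> 256 probabilities for that channel's value.
    """
    image = np.zeros((n, n, 3), dtype=np.int64)
    for row in range(n):              # top to bottom
        for col in range(n):          # left to right within each row
            for ch in range(3):       # R, then G given R, then B given R, G
                probs = predict_channel(image, row, col, ch)
                image[row, col, ch] = rng.choice(256, p=probs)
    return image
```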

In the next section we will see how these distributions are calculated
and used within the Recurrent Neural Network framework proposed in
PixelRNN.

Architecture
------------

As we have seen, there are two distinct components to the
“two-dimensional” LSTM: the Row LSTM and the Diagonal BiLSTM. Figure 2
illustrates how each of these two LSTMs operates when applied to an
RGB image.

<div class="fig figcenter fighighlight">
<img src="/assets/Screen Shot 2021-06-15 at 9.41.08 AM.png">
</div>

**Row LSTM** is a unidirectional layer that processes the image row by
row from top to bottom, computing features for a whole row at once
using a 1D convolution. As we can see in the image above, the Row LSTM
captures a triangle-shaped context for a given pixel. An LSTM layer
has an input-to-state component and a recurrent state-to-state
component that together determine the four gates inside the LSTM core.
In the Row LSTM, the input-to-state component is computed for the
whole two-dimensional input map with a one-dimensional convolution,
row-wise. The output of the convolution is a $4h \times n \times n$
tensor, where the first dimension represents the four gate vectors for
each position in the input map ($h$ here is the number of output
feature maps). Below are the computations for one step of the layer,
using the previous hidden state ($h_{i-1}$) and previous cell state
($c_{i-1}$):
$$[o_i, f_i, i_i, g_i] = \sigma(\textbf{K}^{ss} \circledast h_{i-1} + \textbf{K}^{is} \circledast \textbf{x}_{i})$$
$$c_i = f_i \odot c_{i-1} + i_i \odot g_i$$
$$h_i = o_i \odot \tanh(c_{i})$$

Here, $\textbf{x}_i$ is the row of the input representation and
$\textbf{K}^{ss}$, $\textbf{K}^{is}$ are the kernel weights for the
state-to-state and input-to-state components respectively. $g_i, o_i,
f_i$ and $i_i$ are the content, output, forget and input gates.
$\sigma$ represents the activation function (tanh activation for the
content gate, and sigmoid for the rest of the gates).
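
The recurrence above can also be read as code. Below is a simplified
PyTorch rendering of a single Row LSTM step; the kernel shapes and odd
kernel widths are assumptions for the sketch, and details such as the
masking of the convolutions are omitted, so this illustrates the
equations rather than reproducing the authors' implementation:

```python
import torch
import torch.nn.functional as F

def row_lstm_step(x_row, h_prev, c_prev, K_is, K_ss):
    """One Row LSTM step: process an entire image row at once.

    x_row, h_prev, c_prev: tensors of shape (1, h, n) -- one row of the
                           input map and the previous row's states.
    K_is, K_ss:            conv kernels of shape (4h, h, k), k odd,
                           producing the four stacked gate pre-activations.
    """
    # Input-to-state and state-to-state components, each a 1D convolution
    # along the row (padding keeps the row length equal to n).
    gates = (F.conv1d(x_row, K_is, padding=K_is.shape[-1] // 2) +
             F.conv1d(h_prev, K_ss, padding=K_ss.shape[-1] // 2))
    o, f, i, g = gates.chunk(4, dim=1)   # split 4h channels into four gates
    o, f, i = torch.sigmoid(o), torch.sigmoid(f), torch.sigmoid(i)
    g = torch.tanh(g)                    # content gate uses tanh
    c = f * c_prev + i * g               # new cell state for this row
    h = o * torch.tanh(c)                # new hidden state for this row
    return h, c
```

Applying this step row after row is what produces the triangle-shaped
context shown in the figure above.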

**Diagonal BiLSTM** The Diagonal BiLSTM is able to capture the entire
image context by scanning along both diagonals of the image, one for
each direction of the LSTM. We first compute the input-to-state and
state-to-state components of the layer. For each of the two
directions, the input-to-state component is simply a $1 \times 1$
convolution $K^{is}$, generating a $4h \times n \times n$ tensor (here
again the first dimension represents the four gate vectors for each
position in the input map, where $h$ is the number of output feature
maps). The state-to-state component is calculated with $K^{ss}$, which
has a kernel of size $2 \times 1$. This step takes the previous hidden
and cell states, combines the contribution of the input-to-state
component and produces the next hidden and cell states, as explained
in the equations for the Row LSTM above. We repeat this process for
each of the two directions.
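
In the paper, the diagonal scan is made efficient by first skewing the
input map: each row is offset by one position with respect to the row
above, so that the diagonals of the image line up as columns that the
convolutions can process in parallel. A minimal sketch of that skewing
step (an illustrative helper of our own, not the paper's code):

```python
import numpy as np

def skew(feature_map):
    """Offset row r by r positions so image diagonals become columns.

    feature_map: array of shape (n, n); returns an array of shape
                 (n, 2n - 1), padded with zeros outside the shifts.
    """
    n = feature_map.shape[0]
    skewed = np.zeros((n, 2 * n - 1), dtype=feature_map.dtype)
    for r in range(n):
        skewed[r, r:r + n] = feature_map[r]  # shift row r right by r
    return skewed
```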

Performance
===========

When originally presented, the PixelRNN model’s performance was tested
on some of the most prominent datasets in the computer vision space,
ImageNet and CIFAR-10, and the results in some cases were
state-of-the-art. On ImageNet, it achieved NLL scores of 3.86 and 3.63
on the 32x32 and 64x64 image sizes respectively. On CIFAR-10, it
achieved an NLL score of 3.00, which was state-of-the-art at the time
of publication.

References
==========

1) CS231n Lecture 11, "Generative Modeling"
2) van den Oord et al., "Pixel Recurrent Neural Networks", 2016