PixelRNN
========

We now give a brief overview of PixelRNN. PixelRNN belongs to a family
of explicit density models called **fully visible belief networks
(FVBN)**. We can represent our model with the following equation:
$$p(x) = p(x_1, x_2, \dots, x_n),$$ where the left hand side $p(x)$
represents the likelihood of an entire image $x$, and the right hand
side represents the joint likelihood of each pixel in the image. Using
the chain rule, we can decompose this likelihood into a product of
1-dimensional distributions:
$$p(x) = \prod_{i = 1}^n p(x_i \mid x_1, \dots, x_{i - 1}).$$ By
maximizing the likelihood of the training data, we obtain our model,
PixelRNN.
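
To make the factorization concrete, here is a minimal sketch of how the log-likelihood of an image decomposes into a sum of per-pixel conditional log-probabilities. It is written in PyTorch; the function name is illustrative, and the logits are assumed to come from some autoregressive network that only looks at earlier pixels.

```python
import torch
import torch.nn.functional as F

def image_log_likelihood(logits, pixels):
    """Sum of per-pixel conditional log-probabilities.

    logits: (num_pixels, 256) scores for each possible pixel value, where
            row i is assumed to depend only on pixels x_1 ... x_{i-1}.
    pixels: (num_pixels,) observed integer pixel values in raster order.
    """
    log_probs = F.log_softmax(logits, dim=-1)           # log p(x_i = v | x_<i)
    picked = log_probs.gather(1, pixels.unsqueeze(1))   # log p(x_i | x_<i)
    return picked.sum()                                 # log p(x) = sum_i log p(x_i | x_<i)

# Training minimizes the negative log-likelihood of the training images:
# loss = -image_log_likelihood(model(image), image_pixels)
```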

Introduction
============

PixelRNN, first introduced in van den Oord et al. 2016, uses an RNN-like
structure, modeling the pixels one-by-one, to maximize the likelihood
function given above. One of the more difficult tasks in generative
modeling is to create a model that is tractable, and PixelRNN seeks to
address that. It does so by tractably modeling a joint distribution of
the pixels in the image, casting it as a product of conditional
distributions. The factorization turns the joint modeling problem into
a sequence problem, i.e., we have to predict the next pixel given all
the previously generated pixels. Thus, we use Recurrent Neural Networks
for this task, as they learn sequentially. More precisely, we generate
image pixels starting from the top left corner, and we model each
pixel’s dependency on previous pixels using an RNN (LSTM).
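
Generation therefore proceeds as a raster scan: each new pixel is sampled from the conditional distribution computed from everything generated so far. Below is a minimal sketch of that loop; `conditional_logits` is a hypothetical stand-in for whatever network (e.g. the Row LSTM or Diagonal BiLSTM described later) produces the 256-way scores for the next pixel.

```python
import torch

def sample_image(conditional_logits, n):
    """Sample an n x n (single-channel) image pixel by pixel in raster order.

    conditional_logits(canvas, i) is assumed to return a (256,) tensor of
    scores for pixel i, computed only from the already-filled pixels.
    """
    canvas = torch.zeros(n * n, dtype=torch.long)
    for i in range(n * n):                        # top-left to bottom-right
        logits = conditional_logits(canvas, i)    # depends only on x_1 .. x_{i-1}
        probs = torch.softmax(logits, dim=-1)
        canvas[i] = torch.multinomial(probs, 1)   # draw x_i ~ p(x_i | x_<i)
    return canvas.view(n, n)
```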

<div class="fig figcenter fighighlight">
  <img src="/assets/pixelrnn.png">
</div>

Specifically, the PixelRNN framework is made up of twelve
two-dimensional LSTM layers, with convolutions applied to each dimension
of the data. There are two types of layers here. One is the Row LSTM
layer, where the convolution is applied along each row. The second type
is the Diagonal BiLSTM layer, where the convolution is applied to the
diagonals of the image. In addition, the pixel values are modeled as
discrete values using a multinomial distribution implemented with a
softmax layer. This is in contrast to many previous approaches, which
model pixels as continuous values.
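
Concretely, each output position produces 256 scores per color channel, and the softmax over those scores gives the discrete distribution over pixel values; training then reduces to a standard cross-entropy loss. A small sketch of this output head for a single channel, with illustrative (not paper-exact) shapes:

```python
import torch
import torch.nn as nn

# A 1x1 convolution maps h feature maps to 256 scores per pixel; the softmax
# over these scores is the multinomial distribution over discrete pixel values.
# The full model has one such head per color channel. Shapes are illustrative.
h, n, batch = 32, 28, 4
features = torch.randn(batch, h, n, n)               # output of the LSTM stack
to_logits = nn.Conv2d(h, 256, kernel_size=1)
logits = to_logits(features)                         # (batch, 256, n, n)

targets = torch.randint(0, 256, (batch, n, n))       # observed pixel values
loss = nn.functional.cross_entropy(logits, targets)  # mean of -log p(x_i | x_<i)
```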

Model
=====

The approach of the PixelRNN is as follows. The RNN scans each
individual pixel, going row-wise, predicting the conditional
distribution over the possible pixel values given the context the
network has seen. As mentioned before, PixelRNN uses a two-dimensional
LSTM network which begins scanning at the top left of the image and
makes its way to the bottom right. One of the reasons an LSTM is used is
that it can better capture longer-range dependencies between pixels -
this is essential for understanding image composition. The reason a
two-dimensional structure is used is to ensure that the signals
propagate well in both the left-to-right and top-to-bottom directions.

The input image to the network is represented by a 1D vector of pixel
values $\{x_1,..., x_{n^2}\}$ for an $n$-by-$n$ sized image, where
$\{x_1,..., x_{n}\}$ represents the pixels from the first row. Our goal
is to use these pixel values to find a probability distribution $p(x)$
for each image $x$. We define this probability as:
$$p(x) = \prod_{i = 1}^{n^2} p(x_i \mid x_1, \dots, x_{i - 1}).$$

This is the product of the conditional distributions across all the
pixels in the image - for pixel $x_i$, we have
$p(x_i \mid x_1, \dots, x_{i - 1})$. In turn, each of these conditional
distributions is itself factorized over the three color channels present
in the image (red, green and blue). In other words:

$$p(x_i \mid x_1, \dots, x_{i - 1}) = p(x_{i,R} \mid \textbf{x}_{<i}) \cdot p(x_{i,G} \mid \textbf{x}_{<i}, x_{i,R}) \cdot p(x_{i,B} \mid \textbf{x}_{<i}, x_{i,R}, x_{i,G}).$$

In the next section we will see how these distributions are calculated
and used within the Recurrent Neural Network framework proposed in
PixelRNN.
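
As a small illustration of this channel ordering, the sketch below samples the three sub-pixels of a single pixel one after another, feeding each sampled channel back in before predicting the next. The three `*_logits` callables are hypothetical stand-ins for the corresponding network heads, not names from the paper.

```python
import torch

def sample_pixel(context, red_logits, green_logits, blue_logits):
    """Sample (R, G, B) for one pixel, respecting the factorization
    p(R | x_<i) * p(G | x_<i, R) * p(B | x_<i, R, G).

    Each *_logits callable is assumed to return a (256,) tensor of scores.
    """
    r = torch.multinomial(torch.softmax(red_logits(context), -1), 1)
    g = torch.multinomial(torch.softmax(green_logits(context, r), -1), 1)
    b = torch.multinomial(torch.softmax(blue_logits(context, r, g), -1), 1)
    return r.item(), g.item(), b.item()
```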

Architecture
------------

As we have seen, there are two distinct components to the
“two-dimensional” LSTM, the Row LSTM and the Diagonal BiLSTM. Figure 2
illustrates how each of these two LSTMs operates, when applied to an RGB
image.

<div class="fig figcenter fighighlight">
  <img src="/assets/Screen Shot 2021-06-15 at 9.41.08 AM.png">
</div>

**Row LSTM** is a unidirectional layer that processes the image row by
row from top to bottom, computing features for a whole row at once using
a 1D convolution. As we can see in the image above, the Row LSTM
captures a triangle-shaped context for a given pixel. An LSTM layer has
an input-to-state component and a recurrent state-to-state component
that together determine the four gates inside the LSTM core. In the Row
LSTM, the input-to-state component is computed for the whole
two-dimensional input map with a one-dimensional convolution, row-wise.
The output of the convolution is a $4h \times n \times n$ tensor, where
the first dimension represents the four gate vectors for each position
in the input map ($h$ here is the number of output feature maps). Below
are the computations for one step, using the previous hidden state
($h_{i-1}$) and previous cell state ($c_{i-1}$):
$$[o_i, f_i, i_i, g_i] = \sigma(\textbf{K}^{ss} \circledast h_{i-1} + \textbf{K}^{is} \circledast \textbf{x}_{i})$$
$$c_i = f_i \odot c_{i-1} + i_i \odot g_i$$
$$h_i = o_i \odot \tanh(c_{i})$$

Here, $\textbf{x}_i$ is row $i$ of the input representation and
$\textbf{K}^{ss}$, $\textbf{K}^{is}$ are the kernel weights for the
state-to-state and input-to-state convolutions respectively.
$g_i, o_i, f_i$ and $i_i$ are the content, output, forget and input
gates. $\sigma$ represents the activation function (tanh activation for
the content gate, and sigmoid for the rest of the gates).
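
A hedged sketch of a single Row LSTM step in PyTorch follows: both convolutions run along the row, their sum is split into the four gates, and the usual LSTM update is applied. The kernel size of 3 is an assumption (a common choice) rather than something fixed by the text above, and the masking needed to keep the input-to-state convolution causal is omitted for brevity.

```python
import torch
import torch.nn as nn

class RowLSTMStep(nn.Module):
    """One row-to-row step of a Row LSTM (illustrative sketch)."""
    def __init__(self, in_channels, h):
        super().__init__()
        self.h = h
        self.input_to_state = nn.Conv1d(in_channels, 4 * h, 3, padding=1)
        self.state_to_state = nn.Conv1d(h, 4 * h, 3, padding=1)

    def forward(self, x_row, h_prev, c_prev):
        # x_row: (B, in_channels, W); h_prev, c_prev: (B, h, W)
        gates = self.state_to_state(h_prev) + self.input_to_state(x_row)
        o, f, i, g = torch.split(gates, self.h, dim=1)   # four gate vectors
        o, f, i = torch.sigmoid(o), torch.sigmoid(f), torch.sigmoid(i)
        g = torch.tanh(g)                                # content gate uses tanh
        c = f * c_prev + i * g
        h = o * torch.tanh(c)
        return h, c
```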

**Diagonal BiLSTM** The Diagonal BiLSTM is able to capture the entire
available image context by scanning the image diagonally, once in each
of the two directions of the LSTM. We first compute the input-to-state
and state-to-state components of the layer. For each of the directions,
the input-to-state component is simply a $1 \times 1$ convolution
$K^{is}$, generating a $4h \times n \times n$ tensor (here again the
first dimension represents the four gate vectors for each position in
the input map, where $h$ is the number of output feature maps). The
state-to-state component is calculated using $K^{ss}$, which has a
kernel of size $2 \times 1$. This step takes the previous hidden and
cell states, combines them with the contribution of the input-to-state
component, and produces the next hidden and cell states, as explained in
the equations for the Row LSTM above. We repeat this process for each of
the two directions.
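
In the original paper, this diagonal scan is made convolution-friendly by skewing the input map so that each diagonal lines up as a column, running the column-wise recurrence, and then undoing the skew. A minimal sketch of that skew and its inverse is below; the helper names are illustrative, not part of any library API.

```python
import torch

def skew(x):
    """Skew a (B, C, H, W) map into (B, C, H, 2W - 1) so that each diagonal
    of the original image lines up in a column: row r is shifted right by r."""
    B, C, H, W = x.shape
    out = x.new_zeros(B, C, H, 2 * W - 1)
    for r in range(H):
        out[:, :, r, r:r + W] = x[:, :, r]
    return out

def unskew(x, W):
    """Inverse of skew: recover the original (B, C, H, W) map."""
    B, C, H, _ = x.shape
    return torch.stack([x[:, :, r, r:r + W] for r in range(H)], dim=2)
```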

Performance
===========

When originally presented, the PixelRNN model’s performance was tested
on some of the most prominent datasets in the computer vision space -
ImageNet and CIFAR-10. The results in some cases were state-of-the-art.
On the ImageNet dataset, it achieved NLL scores of 3.86 and 3.63 on the
32x32 and 64x64 image sizes respectively. On CIFAR-10, it achieved an
NLL score of 3.00, which was state-of-the-art at the time of
publication.

References
==========

1) CS231n Lecture 11, 'Generative Modeling'
2) Pixel Recurrent Neural Networks (van den Oord et al.), 2016