---
layout: page
permalink: /attention/
---

Table of Contents:

- [Motivation](#motivation)
- [General Attention Layers](#attention)
  - [Operations](#operations)
- [Self-Attention](#self)
  - [Masked Self-Attention Layers](#masked)
  - [Multi-Head Self-Attention Layers](#multihead)
- [Summary](#summary)
- [Additional Resources](#resources)

## Attention

We discussed fundamental workhorses of modern deep learning such as Convolutional Neural Networks and Recurrent Neural
Networks in previous sections. This section is devoted to yet another layer -- the attention layer -- that forms a new
primitive for modern Computer Vision and NLP applications.

<a name='motivation'></a>

### Motivation

To motivate the attention layer, let us look at a sample application -- image captioning -- and see what the problem is
with using plain CNNs and RNNs there.

The figure below shows a pipeline of applying such networks on a given image to generate a caption. It first uses a
pre-trained CNN feature extractor to summarize the image, resulting in an image feature vector \\(c = h_0\\). It then
applies a recurrent network to repeatedly generate tokens at each step. After five time steps, the image captioning
model obtains the sentence: "surfer riding on wave".

<div class="fig figcenter fighighlight">
  <img src="/assets/att/captioning.png" width="80%">
</div>

What is the problem here? Notice that the model relies entirely on the context vector \\(c\\) to write the caption --
everything it wants to say about the image needs to be compressed within this single vector. What if we want to be very
specific, and describe every nitty-gritty detail of the image, e.g. the color of the surfer's shirt, or the direction
the waves are facing? A finite-length vector clearly cannot encode all such possibilities, especially once the desired
number of tokens grows to hundreds or thousands.

The central idea of the attention layer is borrowed from the human visual attention system: when we are given a visual
scene and try to understand a specific region of it, we focus our gaze on that region. The attention layer simulates
this process, and *attends* to different parts of the image while generating words to describe it.

With attention in play, a similar diagram showing the pipeline for image captioning is as follows. What's the main
difference? We incorporate two additional matrices: one for *alignment scores*, and the other for *attention*; and have
*different context vectors* \\(c_i\\) at different steps. At each step, the model uses a multi-layer perceptron to
digest the current hidden vector \\(h_i\\) and the input image features, to generate an alignment score matrix of shape
\\(H \times W\\). This score matrix is then fed into a softmax layer that converts it to an attention matrix with
weights summing to one. The attention weights are then used to compute a weighted combination of the image features,
producing the next context vector \\(c_i\\) and allowing the model to focus on different regions of the image at each
step. This entire process is differentiable and enables the model to choose its own attention weights.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/captioning-attention.png" width="60%">
</div>
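
To make this pipeline concrete, here is a minimal numpy sketch of a single attention step. The grid size, feature
dimension, random inputs, and the use of a plain dot product in place of the multi-layer perceptron scorer are all
illustrative assumptions, not the exact captioning model above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical shapes: a 7x7 grid of 512-dimensional CNN features and a hidden vector h_i.
H, W, D = 7, 7, 512
features = np.random.randn(H, W, D)       # CNN feature grid
h = np.random.randn(D)                    # current hidden vector h_i

# Alignment scores (H x W): a simple dot product between h and each feature vector
# stands in for the small MLP described above.
e = features.reshape(H * W, D) @ h        # (H*W,)
a = softmax(e).reshape(H, W)              # attention weights, summing to one

# Context vector c_i: attention-weighted combination of the image features.
c = (a[..., None] * features).sum(axis=(0, 1))   # (D,)
```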

<a name='attention'></a>

### General Attention Layers

While the previous section details the application of an attention layer in image captioning, we next present a more
general and principled formulation of the attention layer, de-contextualizing it from the image captioning and recurrent
network settings. In a general setting, the attention layer is a layer with input and output vectors, and five major
operations. These are illustrated in the following diagrams.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/attention.png" width="70%">
  <div class="figcaption">Left: A General Attention Layer. Right: A Self-Attention Layer.</div>
</div>

As illustrated, inputs to an attention layer contain input vectors \\(X\\) and query vectors \\(Q\\). The input vectors,
\\(X\\), are of shape \\(N \times D_x\\) while the query vectors \\(Q\\) are of shape \\(M \times D_k\\). In the image
captioning example, input vectors are the image features while query vectors are the hidden states of the recurrent
network. Outputs of an attention layer are the vectors \\(Y\\) of shape \\(M \times D_v\\), at the top.

The bulk of the attention operations is illustrated as the colorful grids in the middle, and contains two major types
of operations: linear key-value maps, and the align & attend operations that we saw earlier in the image captioning
example.

<a name='operations'></a>

#### Operations

**Linear Key and Value Transformations.** These operations are linear transformations that convert the input vectors
\\(X\\) to two alternative sets of vectors:

- Key vectors \\(K\\): These vectors are obtained by using the linear equation \\(K = X W_k\\) where \\(W_k\\) is a
  learnable weight matrix of shape \\(D_x \times D_k\\), converting from input vector dimension \\(D_x\\) to key
  dimension \\(D_k\\). The resulting keys have the same dimension as the query vectors, to enable alignment.
- Value vectors \\(V\\): Similarly, the equation to derive these vectors is the linear rule \\(V = X W_v\\)
  where \\(W_v\\) is of shape \\(D_x \times D_v\\). The value vectors have the same dimension as the output vectors.

By applying these fully-connected layers on top of the inputs, the attention model achieves additional expressivity.

**Alignment.** Core to the attention layer are two fundamental operations: alignment and attention. In the alignment
step, while more complex functions are possible, practitioners often opt for a simple similarity function: pairwise
dot products between key and query vectors.

Moreover, for vectors with a larger dimensionality, more terms are multiplied and summed in the dot product, and this
usually implies a larger variance. Alignment scores with larger magnitude dominate the subsequent softmax, so that many
inputs receive near-zero attention. To deal with this issue, a scaling factor, the reciprocal of \\(\sqrt{D_k}\\), is
often incorporated to reduce the alignment scores. This scaling reduces the effect of large-magnitude terms, so that the
resulting attention weights are more spread out. The alignment computation can be summarized as the following equation:

$$ e_{i,j} = \frac{q_j \cdot k_i}{\sqrt{D_k}} $$

**Attention.** The attention matrix is obtained by applying the softmax function column-wise to the alignment matrix.

$$ \mathbf{a} = \text{softmax}(\mathbf{e}) $$

The output vectors are finally calculated as attention-weighted sums of the value vectors:

$$ y_j = \sum_{i} a_{i,j} v_i $$
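
Putting these operations together, the following numpy sketch runs one pass through a general attention layer under the
conventions above (rows of the alignment matrix index inputs, columns index queries); all sizes and weights are
hypothetical placeholders rather than part of any specific model.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def general_attention(X, Q, W_k, W_v):
    """X: (N, D_x) input vectors; Q: (M, D_k) query vectors."""
    K = X @ W_k                        # key vectors,   shape (N, D_k)
    V = X @ W_v                        # value vectors, shape (N, D_v)
    D_k = Q.shape[1]
    E = (K @ Q.T) / np.sqrt(D_k)       # alignment scores e_{i,j}, shape (N, M)
    A = softmax(E, axis=0)             # attention weights; each column sums to one
    Y = A.T @ V                        # outputs y_j = sum_i a_{i,j} v_i, shape (M, D_v)
    return Y

# Hypothetical sizes.
N, M, D_x, D_k, D_v = 6, 3, 8, 4, 5
X, Q = np.random.randn(N, D_x), np.random.randn(M, D_k)
W_k, W_v = np.random.randn(D_x, D_k), np.random.randn(D_x, D_v)
print(general_attention(X, Q, W_k, W_v).shape)   # (3, 5)
```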

<a name='self'></a>

### Self-Attention

While we explained the general attention layer above, the self-attention layer refers to the special case where, similar
to the key and value vectors, the query vectors \\(Q\\) are also expressed as a linear transformation of the input
vectors: \\(Q = X W_q\\), where \\(W_q\\) is of shape \\(D_x \times D_k\\). With this expression of the query vectors as
a linear function of the inputs, the attention layer is self-contained. This is illustrated on the right of the figure
above.
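
A minimal sketch of this special case, reusing the `general_attention` function from the sketch above; the query weight
matrix `W_q` is the only new (and here hypothetical) ingredient.

```python
def self_attention(X, W_q, W_k, W_v):
    """Self-attention: the queries are a linear function of the same inputs X."""
    Q = X @ W_q                                  # query vectors, shape (N, D_k)
    return general_attention(X, Q, W_k, W_v)     # outputs, shape (N, D_v)
```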

**Permutation Equivariance.** It is worth noting that the self-attention layer is equivariant to the order of the input
vectors: if we apply a permutation to the input vectors, the outputs will be permuted in exactly the same way. This is
illustrated in the following diagram.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/permutation.png" width="55%">
  <div class="figcaption">Permutation Equivariance of Self-Attention Layers.</div>
</div>

**Positional Encoding.** While the self-attention layer is agnostic to the ordering of its inputs, practical applications
often require some notion of ordering. For example, in natural language sequences, the relative ordering of words often
plays a pivotal role in the meaning of the entire sentence. This necessitates the inclusion of a positional encoding
component in the self-attention module, to endow the model with the ability to determine the positions of its inputs.
This component should satisfy a number of desiderata:

- The positional encodings should be *unique* for each time step.
- The *distance* between any two consecutive encodings should be the same.
- The positional encoding function should generalize to arbitrarily *long* sequences.
- The function should be *deterministic*.

While there exist a number of functions that satisfy the above criteria, a commonly used method makes use of interleaved
sine and cosine values. Concretely, the encoding function looks like the following:

$$ p(t)
= [\sin(w_1 \cdot t), \cos(w_1 \cdot t), \sin(w_2 \cdot t), \cos(w_2 \cdot t), \cdots, \sin(w_{d/2} \cdot t), \cos(w_{d/2} \cdot t)]
$$

where the frequency \\(w_k = \frac{1}{10000^{2k/d}}\\). What does this function encode? The following diagram gives an
intuitive analogue of the same idea in the binary domain:

<div class="fig figcenter fighighlight">
  <img src="/assets/att/position-binary.png" width="50%">
</div>

The frequencies \\(w_k\\) vary across dimensions to represent the relative positions of the inputs, much like the 0s and
1s in the binary case. In practice, the positional encoding component concatenates this additional information to the
input vectors before they are passed to the self-attention module:

<div class="fig figcenter fighighlight">
  <img src="/assets/att/position.png" width="15%">
</div>
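
A numpy sketch of this sinusoidal encoding, concatenated to a batch of hypothetical input vectors, is shown below; it
assumes an even encoding dimension and indexes the frequencies from zero, both of which are illustrative choices.

```python
import numpy as np

def positional_encoding(T, d):
    """Sinusoidal encodings: one d-dimensional vector p(t) per position t = 0..T-1 (d assumed even)."""
    t = np.arange(T)[:, None]                  # positions, shape (T, 1)
    k = np.arange(d // 2)[None, :]             # frequency index, shape (1, d/2)
    w = 1.0 / (10000 ** (2 * k / d))           # frequencies w_k
    P = np.empty((T, d))
    P[:, 0::2] = np.sin(w * t)                 # even dimensions: sin(w_k * t)
    P[:, 1::2] = np.cos(w * t)                 # odd dimensions:  cos(w_k * t)
    return P

# Concatenate the encodings to hypothetical inputs before self-attention.
X = np.random.randn(10, 64)                                         # 10 input vectors
X_pos = np.concatenate([X, positional_encoding(10, 16)], axis=1)    # shape (10, 80)
```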

**A Comparison Between General Attention and Self-Attention.** The general attention layer has access to three sets of
vectors: key, value, and query vectors. In comparison, the self-attention layer is entirely self-enclosed, and instead
parameterizes the three sets of vectors as linear functions of the inputs.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/comparison.png" width="60%">
</div>

<a name='masked'></a>

#### Masked Self-Attention Layers

While the positional encoding layer injects some positional information, in more critical applications it may be
necessary to give the model a clearer notion of relative input orderings and to prevent it from *looking ahead* at
future vectors. To this end, the *masked* self-attention layer explicitly sets the lower-triangular part of the
alignment matrix (the entries that correspond to future vectors) to negative infinity, so that these future vectors are
ignored while the model processes earlier ones.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/masked.png" width="30%">
</div>
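
A minimal numpy sketch of this masking step is given below, using the document's convention that rows of the alignment
matrix index inputs and columns index queries, so the strictly lower-triangular entries are the "future" ones; the
scores themselves are random placeholders.

```python
import numpy as np

N = 5
E = np.random.randn(N, N)                             # alignment scores e_{i,j} for a self-attention layer
future = np.tril(np.ones((N, N), dtype=bool), k=-1)   # i > j: input i lies in query j's future
E_masked = np.where(future, -np.inf, E)               # masked entries get negative infinity

# Column-wise softmax now assigns exactly zero attention to future inputs.
A = np.exp(E_masked - E_masked.max(axis=0, keepdims=True))
A /= A.sum(axis=0, keepdims=True)
```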

<a name='multihead'></a>

#### Multi-Head Self-Attention Layers

Yet another way to increase the expressivity of the model is to exploit the notion of *multi-head* attention. Instead of
using one single self-attention layer, multi-head attention utilizes multiple, parallel attention heads. In some cases,
to maintain the total computation, the key and value dimensions \\(D_k, D_v\\) may be reduced accordingly. The benefit
of using multiple attention heads is that it allows the model to focus on different aspects of the input vectors.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/multihead.png" width="60%">
</div>
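
A minimal sketch of this idea, reusing the `self_attention` function from above: each head receives its own
(hypothetical) reduced-dimension weight matrices, and the per-head outputs are concatenated along the feature dimension.

```python
import numpy as np   # builds on the self_attention / general_attention sketches above

def multi_head_self_attention(X, heads):
    """heads: a list of (W_q, W_k, W_v) triples, one per head, each with reduced D_k and D_v."""
    outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=1)         # concatenate per-head outputs feature-wise

# Hypothetical setup: 4 heads, each projecting D_x = 64 down to D_k = D_v = 16,
# so the concatenated output is again 64-dimensional.
D_x, D_h, n_heads = 64, 16, 4
heads = [tuple(np.random.randn(D_x, D_h) for _ in range(3)) for _ in range(n_heads)]
X = np.random.randn(10, D_x)
print(multi_head_self_attention(X, heads).shape)   # (10, 64)
```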

<a name='summary'></a>

### Summary

To summarize this section,

- We motivated and introduced a layer that has become popular in deep learning, the **attention** layer.
- We introduced it in its general formulation and, in particular, studied details of the **align and attend** operations.
- We then specialized to the case of a **self-attention** layer.
- We learned that self-attention layers are **permutation-equivariant** with respect to their input vectors.
- To retain some positional information, self-attention layers use a **positional-encoding** function.
- Moreover, we also studied two extensions of the vanilla self-attention layer: the **masked** attention layer, and
  **multi-head** attention. While the former prevents the model from looking ahead, the latter serves to increase its
  expressivity.

<a name='resources'></a>

### Additional Resources

- [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](http://proceedings.mlr.press/v37/xuc15.pdf)
  presents an application of the attention layer to image captioning.
- [Women also Snowboard: Overcoming Bias in Captioning Models](https://arxiv.org/pdf/1803.09797.pdf) exploits the
  attention layer to detect gender bias in image captioning models.
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) applies
  attention to natural language translation.
- [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) is the seminal paper on attention-based
  Transformers, which took the Vision and NLP communities by storm.