---
layout: page
permalink: /attention/
---

Table of Contents:

- [Motivation](#motivation)
- [General Attention Layers](#attention)
- [Operations](#operations)
- [Self-Attention](#self)
- [Masked Self-Attention Layers](#masked)
- [Multi-Head Self-Attention Layers](#multihead)
- [Summary](#summary)
- [Additional Resources](#resources)

## Attention

We discussed fundamental workhorses of modern deep learning such as Convolutional Neural Networks and Recurrent Neural Networks in previous sections. This section is devoted to yet another layer -- the attention layer -- that forms a new primitive for modern Computer Vision and NLP applications.

<a name='motivation'></a>

### Motivation

To motivate the attention layer, let us look at a sample application, image captioning, and see what the problem is with using plain CNNs and RNNs there.

The figure below shows a pipeline that applies such networks to a given image to generate a caption. It first uses a pre-trained CNN feature extractor to summarize the image, resulting in an image feature vector \\(c = h_0\\). It then applies a recurrent network to repeatedly generate tokens, one at each step. After five time steps, the image captioning model obtains the sentence: "surfer riding on wave".

<div class="fig figcenter fighighlight">
<img src="/assets/att/captioning.png" width="80%">
</div>

What is the problem here? Notice that the model relies entirely on the context vector \\(c\\) to write the caption -- everything it wants to say about the image must be compressed within this single vector. What if we want to be very specific and describe every nitty-gritty detail of the image, e.g. the color of the surfer's shirt, or the facing direction of the waves? Obviously, a finite-length vector cannot encode all such possibilities, especially if the desired number of tokens grows to hundreds or thousands.

The central idea of the attention layer is borrowed from the human visual attention system: when we are given a visual scene and try to understand a specific region of it, we focus our eyesight on that region. The attention layer simulates this process, and *attends* to different parts of the image while generating the words that describe it.

With attention in play, a similar diagram showing the pipeline for image captioning is as follows. What is the main difference? We incorporate two additional matrices, one for *alignment scores* and the other for *attention*, and use *different context vectors* \\(c_i\\) at different steps. At each step, the model uses a multi-layer perceptron to digest the current hidden vector \\(h_i\\) and the input image features, producing an alignment score matrix of shape \\(H \times W\\). This score matrix is then fed into a softmax layer that converts it into an attention matrix whose weights sum to one. The weights in the attention matrix are next multiplied element-wise with the image features, allowing the model to focus on different regions of the image. This entire process is differentiable and lets the model learn its own attention weights.

<div class="fig figcenter fighighlight">
<img src="/assets/att/captioning-attention.png" width="60%">
</div>

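To make the align-and-attend step concrete, here is a minimal numpy sketch of a single captioning step. The shapes (a \\(7 \times 7\\) grid of image features, a hidden state of size \\(D\\)) and the small two-layer `align` network are assumptions chosen for illustration, not the exact architecture used in any particular captioning model.

```python
import numpy as np

H, W, D = 7, 7, 512                                # assumed feature-grid shape
rng = np.random.default_rng(0)
feats = rng.normal(size=(H, W, D))                 # CNN image features
h = rng.normal(size=(D,))                          # current RNN hidden state
W1, W2 = rng.normal(size=(2 * D, D)) * 0.01, rng.normal(size=(D,)) * 0.01

def align(h, f):
    """Hypothetical MLP producing a scalar alignment score for one grid location."""
    return np.tanh(np.concatenate([h, f]) @ W1) @ W2

scores = np.array([[align(h, feats[i, j]) for j in range(W)] for i in range(H)])  # (H, W)
attn = np.exp(scores - scores.max())
attn /= attn.sum()                                 # attention weights sum to one
c = (attn[..., None] * feats).sum(axis=(0, 1))     # context vector for this step, shape (D,)
```

The resulting context vector then plays the role of \\(c_i\\) at step \\(i\\), replacing the single fixed \\(c\\) from the earlier pipeline.
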
<a name='attention'></a>

### General Attention Layers

While the previous section details the application of an attention layer in image captioning, we next present a more general and principled formulation of the attention layer, de-contextualizing it from the image captioning and recurrent network settings. In a general setting, the attention layer is a layer with input and output vectors, and five major operations. These are illustrated in the following diagrams.

<div class="fig figcenter fighighlight">
<img src="/assets/att/attention.png" width="70%">
<div class="figcaption">Left: A General Attention Layer. Right: A Self-Attention Layer.</div>
</div>

As illustrated, the inputs to an attention layer consist of input vectors \\(X\\) and query vectors \\(Q\\). The input vectors, \\(X\\), are of shape \\(N \times D_x\\), while the query vectors \\(Q\\) are of shape \\(M \times D_k\\). In the image captioning example, the input vectors are the image features and the query vectors are the hidden states of the recurrent network. The outputs of the attention layer are the vectors \\(Y\\) of shape \\(M \times D_v\\), at the top.

The bulk of the attention operations is illustrated as the colorful grids in the middle, and consists of two major types of operations: the linear key and value maps, and the align & attend operations that we saw earlier in the image captioning example.

<a name='operations'></a>

#### Operations

**Linear Key and Value Transformations.** These operations are linear transformations that convert the input vectors \\(X\\) into two alternative sets of vectors:

- Key vectors \\(K\\): These vectors are obtained via the linear equation \\(K = X W_k\\), where \\(W_k\\) is a learnable weight matrix of shape \\(D_x \times D_k\\) that converts from the input dimension \\(D_x\\) to the key dimension \\(D_k\\). The resulting keys have the same dimension as the query vectors, to enable alignment.
- Value vectors \\(V\\): Similarly, these vectors are derived by the linear rule \\(V = X W_v\\), where \\(W_v\\) is of shape \\(D_x \times D_v\\). The value vectors have the same dimension as the output vectors.

By applying these fully-connected layers on top of the inputs, the attention model gains additional expressivity.

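In code, the two transformations are just matrix multiplications. The sketch below is illustrative only: the sizes are arbitrary, and the weight matrices, which would be learned in practice, are randomly initialized here.

```python
import numpy as np

N, D_x, D_k, D_v = 6, 16, 8, 10                  # arbitrary example sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D_x))                    # input vectors
W_k = rng.normal(size=(D_x, D_k)) * 0.1          # learnable in practice
W_v = rng.normal(size=(D_x, D_v)) * 0.1

K = X @ W_k   # keys,   shape (N, D_k): same dimension as the queries
V = X @ W_v   # values, shape (N, D_v): same dimension as the outputs
```
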
**Alignment.** Core to the attention layer are two fundamental operations: alignment and attention. In the alignment step, while more complex functions are possible, practitioners often opt for a simple function between vectors: pairwise dot products between the key and query vectors.

Moreover, for vectors with a larger dimensionality, more terms are multiplied and summed in the dot product, which usually implies a larger variance. Scores with larger magnitude then dominate the subsequent softmax computation, so most entries receive vanishingly small attention weights. To deal with this issue, a scaling factor, the reciprocal of \\(\sqrt{D_k}\\), is often incorporated to shrink the alignment scores. This scaling reduces the effect of large-magnitude terms, so that the resulting attention weights are more spread out. The alignment computation can be summarized by the following equation:

$$ e_{i,j} = \frac{q_j \cdot k_i}{\sqrt{D_k}} $$

**Attention.** The attention matrix is obtained by applying the softmax function column-wise to the alignment matrix, so that the weights for each query sum to one:

$$ \mathbf{a} = \text{softmax}(\mathbf{e}) $$

The output vectors are finally computed as weighted sums of the value vectors, with weights given by the attention matrix:

$$ y_j = \sum_{i} a_{i,j} v_i $$

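Putting the five operations together, here is a minimal, self-contained numpy sketch of the general attention layer under the shape conventions above. The sizes in the usage example are arbitrary, and the randomly initialized weight matrices stand in for learned parameters.

```python
import numpy as np

def attention(X, Q, W_k, W_v):
    """General attention layer: X is (N, D_x), Q is (M, D_k); returns Y of shape (M, D_v)."""
    K = X @ W_k                                   # keys,   (N, D_k)
    V = X @ W_v                                   # values, (N, D_v)
    E = (K @ Q.T) / np.sqrt(K.shape[1])           # alignment scores e[i, j], (N, M)
    E -= E.max(axis=0, keepdims=True)             # subtract column max for numerical stability
    A = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True)  # column-wise softmax, (N, M)
    Y = A.T @ V                                   # y_j = sum_i a[i, j] * v_i, (M, D_v)
    return Y

# Example usage with arbitrary sizes.
rng = np.random.default_rng(0)
N, M, D_x, D_k, D_v = 6, 3, 16, 8, 10
Y = attention(rng.normal(size=(N, D_x)), rng.normal(size=(M, D_k)),
              rng.normal(size=(D_x, D_k)) * 0.1, rng.normal(size=(D_x, D_v)) * 0.1)
assert Y.shape == (M, D_v)
```
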
<a name='self'></a>

### Self-Attention

While the general attention layer is explained above, the self-attention layer refers to the special case where, similar to the key and value vectors, the query vectors \\(Q\\) are also expressed as a linear transformation of the input vectors: \\(Q = X W_q\\), where \\(W_q\\) is of shape \\(D_x \times D_k\\). With the query vectors expressed as a linear function of the inputs, the attention layer becomes self-contained. This is illustrated on the right of the figure above.

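A minimal sketch of a self-attention layer, reusing the same operations as the general layer above but deriving the queries from the inputs (the weight matrices are again random stand-ins for learned parameters):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys and values are all linear maps of X (N, D_x)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # (N, D_k), (N, D_k), (N, D_v)
    E = (K @ Q.T) / np.sqrt(K.shape[1])           # alignment scores, (N, N)
    E -= E.max(axis=0, keepdims=True)             # numerical stability
    A = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True)  # column-wise softmax
    return A.T @ V                                # one output per input, (N, D_v)
```
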
**Permutation Equivariance.** It is worth noting that the self-attention layer has no inherent notion of the order of its input vectors: if we apply a permutation to the input vectors, the outputs are permuted in exactly the same way (the layer is permutation *equivariant*). This is illustrated in the following diagram.

<div class="fig figcenter fighighlight">
<img src="/assets/att/permutation.png" width="55%">
<div class="figcaption">Permutation equivariance of self-attention layers.</div>
</div>

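We can verify this numerically by reusing the `self_attention` sketch from above: permuting the rows of \\(X\\) simply permutes the rows of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                       # 5 input vectors, D_x = 16
W_q, W_k = (rng.normal(size=(16, 8)) * 0.1 for _ in range(2))
W_v = rng.normal(size=(16, 10)) * 0.1
perm = rng.permutation(5)

out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)
assert np.allclose(out_perm, out[perm])            # same outputs, in the same permuted order
```
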
**Positional Encoding.** While the self-attention layer is agnostic to the ordering of its inputs, practical applications often require some notion of ordering. In natural language sequences, for example, the relative ordering of the words often plays a pivotal role in determining the meaning of the entire sentence. This necessitates the inclusion of a positional encoding component in the self-attention module, to give the model the ability to determine the positions of its inputs. This component should satisfy a number of desiderata:

- The positional encodings should be *unique* for each time step.
- The *distance* between any two consecutive encodings should be the same.
- The positional encoding function should generalize to arbitrarily *long* sequences.
- The function should be *deterministic*.

While there exist a number of functions that satisfy the above criteria, a commonly used method makes use of interleaved sine and cosine values. Concretely, the encoding function looks like the following:

$$ p(t) = [\sin(w_1 \cdot t), \cos(w_1 \cdot t), \sin(w_2 \cdot t), \cos(w_2 \cdot t), \cdots, \sin(w_{d/2} \cdot t), \cos(w_{d/2} \cdot t)] $$

where the frequency is \\(w_k = \frac{1}{10000^{2k/d}}\\). What does this function encode? The following diagram gives an intuitive picture of the same idea in the binary domain:

<div class="fig figcenter fighighlight">
<img src="/assets/att/position-binary.png" width="50%">
</div>

The frequencies \\(w_k\\) vary across the encoding dimensions to represent the positions of the inputs, in a similar vein to the alternating 0s and 1s in the binary case. In practice, the positional encoding component concatenates this additional information to the input vectors before they are passed to the self-attention module:

<div class="fig figcenter fighighlight">
<img src="/assets/att/position.png" width="15%">
</div>

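A minimal numpy sketch of this sinusoidal encoding, following the formula above. The sequence length and dimension are arbitrary, and whether the encoding is concatenated to the inputs (as described here) or added to them is a design choice that varies across models.

```python
import numpy as np

def positional_encoding(num_steps, d):
    """Sinusoidal positional encodings, one d-dimensional row per time step t."""
    t = np.arange(num_steps)[:, None]             # time steps, (num_steps, 1)
    k = np.arange(1, d // 2 + 1)[None, :]         # frequency index k = 1 .. d/2
    w = 1.0 / (10000 ** (2 * k / d))              # frequencies w_k
    P = np.empty((num_steps, d))
    P[:, 0::2] = np.sin(w * t)                    # even dims: sin(w_k * t)
    P[:, 1::2] = np.cos(w * t)                    # odd dims:  cos(w_k * t)
    return P

X = np.random.default_rng(0).normal(size=(10, 16))                # 10 inputs, D_x = 16
X_pos = np.concatenate([X, positional_encoding(10, 16)], axis=1)  # concatenate, as in the text
```
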
**A Comparison Between General Attention and Self-Attention.** The general attention layer has access to three sets of vectors: key, value, and query vectors. In comparison, the self-attention layer is entirely self-contained: it instead parameterizes all three sets of vectors as linear functions of the inputs.

<div class="fig figcenter fighighlight">
<img src="/assets/att/comparison.png" width="60%">
</div>

<a name='masked'></a>

#### Masked Self-Attention Layers

While the positional encoding layer integrates some positional information, in more critical applications it may be necessary to give the model a clearer notion of the relative input ordering and to prevent it from *looking ahead* at future vectors. To this end, the *masked* self-attention layer is used: it explicitly sets the entries of the alignment matrix that correspond to future inputs (the entries below the diagonal, under the indexing convention above) to negative infinity, so that those vectors are ignored while the model processes earlier vectors.

<div class="fig figcenter fighighlight">
<img src="/assets/att/masked.png" width="30%">
</div>

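A sketch of the masking step in numpy, reusing the column-wise score convention from earlier: entry \\(e_{i,j}\\) scores input \\(i\\) against query \\(j\\), so entries with \\(i > j\\) refer to future inputs. The weights are random stand-ins for learned parameters.

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """Self-attention in which each position can only attend to itself and the past."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    E = (K @ Q.T) / np.sqrt(K.shape[1])                  # alignment scores e[i, j], (N, N)
    mask = np.tril(np.ones_like(E, dtype=bool), k=-1)    # strictly lower triangle: i > j (future)
    E = np.where(mask, -np.inf, E)                       # masked scores become -infinity
    E -= E.max(axis=0, keepdims=True)                    # numerical stability
    A = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True) # masked entries get zero attention
    return A.T @ V
```
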
<a name='multihead'></a>

#### Multi-Head Self-Attention Layers

Yet another way to increase the expressivity of the model is to use *multi-head* attention. Instead of using a single self-attention layer, multi-head attention runs multiple attention layers in parallel and combines their outputs. In some cases, to keep the total amount of computation constant, the key and value dimensions \\(D_k, D_v\\) of each head are reduced accordingly. The benefit of using multiple attention heads is that it allows the model to focus on different aspects of the input vectors.

<div class="fig figcenter fighighlight">
<img src="/assets/att/multihead.png" width="60%">
</div>

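A minimal numpy sketch of multi-head self-attention: each head has its own projection matrices with reduced per-head dimensions, and the head outputs are concatenated. Real implementations typically also add a final output projection, omitted here for brevity.

```python
import numpy as np

def multi_head_self_attention(X, heads):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        E = (K @ Q.T) / np.sqrt(K.shape[1])       # per-head alignment scores
        E -= E.max(axis=0, keepdims=True)
        A = np.exp(E) / np.exp(E).sum(axis=0, keepdims=True)
        outputs.append(A.T @ V)                   # per-head outputs, (N, D_v_head)
    return np.concatenate(outputs, axis=1)        # concatenate along the feature dimension

# Example: 4 heads with reduced per-head dimensions (D_k = D_v = 4 instead of 16).
rng = np.random.default_rng(0)
D_x, D_head, num_heads = 16, 4, 4
heads = [tuple(rng.normal(size=(D_x, D_head)) * 0.1 for _ in range(3)) for _ in range(num_heads)]
Y = multi_head_self_attention(rng.normal(size=(8, D_x)), heads)
assert Y.shape == (8, num_heads * D_head)
```
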
<a name='summary'></a>

### Summary

To summarize this section:

- We motivated and introduced the **attention** layer, a new primitive that is now ubiquitous in deep learning.
- We introduced it in its general formulation and, in particular, studied the details of the **align and attend** operations.
- We then specialized to the case of a **self-attention** layer.
- We learned that self-attention layers are **permutation-equivariant** with respect to their input vectors.
- To retain some positional information, self-attention layers use a **positional-encoding** function.
- Moreover, we studied two extensions of the vanilla self-attention layer: the **masked** attention layer and **multi-head** attention. While the former prevents the model from looking ahead, the latter increases its expressivity.

<a name='resources'></a>

### Additional Resources

- [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](http://proceedings.mlr.press/v37/xuc15.pdf) presents an application of the attention layer to image captioning.
- [Women also Snowboard: Overcoming Bias in Captioning Models](https://arxiv.org/pdf/1803.09797.pdf) exploits the attention layer to detect gender bias in image captioning models.
- [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/pdf/1409.0473.pdf) applies attention to natural language translation.
- [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf) is the seminal paper on attention-based Transformers, which took the Vision and NLP communities by storm.

