Commit 536c6e4: Added course notes on transformers (1 parent: 648689b)

7 files changed, +176 lines: transformers.md plus six images under assets/att/.

transformers.md
Table of Contents:

- [Transformers Overview](#overview)
- [Why Transformers?](#why)
- [Multi-Headed Attention](#multihead)
- [Multi-Headed Attention Tips](#tips)
- [Transformer Steps: Encoder-Decoder](#steps)

<a name='overview'></a>

### Transformer Overview

In ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762), Vaswani et al. introduced the Transformer, an architecture that enables parallel computation and lets models learn long-range dependencies, thereby addressing two key issues with RNNs: their slow training and their difficulty encoding long-range dependencies. Transformers are highly scalable and highly parallelizable, allowing for faster training, larger models, and better performance across vision and language tasks. Transformers are beginning to replace RNNs and LSTMs and may soon replace convolutions as well.

<a name='why'></a>

### Why Transformers?

- Transformers are great for working with long input sequences, since the attention calculation looks at all inputs. In contrast, RNNs struggle to encode long-range dependencies; LSTMs do much better at capturing them by using input, output, and forget gates.
- Transformers can operate over unordered sets, or over ordered sequences by adding positional encodings to the inputs. In contrast, RNNs/LSTMs expect an ordered sequence of inputs.
- Transformers use parallel computation: the alignment and attention scores for all inputs can be computed at once. In contrast, RNNs/LSTMs compute sequentially, since the hidden state at the current timestep can only be computed after the previous states, which often makes them slow to train.

<a name='multihead'></a>

### Multi-Headed Attention

Let’s refresh our concepts from the attention unit to help us with transformers.
<br>

- **Dot-Product Attention:**

<div class="fig figcenter fighighlight">
<img src="/assets/att/dotproduct.png" width="80%">
<div class="figcaption">Dot-Product Attention</div>
</div>
<br>
We have a query q (D,), value vectors {v_1,...,v_n} with each v_i (D,), key vectors {k_1,...,k_n} with each k_i (D,), attention weights a_i, and output c (D,).
The attention weights come from a softmax over the dot products of the query with each key, a_i = exp(q . k_i) / sum_j exp(q . k_j), and the output is a weighted average over the value vectors, c = sum_i a_i v_i.

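As a concrete illustration, here is a minimal sketch of single-query dot-product attention in PyTorch. The function name and tensor names are ours, not from the notes:

```python
import torch

def dot_product_attention(q, keys, values):
    """Single-query dot-product attention.

    q:      (D,)   query vector
    keys:   (n, D) key vectors k_1..k_n
    values: (n, D) value vectors v_1..v_n
    returns c: (D,) weighted average of the values
    """
    scores = keys @ q                  # (n,) dot products q . k_i
    a = torch.softmax(scores, dim=0)   # (n,) attention weights a_i
    c = a @ values                     # (D,) output c = sum_i a_i v_i
    return c

# Example: n = 4 keys/values of dimension D = 8
q = torch.randn(8)
keys, values = torch.randn(4, 8), torch.randn(4, 8)
print(dot_product_attention(q, keys, values).shape)  # torch.Size([8])
```
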
- **Self-Attention:** we derive the values, keys, and queries from the input

<div class="fig figcenter fighighlight">
<img src="/assets/att/vkq.png" width="80%">
<div class="figcaption">Value, Key, and Query</div>
</div>
<br>
Combining the above two ideas, we can now implement multi-headed scaled dot-product attention for transformers.

- **Multi-Headed Scaled Dot Product Attention:** We learn parameter matrices V_i, K_i, Q_i (each DxD) for every head i, which increases the model’s expressivity by letting different heads attend to different parts of the input. We also apply a scaling term (1/sqrt(d/h)) to the dot-product attention described above in order to reduce the effect of large-magnitude vectors.

<div class="fig figcenter fighighlight">
<img src="/assets/att/softmax.png" width="80%">
<div class="figcaption">Multi-Headed Scaled Dot Product Attention</div>
</div>
<br>
We can then apply dropout, compute the output of the attention layer, and finally apply a linear transformation to that output, which allows the model to learn the relationships between heads and further improves its expressivity.

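To make the shapes concrete, here is a minimal, self-contained sketch of multi-headed scaled dot-product self-attention in PyTorch. It is not the notes’ reference implementation: the class name and hyperparameters (embed_dim, num_heads, dropout rate) are illustrative, and applying dropout to the attention weights is one common design choice.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Sketch of multi-headed scaled dot-product self-attention."""

    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads       # D // H
        # Learned projections for queries, keys, values (each D x D)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        # Final linear layer that mixes information across the heads
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        N, S, D = x.shape
        H, Dh = self.num_heads, self.head_dim
        # Project, then split D into H heads: (N, S, D) -> (N, H, S, D//H)
        q = self.q_proj(x).reshape(N, S, H, Dh).permute(0, 2, 1, 3)
        k = self.k_proj(x).reshape(N, S, H, Dh).permute(0, 2, 1, 3)
        v = self.v_proj(x).reshape(N, S, H, Dh).permute(0, 2, 1, 3)
        # Scaled dot-product attention: scale by 1/sqrt(D/H)
        scores = (q @ k.transpose(-2, -1)) / (Dh ** 0.5)    # (N, H, S, S)
        attn = self.dropout(torch.softmax(scores, dim=-1))  # dropout on weights
        out = attn @ v                                       # (N, H, S, D//H)
        # Permute so H and D//H are adjacent, then merge them back into D
        out = out.permute(0, 2, 1, 3).reshape(N, S, D)
        return self.out_proj(out)

x = torch.randn(2, 5, 64)                        # (N=2, S=5, D=64)
mhsa = MultiHeadSelfAttention(64, num_heads=8)
print(mhsa(x).shape)                             # torch.Size([2, 5, 64])
```
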
<a name='tips'></a>

### Step-by-Step Multi-Headed Attention with Intermediate Dimensions

There’s a lot happening throughout multi-headed attention, so hopefully this chart helps clarify the intermediate steps and how the dimensions change after each one!

<div class="fig figcenter fighighlight">
<img src="/assets/att/multiheadgraph.PNG" width="80%">
<div class="figcaption">Step-by-Step Multi-Headed Attention with Intermediate Dimensions</div>
</div>

### A couple of tips on Permute and Reshape

To create the multiple heads, we divide the embedding dimension by the number of heads and use Reshape (for example, Reshape lets us go from shape (N x S x D) to (N x S x H x D//H)). Note that Reshape doesn’t change the ordering of your data; it simply takes the original data and ‘reshapes’ it into the dimensions you provide. We use Permute (or Transpose) to rearrange the ordering of the dimensions (for example, Permute lets us rearrange (N x S x H x D//H) into (N x H x S x D//H)).

Notice why we need to Permute before Reshaping after the final MatMul operation: the tensor at that point has shape (N x H x S x D//H), but to reshape it to (N x S x D) we first need the H and D//H dimensions to sit right next to each other, precisely because Reshape doesn’t reorder the data. So we Permute from (N x H x S x D//H) to (N x S x H x D//H) first, and then Reshape to (N x S x D). A small sketch of these shape transitions follows.

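Here is a tiny PyTorch sketch of those shape transitions (the sizes N, S, D, H are arbitrary example values):

```python
import torch

N, S, D, H = 2, 5, 64, 8           # batch, sequence length, embed dim, heads

x = torch.randn(N, S, D)
# Split heads: Reshape keeps the data order, it only regroups D into H x D//H
x = x.reshape(N, S, H, D // H)      # (N, S, H, D//H)
# Permute actually reorders the dimensions so each head acts like its own batch
x = x.permute(0, 2, 1, 3)           # (N, H, S, D//H)

# ... per-head attention would happen here ...

# After the final MatMul the tensor is (N, H, S, D//H). To get back to
# (N, S, D) we must first Permute so H and D//H are adjacent, then Reshape.
x = x.permute(0, 2, 1, 3)           # (N, S, H, D//H)
x = x.reshape(N, S, D)              # (N, S, D)
print(x.shape)                      # torch.Size([2, 5, 64])
```
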
<a name='steps'></a>

### Transformer Steps: Encoder-Decoder

### Encoder Block

The role of the Encoder block is to encode all the image features (where the spatial features are extracted with a pretrained CNN) into a set of context vectors. The context vectors it outputs are a representation of the input sequence in a higher-dimensional space. We define the Encoder as c = T_W(z), where z is the spatial CNN features and T_W(.) is the transformer encoder. In the "Attention Is All You Need" paper, the transformer encoder is made up of N stacked encoder blocks (N = 6, D = 512).

<div class="fig figcenter fighighlight">
<img src="/assets/att/encoder.png" width="80%">
<div class="figcaption">Encoder Block</div>
</div>
<br>

Let’s walk through the steps of the Encoder block (a minimal code sketch follows the list)!

- We first take in a set of input vectors X (where each input vector might represent a word, for instance).
- We then add positional encoding to the input vectors.
- We pass the positionally encoded vectors through the **Multi-head self-attention layer** (where each vector attends over all the other vectors). The output of this layer gives us a set of context vectors.
- We have a Residual Connection after the Multi-head self-attention layer, which allows us to bypass the attention layer if it’s not needed.
- We then apply Layer Normalization to the output, which normalizes each individual vector.
- We then apply an MLP to each vector individually.
- We then have another Residual Connection.
- A final Layer Normalization is applied to the output.
- And finally the set of context vectors C is output!

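The sketch below wires these steps together in PyTorch. It is a simplified, post-norm illustration, not the notes’ reference code: it reuses `nn.MultiheadAttention`, assumes positional encoding has already been added to the input, and the layer sizes are just example values.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention and MLP, each with Add & Norm."""

    def __init__(self, embed_dim=512, num_heads=8, mlp_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                               batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim),
                                 nn.ReLU(),
                                 nn.Linear(mlp_dim, embed_dim))
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (N, S, D), assumed to already include positional encoding
        attn_out, _ = self.self_attn(x, x, x)   # each vector attends to all others
        x = self.norm1(x + attn_out)            # residual connection + LayerNorm
        x = self.norm2(x + self.mlp(x))         # MLP per vector + Add & Norm
        return x                                # set of context vectors C

x = torch.randn(2, 10, 512)                     # (N, S, D)
print(EncoderBlock()(x).shape)                  # torch.Size([2, 10, 512])
```
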
### Decoder Block

The Decoder block takes in the set of context vectors C output by the Encoder block and a set of input vectors X, and outputs a set of vectors Y that defines the output sequence. We define the Decoder as y_t = T_D(y_{0:t-1}, c), where T_D(.) is the transformer decoder. In the "Attention Is All You Need" paper, the transformer decoder is made up of N stacked decoder blocks (N = 6, D = 512).

<div class="fig figcenter fighighlight">
<img src="/assets/att/decoder.png" width="80%">
<div class="figcaption">Decoder Block</div>
</div>
<br>

Let’s walk through the steps of the Decoder block (a minimal code sketch follows the list)!

- We take in the set of input vectors X and the context vectors C (output by the Encoder block).
- We then add positional encoding to the input vectors X.
- We pass the positionally encoded vectors through the **Masked Multi-head self-attention layer**. The mask ensures that we only attend over previous inputs.
- We have a Residual Connection after this layer, which allows us to bypass the attention layer if it’s not needed.
- We then apply Layer Normalization to the output, which normalizes each individual vector.
- Then we pass the output through another **Multi-head attention layer**, which takes in the context vectors output by the Encoder block as well as the output of the Layer Normalization. In this step the Key comes from the set of context vectors C, the Value comes from the set of context vectors C, and the Query comes from the output of the Layer Normalization step.
- We then have another Residual Connection.
- Apply another Layer Normalization.
- Apply an MLP to each vector individually.
- Another Residual Connection.
- A final Layer Normalization.
- And finally we pass the output through a Fully-connected layer, which produces the final set of output vectors Y, i.e. the output sequence.

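Continuing the same style of sketch, a simplified decoder block might look like the following. The causal mask and the Query/Key/Value routing for the cross-attention step mirror the list above; the layer sizes are illustrative, positional encoding is assumed to be added beforehand, and the final fully-connected layer that produces Y is left out.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, cross-attention, MLP."""

    def __init__(self, embed_dim=512, num_heads=8, mlp_dim=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim),
                                 nn.ReLU(),
                                 nn.Linear(mlp_dim, embed_dim))

    def forward(self, x, context):
        # x: (N, T, D) positionally encoded inputs; context: (N, S, D) from encoder
        T = x.shape[1]
        # Causal mask: True marks positions we are NOT allowed to attend to
        mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
        attn_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)             # Add & Norm
        # Cross-attention: Query from x, Key and Value from the context vectors C
        cross_out, _ = self.cross_attn(x, context, context)
        x = self.norm2(x + cross_out)            # Add & Norm
        x = self.norm3(x + self.mlp(x))          # MLP per vector + Add & Norm
        return x

x, context = torch.randn(2, 7, 512), torch.randn(2, 10, 512)
print(DecoderBlock()(x, context).shape)          # torch.Size([2, 7, 512])
```
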
### Additional Notes on Layer Normalization and MLP

**Layer Normalization:** As seen above, we use Layer Normalization after the Residual Connections in both the Encoder and Decoder blocks. Recall that in Layer Normalization we normalize across the feature dimension (so here we apply LayerNorm over the image features). Using Layer Normalization at these points helps prevent vanishing or exploding gradients, helps stabilize the network, and can reduce training time.

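As a quick illustration of normalizing across the feature dimension (the shapes here are just example values):

```python
import torch
import torch.nn as nn

N, S, D = 2, 10, 512
x = torch.randn(N, S, D)

# LayerNorm over the feature dimension D: every individual vector is
# normalized to zero mean / unit variance, then scaled and shifted.
layer_norm = nn.LayerNorm(D)
out = layer_norm(x)
print(out.shape)                          # torch.Size([2, 10, 512])
print(out[0, 0].mean(), out[0, 0].std())  # approximately 0 and 1
```
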
**MLP:** Both the Encoder and Decoder blocks contain position-wise fully-connected feed-forward networks, which are “applied to each position separately and identically” (Vaswani et al.). The linear transformations use different parameters from layer to layer: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. Additionally, the combination of a self-attention layer and a point-wise feed-forward layer requires less computation than the convolutional layers it replaces.

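That formula maps directly to two linear layers with a ReLU in between; a minimal sketch (the hidden width of 2048 follows the paper’s base model and is otherwise arbitrary):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied at every position."""

    def __init__(self, embed_dim=512, hidden_dim=2048):
        super().__init__()
        self.linear1 = nn.Linear(embed_dim, hidden_dim)   # W_1, b_1
        self.linear2 = nn.Linear(hidden_dim, embed_dim)   # W_2, b_2

    def forward(self, x):
        # x: (N, S, D); the same weights are applied to every position
        return self.linear2(torch.relu(self.linear1(x)))

x = torch.randn(2, 10, 512)
print(PositionwiseFFN()(x).shape)   # torch.Size([2, 10, 512])
```
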
### Additional Resources

Additional resources related to implementation:

- ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)
