Table of Contents:

- [Transformer Overview](#overview)
- [Why Transformers?](#why)
- [Multi-Headed Attention](#multihead)
- [Multi-Headed Attention Tips](#tips)
- [Transformer Steps: Encoder-Decoder](#steps)

<a name='overview'></a>

### Transformer Overview

In ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762), Vaswani et al. introduced the Transformer, which
brings parallelism and enables models to learn long-range dependencies, thereby addressing two key issues with
RNNs: their slow training and their difficulty in encoding long-range dependencies. Transformers are highly
scalable and highly parallelizable, allowing for faster training, larger models, and better performance across vision
and language tasks. Transformers are beginning to replace RNNs and LSTMs and may soon replace convolutions as well.

<a name='why'></a>

### Why Transformers?

- Transformers are great for working with long input sequences since the attention calculation looks at all inputs. In
  contrast, RNNs struggle to encode long-range dependencies; LSTMs capture them better by using input, output, and
  forget gates, but still process the sequence step by step.
- Transformers can operate over unordered sets, or over ordered sequences by adding positional encodings to the inputs.
  In contrast, RNNs/LSTMs expect an ordered sequence of inputs.
- Transformers use parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
  In contrast, RNNs/LSTMs use sequential computation, since the hidden state at the current timestep can only be
  computed after the previous states are calculated, which often makes them slow to train.

<a name='multihead'></a>

### Multi-Headed Attention

Let’s refresh our concepts from the attention unit to help us with transformers.
<br>

- **Dot-Product Attention:**

<div class="fig figcenter fighighlight">
  <img src="/assets/att/dotproduct.png" width="80%">
  <div class="figcaption">Dot-Product Attention</div>
</div>
<br>
We have a query q of shape (D,), value vectors {v_1,...,v_n} where each v_i has shape (D,), key vectors {k_1,...,k_n} where each k_i has shape (D,), attention weights a_i, and an output c of shape (D,).
The output is a weighted average over the value vectors, where the weights come from a softmax over the query-key dot products.
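
As a small illustration, here is a minimal sketch of single-query dot-product attention in PyTorch (the function and variable names are just for this example):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(q, K, V):
    """q: query of shape (D,); K, V: keys/values stacked into (n, D) matrices."""
    scores = K @ q                # (n,) dot product of the query with every key
    a = F.softmax(scores, dim=0)  # (n,) attention weights that sum to 1
    return a @ V                  # (D,) weighted average of the value vectors

# Example with n = 4 key/value vectors of dimension D = 8
q, K, V = torch.randn(8), torch.randn(4, 8), torch.randn(4, 8)
c = dot_product_attention(q, K, V)  # shape (8,)
```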

- **Self-Attention:** we derive values, keys, and queries from the input.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/vkq.png" width="80%">
  <div class="figcaption">Value, Key, and Query</div>
</div>
<br>
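
In code, deriving the queries, keys, and values from the same input might look like the following sketch (W_q, W_k, and W_v stand in for learned projection matrices; the names are illustrative):

```python
import torch

N, S, D = 2, 5, 8          # batch size, sequence length, embedding dimension
X = torch.randn(N, S, D)   # input vectors

# Learned projection matrices (random placeholders here)
W_q, W_k, W_v = torch.randn(D, D), torch.randn(D, D), torch.randn(D, D)

Q = X @ W_q   # queries, shape (N, S, D)
K = X @ W_k   # keys,    shape (N, S, D)
V = X @ W_v   # values,  shape (N, S, D)
```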

Combining the above two, we can now implement multi-headed scaled dot-product attention for transformers.

- **Multi-Headed Scaled Dot Product Attention:** We learn parameter matrices V_i, K_i, Q_i (each DxD) for each head i,
  which increases the model’s expressivity by letting different heads attend to different parts of the input. We apply a
  scaling term (1/sqrt(D/H)) to the dot-product attention described previously in order to reduce the effect of
  large-magnitude vectors.

<div class="fig figcenter fighighlight">
  <img src="/assets/att/softmax.png" width="80%">
  <div class="figcaption">Multi-Headed Scaled Dot Product Attention</div>
</div>
<br>
We can then apply dropout and generate the output of the attention layer. Finally, we add a linear transformation to the
output of the attention operation, which allows the model to learn the relationships between heads and thereby improves
the model’s expressivity.
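
Putting these pieces together, a minimal multi-headed attention module might look like the sketch below (a rough implementation for illustration, with hypothetical module and parameter names, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h = num_heads
        self.query = nn.Linear(embed_dim, embed_dim)  # all heads' Q_i, stacked
        self.key = nn.Linear(embed_dim, embed_dim)    # all heads' K_i, stacked
        self.value = nn.Linear(embed_dim, embed_dim)  # all heads' V_i, stacked
        self.proj = nn.Linear(embed_dim, embed_dim)   # mixes information across heads
        self.drop = nn.Dropout(dropout)

    def forward(self, query, key, value):
        N, S, D = query.shape
        T, H = key.shape[1], self.h
        # Project, then split into H heads: (N, S, D) -> (N, H, S, D//H)
        q = self.query(query).reshape(N, S, H, D // H).permute(0, 2, 1, 3)
        k = self.key(key).reshape(N, T, H, D // H).permute(0, 2, 1, 3)
        v = self.value(value).reshape(N, T, H, D // H).permute(0, 2, 1, 3)
        # Scaled dot-product attention with scaling term 1/sqrt(D/H)
        scores = q @ k.transpose(-2, -1) / (D // H) ** 0.5   # (N, H, S, T)
        attn = self.drop(F.softmax(scores, dim=-1))
        out = attn @ v                                       # (N, H, S, D//H)
        # Recombine heads: permute H next to D//H, then reshape back to (N, S, D)
        out = out.permute(0, 2, 1, 3).reshape(N, S, D)
        return self.proj(out)                                # final linear transformation
```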

<a name='tips'></a>

### Step-by-Step Multi-Headed Attention with Intermediate Dimensions

There's a lot happening in multi-headed attention, so hopefully this chart helps clarify the intermediate steps and how
the dimensions change after each one!

<div class="fig figcenter fighighlight">
  <img src="/assets/att/multiheadgraph.PNG" width="80%">
  <div class="figcaption">Step-by-Step Multi-Headed Attention with Intermediate Dimensions</div>
</div>

### A couple of tips on Permute and Reshape:

To create the multiple heads, we divide the embedding dimension by the number of heads and use Reshape (for example,
Reshape lets us go from shape (N x S x D) to (N x S x H x D//H)). It is important to note that Reshape doesn’t change
the ordering of your data; it simply takes the original data and ‘reshapes’ it into the dimensions you provide. We use
Permute (or Transpose) to rearrange the ordering of the dimensions (for example, Permute lets us rearrange
(N x S x H x D//H) into (N x H x S x D//H)). Notice why we needed to use Permute before Reshape after the final MatMul
operation: the tensor at that point has shape (N x H x S x D//H), but to reshape it to (N x S x D) we first need the H
and D//H dimensions to sit right next to each other, because Reshape doesn’t change the ordering of the data. Therefore
we use Permute first to rearrange (N x H x S x D//H) into (N x S x H x D//H), and then we can use Reshape to get
(N x S x D).
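
Here is a small sketch of that shape bookkeeping in PyTorch (the dimension values are arbitrary placeholders):

```python
import torch

N, S, D, H = 2, 5, 8, 4
x = torch.randn(N, S, D)

heads = x.reshape(N, S, H, D // H)   # (N, S, H, D//H): split D without reordering the data
heads = heads.permute(0, 2, 1, 3)    # (N, H, S, D//H): bring the head dimension forward

# ... per-head attention would happen here ...

out = heads.permute(0, 2, 1, 3)      # (N, S, H, D//H): put H and D//H next to each other again
out = out.reshape(N, S, D)           # (N, S, D): merge the heads back into the embedding dimension
```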

<a name='steps'></a>

### Transformer Steps: Encoder-Decoder

### Encoder Block

The role of the Encoder block is to encode all the image features (where the spatial features are extracted using a
pretrained CNN) into a set of context vectors. The context vectors it outputs are a representation of the input sequence
in a higher-dimensional space. We define the Encoder as c = T_W(z), where z is the spatial CNN features and T_W(.) is
the transformer encoder. In the "Attention Is All You Need" paper, the transformer encoder is a stack of N encoder
blocks (N = 6, D = 512).

<div class="fig figcenter fighighlight">
  <img src="/assets/att/encoder.png" width="80%">
  <div class="figcaption">Encoder Block</div>
</div>
<br>

Let’s walk through the steps of the Encoder block (a rough code sketch follows the list)!

- We first take in a set of input vectors X (where each input vector might represent a word, for instance).
- We then add positional encoding to the input vectors.
- We pass the positionally encoded vectors through the **Multi-head self-attention layer** (where each vector attends to
  all the other vectors). The output of this layer gives us a set of context vectors.
- We have a Residual Connection after the Multi-head self-attention layer, which allows us to bypass the attention layer
  if it’s not needed.
- We then apply Layer Normalization on the output, which normalizes each individual vector.
- We then apply an MLP over each vector individually.
- We then have another Residual Connection.
- A final Layer Normalization is applied to the output.
- And finally the set of context vectors C is outputted!
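
As a rough sketch (not the exact architecture or hyperparameters from the paper), one encoder block following these steps might look like this, here using PyTorch's built-in nn.MultiheadAttention for brevity:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention -> residual + LayerNorm -> per-position MLP -> residual + LayerNorm."""

    def __init__(self, embed_dim, num_heads, ffn_dim, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, ffn_dim),
                                 nn.ReLU(),
                                 nn.Linear(ffn_dim, embed_dim))

    def forward(self, x):
        # x: positionally encoded input vectors X, shape (N, S, D)
        attn_out, _ = self.attn(x, x, x)   # each vector attends to all the others
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.mlp(x))    # per-position MLP, then residual + layer norm
        return x                           # the set of context vectors C
```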

### Decoder Block

The Decoder block takes in the set of context vectors C outputted from the Encoder block and a set of input vectors X,
and outputs a set of vectors Y which defines the output sequence. We define the Decoder as y_t = T_D(y_{0:t-1}, c),
where T_D(.) is the transformer decoder. In the "Attention Is All You Need" paper, the transformer decoder is a stack of
N decoder blocks (N = 6, D = 512).

<div class="fig figcenter fighighlight">
  <img src="/assets/att/decoder.png" width="80%">
  <div class="figcaption">Decoder Block</div>
</div>
<br>

Let’s walk through the steps of the Decoder block (again, a rough code sketch follows the list)!

- We take in the set of input vectors X and the context vectors C (outputted from the Encoder block).
- We then add positional encoding to the input vectors X.
- We pass the positionally encoded vectors through the **Masked Multi-head self-attention layer**. The mask ensures that
  we only attend over previous inputs.
- We have a Residual Connection after this layer, which allows us to bypass the attention layer if it’s not needed.
- We then apply Layer Normalization on the output, which normalizes each individual vector.
- Then we pass the output through another **Multi-head attention layer**, which takes in the context vectors outputted by
  the Encoder block as well as the output of the Layer Normalization. In this step the Key comes from the set of context
  vectors C, the Value comes from the set of context vectors C, and the Query comes from the output of the Layer
  Normalization step.
- We then have another Residual Connection.
- Apply another Layer Normalization.
- Apply an MLP over each vector individually.
- Another Residual Connection.
- A final Layer Normalization.
- And finally we pass the output through a Fully-connected layer, which produces the final set of output vectors Y, the
  output sequence.
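
Again as a rough sketch under the same assumptions as the encoder example (illustrative names, PyTorch's built-in attention module), one decoder block might look like:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention -> add & norm -> cross-attention over C -> add & norm -> MLP -> add & norm."""

    def __init__(self, embed_dim, num_heads, ffn_dim, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout,
                                               batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout,
                                                batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(embed_dim) for _ in range(3))
        self.mlp = nn.Sequential(nn.Linear(embed_dim, ffn_dim),
                                 nn.ReLU(),
                                 nn.Linear(ffn_dim, embed_dim))

    def forward(self, x, context):
        # x: positionally encoded inputs X, shape (N, T, D); context: encoder output C, shape (N, S, D)
        T = x.shape[1]
        # Causal mask (True = blocked) so each position only attends over previous positions
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        # Query from the decoder, Key and Value from the context vectors C
        cross_out, _ = self.cross_attn(x, context, context)
        x = self.norm2(x + cross_out)
        x = self.norm3(x + self.mlp(x))
        return x  # a final fully-connected layer maps this to the output vectors Y
```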

### Additional Notes on Layer Normalization and MLP

**Layer Normalization:** As seen in the Encoder and Decoder block implementations, we use Layer Normalization after the
Residual Connections in both the Encoder and Decoder blocks. Recall that in Layer Normalization we normalize across the
feature dimension (so we apply LayerNorm over the image features). Using Layer Normalization at these points helps
prevent issues with vanishing or exploding gradients, helps stabilize the network, and can reduce training time.
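
For instance, with inputs of shape (N, S, D), the normalization is over the last (feature) dimension:

```python
import torch
import torch.nn as nn

N, S, D = 2, 5, 8
x = torch.randn(N, S, D)
ln = nn.LayerNorm(D)   # normalizes over the feature dimension D
y = ln(x)              # each of the N*S vectors now has roughly zero mean and unit variance
```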

**MLP:** Both the Encoder and Decoder blocks contain position-wise fully-connected feed-forward networks, which are
“applied to each position separately and identically” (Vaswani et al.). The linear transformations use different
parameters across layers: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2. Additionally, the combination of a self-attention layer
and a point-wise feed-forward layer reduces the complexity that convolutional layers would otherwise require.
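
A minimal sketch of this position-wise feed-forward network, using the paper's D = 512 and inner dimension 2048:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # xW_1 + b_1
        self.w2 = nn.Linear(d_ff, d_model)   # (...)W_2 + b_2

    def forward(self, x):
        # x: (N, S, d_model); the same weights are applied at every position
        return self.w2(torch.relu(self.w1(x)))  # FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
```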

### Additional Resources

Additional resources related to implementation:

- ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)