
Commit c216efc (parent 7721e82)

master: multi-head attention overview.

File tree (1 file changed: +5, -5 lines)
  • Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models

@@ -1,13 +1,13 @@
# Multi-Head Attention

1. Input to multi-head attention is a set of 3 values: Queries, Keys, and Values.<br><br>
-<img src="../images/26. step -1.png" width="50%"></img><br>
+<img src="../images/26. step - 1.png" width="50%"></img><br>
2. To achieve the multiple parallel lookups, you first apply a fully-connected, dense linear layer to each query, key, and value. This layer creates the representations for the parallel attention heads.<br><br>
-<img src="../images/27. step -2.png" width="50%"></img><br>
+<img src="../images/27. step - 2.png" width="50%"></img><br>
3. Here, you split these vectors into the number of heads and perform attention on them as if each head were different.<br><br>
4. Then the results of the attention heads are concatenated back together.<br><br>
-<img src="../images/28. step -3.png" width="50%"></img><br>
+<img src="../images/28. step - 3.png" width="50%"></img><br>
5. Finally, the concatenated attention is put through a final fully-connected layer.<br><br>
-<img src="../images/29. step -4.png" width="50%"></img><br>
+<img src="../images/29. step - 4.png" width="50%"></img><br>
6. The scaled dot-product attention used here is the same as dot-product attention except for the scale factor, one over the square root of d_k, where d_k is the key/query dimension. This normalization prevents the softmax gradients from becoming extremely small when large values of d_k are used.<br><br>
-<img src="../images/30. step -5.png" width="50%"></img><br>
+<img src="../images/30. step - 5.png" width="50%"></img><br>
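The diff above only touches image paths, but the steps it documents map directly onto code. Below is a minimal PyTorch sketch of steps 1–6 (linear projections, head split, scaled dot-product attention, concatenation, final projection). The class name, `d_model`, `n_heads`, and the example shapes are illustrative assumptions, not taken from the repository's code.

```python
# Minimal sketch of multi-head attention; names and sizes are illustrative.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def scaled_dot_product_attention(q, k, v):
    # Step 6: dot-product attention scaled by 1/sqrt(d_k) so the softmax
    # gradients do not become extremely small for large d_k.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Step 2: dense linear layers on queries, keys, and values create
        # the representations for the parallel attention heads.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        # Step 5: final fully-connected layer applied to the concatenated heads.
        self.w_o = nn.Linear(d_model, d_model)

    def _split_heads(self, x):
        # Step 3: reshape (batch, seq, d_model) -> (batch, heads, seq, d_head).
        batch, seq, _ = x.shape
        return x.view(batch, seq, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, queries, keys, values):
        # Step 1: the inputs are the queries, keys, and values.
        q = self._split_heads(self.w_q(queries))
        k = self._split_heads(self.w_k(keys))
        v = self._split_heads(self.w_v(values))
        # Step 3 (cont.): attention runs per head, as if each head were separate.
        heads = scaled_dot_product_attention(q, k, v)
        # Step 4: concatenate the heads back together.
        batch, _, seq, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(batch, seq, -1)
        # Step 5: final linear projection.
        return self.w_o(concat)


# Example usage (assumed shapes): 2 sequences, 10 tokens, model size 64, 4 heads.
x = torch.randn(2, 10, 64)
mha = MultiHeadAttention(d_model=64, n_heads=4)
out = mha(x, x, x)   # self-attention: queries = keys = values
print(out.shape)     # torch.Size([2, 10, 64])
```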
