
Commit 4f2edfd

master: How attention works internally.
1 parent 3ae712e commit 4f2edfd

File tree

5 files changed: +54 -1 lines changed

Chapter-wise code/Code - PyTorch/7. Attention Models/1. NMT/Readme.md

Lines changed: 54 additions & 1 deletion
@@ -27,7 +27,8 @@ When performing word alignment, your model needs to be able to identify relation
In a model that has a vector for each input, there needs to be a way to focus more attention in the right places. Many languages don't translate word-for-word into another language. To be able to align the words correctly, you need to add a layer that helps the decoder understand which inputs are more important for each prediction.<br>

<img src="./images/4. alignment and attention.png" width="70%"></img> <br><br>
## Attention and Alignment

Attention is an additional layer that lets the model focus on what's important for each prediction.
Below is a step-by-step attention algorithm for NMT:

1. *Prepare the encoder hidden states and the decoder hidden state.*

2. *Score each encoder hidden state by taking the dot product between it and the decoder hidden state.*<br>
@@ -37,3 +38,55 @@ Below is a step-by-step algorithm for NMT:
5. *Now just add up everything in the alignments vector to arrive at what's called the context vector, which is then fed to the decoder (a sketch of these steps follows the figure below).*
<img src="./images/5. Calculating alignment for NMT model.png"></img> <br><br>
## Information Retrieval via Attention
TL;DR: Attention takes in a query, selects the place where a matching key is most likely to be found, and then retrieves that key's value.
1. The attention mechanism uses encoded representations of both the inputs (the encoder hidden states) and the outputs (the decoder hidden states).
2. The keys and values come in pairs, both of dimension N, where N is the input sequence length; they come from the encoder hidden states.
3. The queries come from the decoder hidden states.
51+
52+
4. Both the key-value pairs and the queries enter the attention layer from their places on opposite ends of the model.
5. Once inside, the dot product of the query and each key is calculated as a measure of similarity between them. Keep in mind that the dot product of similar vectors tends to have a higher value.
6. The weight given to each value is determined by the probability (the scores run through a softmax function) that its key matches the query.
7. Then the query is mapped to the next key-value pair, and so on. Because the dot products are also scaled down by the square root of the key dimension, this is called *scaled dot-product attention* (sketched in code after the figure below).<br><br>
<img src="./images/6. Inside attention layer.png" width="50%"></img><br><br>
## Attention Visualization
Consider a matrix where the words of the query (Q) are represented by the rows and the words of the keys (K) are represented by the columns.
<br>

The value score (V) is assigned based on the closeness of the match between each query word and key word (see the sketch below the figures).<br>
```
Attention = Softmax(QK^T)V
```
<br><br>
<img src="./images/7. attention visual - 1.png" width="50%"></img> <img src="./images/8. NMT with attention" width="50%"></img> <br><br>
### Flexible Attention
In a situation (shown below) where the grammar of one language requires a different word order than the other, attention is flexible enough to find the connection. <br><br>
<img src="./images/9. flexible attention.png" width="50%"></img><br><br>
The first four tokens, *the agreements on the*, are pretty straightforward, but then the grammatical structure between French and English changes.
