When performing word alignment, your model needs to be able to identify relationships between corresponding words in the two languages.
In a model that has a vector for each input, there needs to be a way to focus more attention in the right places. Many languages don't translate exactly into another language. To be able to align the words correctly, you need to add a layer to help the decoder understand which inputs are more important for each prediction.<br>
<img src="./images/4. alignment and attention.png" width="70%"> <br><br>
## Attention and Alignment
Attention is an additional layer that lets the model focus on the parts of the input that matter most for each prediction.
Below is a step-by-step algorithm for NMT with attention:
1. *Prepare the encoder hidden states and the decoder hidden state.*
2. *Score each encoder hidden state by taking the dot product between it and the decoder hidden state.*<br>
3. *Run the scores through a softmax so that each becomes a probability — these are the alignment weights.*
4. *Multiply each encoder hidden state by its alignment weight to get the alignment vectors.*
5.*Now just add up everything in the alignments vector to arrive at what's called the context vector, which is then fed to the decoder.*
<img src="./images/5. Calculating alignment for NMT model.png"> <br><br>
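The steps above can be sketched in a few lines of PyTorch. This is a minimal illustration with made-up shapes (`seq_len`, `hidden_dim`, and the random states are assumptions, not values from the chapter):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 4 input tokens, hidden size 8.
seq_len, hidden_dim = 4, 8

# 1. Prepare the encoder hidden states and the decoder hidden state.
encoder_states = torch.randn(seq_len, hidden_dim)   # one vector per input token
decoder_state = torch.randn(hidden_dim)             # current decoder hidden state

# 2. Score each encoder state: dot product with the decoder state.
scores = encoder_states @ decoder_state             # shape: (seq_len,)

# 3. Softmax turns the scores into alignment weights.
weights = F.softmax(scores, dim=0)                  # shape: (seq_len,), sums to 1

# 4. Weight each encoder state by its alignment weight.
alignments = weights.unsqueeze(1) * encoder_states  # shape: (seq_len, hidden_dim)

# 5. Sum the alignment vectors to get the context vector fed to the decoder.
context = alignments.sum(dim=0)                     # shape: (hidden_dim,)

print(context.shape)  # torch.Size([8])
```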
## Information Retrieval via Attention
TL;DR: Attention takes in a query, selects the places where a matching key is most likely to be found, and then retrieves that key.
1. The attention mechanism uses encoded representations of both the input (the encoder hidden states) and the outputs (the decoder hidden states).
2. The keys and values come in pairs, both of dimension N, where N is the input sequence length; they are derived from the encoder hidden states.
3. The queries come from the decoder hidden states.
4. Both the key-value pairs and the query enter the attention layer from opposite ends of the model.
5. Once inside, the dot product of the query and the key is calculated (a measure of similarity between the key and the query). Keep in mind that the dot product of similar vectors tends to be higher.
6. The weight given to each value is determined by the probability (obtained by running the scores through a softmax) that its key matches the query.
7. Then the query is mapped to the next key-value pair, and so on. This is called *scaled dot product attention*.<br><br>
Consider a score matrix where the words of the query (Q) index the rows and the words of the keys (K) index the columns, with each key paired with a value (V).
<br>
The value score (V) is assigned based on the closeness of the match.<br>
```
Attention = Softmax(QK^T)V
```
<br><br>
<img src="./images/7. attention visual - 1.png" width="50%"> <img src="./images/8. NMT with attention" width="50%"> <br><br>
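As a sketch, the formula above maps directly onto a few tensor operations in PyTorch. The sizes here (3 query words, 5 key/value words, dimension 8) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: 3 query words, 5 key/value words, model dimension 8.
n_q, n_k, d = 3, 5, 8
Q = torch.randn(n_q, d)   # queries (from the decoder hidden states)
K = torch.randn(n_k, d)   # keys    (from the encoder hidden states)
V = torch.randn(n_k, d)   # values  (from the encoder hidden states)

# Attention = Softmax(QK^T)V
scores = Q @ K.T                   # (n_q, n_k): similarity of each query to each key
probs = F.softmax(scores, dim=-1)  # each row is a probability distribution over keys
attention = probs @ V              # (n_q, d): weighted sum of values per query

print(attention.shape)  # torch.Size([3, 8])
```

Note that the "scaled" variant of dot product attention also divides `scores` by `sqrt(d)` before the softmax; the formula as written in this section omits that scaling factor.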
### Flexible Attention
In a situation (as shown below) where the grammar of a foreign language requires a different word order from the source language, attention is flexible enough to find the connection. <br><br>