
Commit 4f2edfd

master: How attention works internally.
1 parent 3ae712e commit 4f2edfd

File tree

5 files changed: +54 -1 lines changed

Chapter-wise code/Code - PyTorch/7. Attention Models/1. NMT/Readme.md

Lines changed: 54 additions & 1 deletion
@@ -27,7 +27,8 @@ When performing word alignment, your model needs to be able to identify relation
In a model that has a vector for each input, there needs to be a way to focus more attention in the right places. Many languages don't translate word-for-word into another language. To be able to align the words correctly, you need to add a layer that helps the decoder understand which inputs are more important for each prediction.<br>

<img src="./images/4. alignment and attention.png" width="70%"></img> <br><br>
## Attention and Alignment

Attention is an additional layer that lets the model focus on what's important for each prediction.
Below is a step-by-step attention algorithm for NMT:

1. *Prepare the encoder hidden states and the decoder hidden state.*

2. *Score each encoder hidden state by taking the dot product between it and the decoder hidden state.*<br>
@@ -37,3 +38,55 @@ Below is a step-by-step algorithm for NMT:
5. *Now just add up everything in the alignments vector to arrive at what's called the context vector, which is then fed to the decoder (a sketch of these steps follows the figure below).*
<img src="./images/5. Calculating alignment for NMT model.png"></img> <br><br>
## Information Retrieval via Attention
TL;DR: Attention takes in a query, selects the place where a matching key is most likely to be found, and then retrieves that key's value.
1. The attention mechanism uses encoded representations of both the inputs (the encoder hidden states) and the outputs (the decoder hidden states).
2. The keys and values come in pairs, both of dimension N, where N is the input sequence length; they come from the encoder hidden states.
3. The queries come from the decoder hidden states.
51+
52+
4. Both the key-value pairs and the queries enter the attention layer from their places on opposite ends of the model.
5. Once inside, the dot product of the query and each key is calculated as a measure of similarity between them. Keep in mind that the dot product of similar vectors tends to have a higher value.
6. The weight given to each value is determined by the probability (the scores run through a softmax function) that its key matches the query.
7. Then the query is mapped to the next key-value pair, and so on. Because the dot products are also scaled down by the square root of the key dimension, this is called *scaled dot-product attention* (sketched in code after the figure below).<br><br>
<img src="./images/6. Inside attention layer.png" width="50%"></img><br><br>
## Attention Visualization
Consider a matrix where the words of the query (Q) are represented by the rows and the words of the keys (K) are represented by the columns.
<br>

The value score (V) is assigned based on the closeness of the match between each query word and key word (see the sketch below the figures).<br>
```
Attention = Softmax(QK^T)V
```
<br><br>
<img src="./images/7. attention visual - 1.png" width="50%"></img> <img src="./images/8. NMT with attention" width="50%"></img> <br><br>
### Flexible Attention
In a situation (shown below) where the grammar of one language requires a different word order than the other, attention is flexible enough to find the connection. <br><br>
<img src="./images/9. flexible attention.png" width="50%"></img><br><br>
The first four tokens, *the agreements on the*, are pretty straightforward, but then the grammatical structure between French and English changes.
