# Dot Product Attention

The steps below describe in detail how *dot-product attention* works:

*Important: the queries are the German words.*

1. Consider the English phrase *"I am happy"*.
First, the word *I* is embedded to obtain a vector representation of continuous values that is unique to each word.
<img src="../images/7.step - 1.png"></img><br>

2. By feeding the embedding through three distinct linear layers, you get three different vectors: a query, a key and a value (a minimal NumPy sketch follows the figure below).<br><br>
<img src="../images/8. step - 2.png"></img><br>
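
As a rough illustration of steps 1 and 2, here is a minimal NumPy sketch. The vocabulary, the embedding table and the weight matrices `W_q`, `W_k`, `W_v` are made-up placeholders for this example, not values taken from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                          # toy embedding / projection sizes (assumed)
vocab = {"I": 0, "am": 1, "happy": 2}
E = rng.normal(size=(len(vocab), d_model))   # toy embedding table

# Three distinct linear layers (plain weight matrices, biases omitted)
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

x_i = E[vocab["I"]]                                # step 1: embed the word "I"
q_i, k_i, v_i = x_i @ W_q, x_i @ W_k, x_i @ W_v    # step 2: query, key and value vectors
```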

3. Then you can do the same for the word *am* to produce a second set of vectors.<br><br>
<img src="../images/9. step - 3.png"></img><br>

4. Finally, the word *happy* is embedded and projected to get a third set, forming the queries (Q), keys (K) and values (V) matrices.<br><br>
<img src="../images/10. step - 4.png"></img><br>
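
Continuing the same toy sketch (and reusing the placeholder embedding table and weights defined above), steps 3 and 4 repeat the projection for *am* and *happy* and stack the results into one matrix per role, with one row per word:

```python
# Steps 3-4: embed every word of "I am happy" and project each one,
# stacking the results row by row into the Q, K and V matrices.
X = E[[vocab["I"], vocab["am"], vocab["happy"]]]   # shape (3, d_model)

Q = X @ W_q   # queries, shape (3, d_k)
K = X @ W_k   # keys,    shape (3, d_k)
V = X @ W_v   # values,  shape (3, d_k)
```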

5. From the Q and K matrices, the attention model calculates weights, or scores, representing the relative importance of the keys for a specific query.
<img src="../images/11. step - 5.png"></img><br>
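
For step 5, the scores can be sketched as the dot products of every query with every key. The division by the square root of the key dimension is the scaled variant of dot-product attention and is an assumption added here, not something stated above:

```python
# Step 5: entry (i, j) is the dot product of query i with key j.
# The 1/sqrt(d_k) scaling comes from scaled dot-product attention and is optional.
scores = Q @ K.T / np.sqrt(d_k)   # shape (3, 3)
```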

6. These attention weights can be understood as alignment scores, since they come from a dot product.<br><br>
<img src="../images/12. step - 6.png"></img><br>

7. To turn these weights into probabilities, a softmax function is applied.<br><br>
<img src="../images/13. step - 7.png"></img><br>
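
For step 7, continuing the same sketch, a minimal row-wise softmax over the score matrix might look like this:

```python
def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores)   # step 7: each row of weights now sums to 1
```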

8. Finally, multiplying these probabilities by the values gives a weighted sequence, which is the attention result itself; a complete sketch combining all the steps follows the figure.<br><br>
<img src="../images/14. step - 8.png"></img><br>
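
Putting everything together, a self-contained sketch of dot-product attention (again with the optional scaling) could look as follows; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V, one weighted sum of values per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # steps 5-6: alignment scores
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)           # step 7: probabilities
    return weights @ V                                    # step 8: weighted sequence

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # three toy queries, e.g. one per word of "I am happy"
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = dot_product_attention(Q, K, V)
print(out.shape)              # (3, 4): one attention result per query
```

Each row of `out` is the weighted sum of the value vectors for one query, i.e. the attention result described in step 8.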