
Commit 8310245

master: causal attention.
1 parent 153c643 commit 8310245

File tree

7 files changed: +30 −0 lines changed


Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Attention Maths.md

Lines changed: 3 additions & 0 deletions
@@ -44,3 +44,6 @@ often called *attention weights*. The shape of this matrix is `[Lq, Lk]`.<br>
## Attention Formula

<img src="../images/20. attention formula.png" width="50%"></img> <br><br>

## Next Up
Next, we will learn about different types of attention: causal attention and multi-head attention.
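The attention formula above is shown only as an image. As a reference, here is a minimal PyTorch sketch of dot-product attention in the form `softmax(Q.KTranspose) V`; writing it this way, with unscaled scores, is an assumption based on the surrounding text, since the image itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(q, k, v):
    """Dot-product attention: softmax(Q K^T) V.

    q: [Lq, d], k: [Lk, d], v: [Lk, d_v]  ->  output: [Lq, d_v]
    (A common variant also divides the scores by sqrt(d) before the softmax.)
    """
    # The attention weights have shape [Lq, Lk], as noted above.
    weights = F.softmax(q @ k.transpose(-2, -1), dim=-1)
    return weights @ v
```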
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
# Causal (self) Attention

## Overview

1. In causal attention, the queries and keys come from the same sentence. Hence the name, *self-attention*.
2. Queries in causal attention are only allowed to look at words that occurred in the past (see the sketch after the figure below).<br>
<img src="../images/21. causal attention overview.png" width="50%"></img><br><br>
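To make the rule in point 2 concrete, here is a tiny illustrative PyTorch snippet (an added sketch, not part of the committed file) that prints which key positions each query position may attend to; `L` is an arbitrary example length.

```python
import torch

L = 5  # example sequence length (arbitrary choice for illustration)

# allowed[i, j] is True when query position i may attend to key position j,
# i.e. only the current word and words in the past (j <= i).
allowed = torch.tril(torch.ones(L, L, dtype=torch.bool))
print(allowed)
```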
## Causal Attention Math
The difference between dot-product attention and causal attention is the mask matrix *M*.

1. For causal attention, you could compute the attention weights in the same way as before, `softmax(Q.KTranspose)`.
But that way, you are allowing the model to attend to words in the future.<br>
<img src="../images/22. step - 1.png" width="50%"></img><br><br>

2. To solve this issue, you add a mask matrix *M* of size L by L.
So you compute the softmax of Q times K transpose plus M.<br>
<img src="../images/23. step - 2.png" width="50%"></img><br><br>

3. When you add M to Q times K transpose, all the values on the diagonal and below, which correspond to queries attending to words in the past, are left untouched.
All other values become minus infinity. After the softmax, the minus infinities become 0, since the exponent of negative infinity is 0, so the model is prevented from attending to the future.<br>
<img src="../images/24. step - 3.png" width="50%"></img><br><br>

4. The last step is the same as in dot-product attention: multiply the attention weights by V. A sketch of the full computation follows this list.<br>
<img src="../images/25. step - 4.png" width="50%"></img><br><br>
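The four numbered steps above can be combined into a short PyTorch sketch. This is an illustration of the math described here, not code taken from the repository, and as in the text the scores are left unscaled.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Causal (self) attention: softmax(Q K^T + M) V, where the L-by-L mask M
    is 0 on and below the diagonal and -inf above it.

    q, k, v: [L, d] tensors coming from the same sentence.
    """
    L = q.size(0)
    scores = q @ k.transpose(-2, -1)          # step 1: raw scores, shape [L, L]
    mask = torch.triu(                        # step 2: -inf strictly above the
        torch.full((L, L), float("-inf")),    #         diagonal, 0 on and below it
        diagonal=1,
    )
    weights = F.softmax(scores + mask, dim=-1)  # step 3: future positions get weight 0
    return weights @ v                          # step 4: same as dot-product attention


# Example: with 6 words and 16 features, the first query can only attend to itself,
# so the first row of the attention weights puts all of its mass on position 0.
q = k = v = torch.randn(6, 16)
print(causal_attention(q, k, v).shape)  # torch.Size([6, 16])
```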
Binary files: 5 new images added (112 KB, 337 KB, 193 KB, 345 KB, 401 KB).
