
Commit 073b80a

committed
master: multi-head attention - summary.
1 parent 9217463 commit 073b80a

File tree

2 files changed: +7 -0 lines


Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Multi Head Attention.md

Lines changed: 7 additions & 0 deletions
@@ -1,5 +1,7 @@
# Multi-Head Attention

## Overview

1. The input to multi-head attention is a set of three values: Queries, Keys, and Values.<br><br>
<img src="../images/26. step -1 .png" width="50%"></img><br>
2. To achieve the multiple lookups, you first apply a fully-connected, dense linear layer to each query, key, and value. This layer creates the representations for the parallel attention heads (a PyTorch sketch of these steps appears after the list).<br><br>
@@ -11,3 +13,8 @@
<img src="../images/29. step - 4.png" width="50%"></img><br>
6. The scaled dot-product is the same as the one used in the dot-product attention model, except for the scale factor of one over the square root of d_k, where d_k is the key/query dimension. This normalization prevents the gradients of the function from becoming extremely small when large values of d_k are used.<br><br>
<img src="../images/30. step - 5.png" width="50%"></img><br>

## Summary
<img src="../images/31. multi-head attention.png"></img><br>

## Maths behind Multi-Head Attention
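For reference, a sketch of the standard equations behind the steps above, following the notation of "Attention Is All You Need" (here h is the number of heads and W_i^Q, W_i^K, W_i^V, W^O are the learned projection matrices):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}
```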
(Second changed file: 346 KB, no text diff shown.)
