Chapter-wise code/Code - PyTorch/7. Attention Models/2. Neural Text Summarization/1. Transformer Models/Readme.md
@@ -7,4 +7,26 @@ the translation and as we know, with large sequences, the information tends to g
LSTMs and GRUs can help to overcome the vanishing gradient problem, but even those will fail to process long sequences.<br><br>
<img src="../images/1. drawbacks of seq2seq.png" width="50%"></img><br>
2. In a conventional encoder-decoder architecture, the model would again take T timesteps to compute the translation, since each decoding step depends on the previous one (a minimal sketch of this sequential loop follows the list below).<br><br>
5. Transformers also use positional encoding to capture sequential information. The positional encoding outputs values that are added to the embeddings, so every input word given to the model carries some information about its order and position (see the positional-encoding sketch after this list).
6. Unlike a recurrent layer, the multi-head attention layer computes the output for each input in the sequence independently, which allows us to parallelize the computation. However, on its own it fails to model the order of the sequence, which is why the positional encoding stage must be incorporated into the transformer model (see the attention sketch below).
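
To make the cost described in point 2 concrete, here is a minimal sketch of greedy decoding with a GRU-based encoder-decoder in PyTorch. All sizes and names (`vocab_size`, `decoder_cell`, the start-token id) are illustrative assumptions, not code from this repository; the point is only that the decoding loop must run T times in sequence, because each step needs the hidden state produced at the previous step.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not taken from this repo)
vocab_size, emb_dim, hidden_dim, T = 1000, 64, 128, 20

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder_cell = nn.GRUCell(emb_dim, hidden_dim)
output_layer = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (1, T))   # one source sentence of length T
_, h = encoder(embedding(src))               # encode the whole source sequence
h = h.squeeze(0)                             # (1, hidden_dim)

# The decoder runs one step at a time: step t depends on the hidden state
# from step t-1, so producing T output tokens costs T sequential steps.
token = torch.zeros(1, dtype=torch.long)     # <start> token id (illustrative)
outputs = []
for t in range(T):
    h = decoder_cell(embedding(token), h)    # sequential dependency
    logits = output_layer(h)
    token = logits.argmax(dim=-1)            # greedy choice of the next token
    outputs.append(token)
```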
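
For point 5, this is a small sketch of the sinusoidal positional encoding from "Attention Is All You Need". The function name and tensor shapes are illustrative; the key idea is that the encoding is simply added to the word embeddings, so each token carries a signature of its position.

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Standard sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    position = torch.arange(max_len).unsqueeze(1).float()                   # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-torch.log(torch.tensor(10000.0)) / d_model))   # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                               # (max_len, d_model)

# The encoding is *added* to the word embeddings, so every input vector
# now contains information about where its token sits in the sequence.
seq_len, d_model = 10, 16                                   # illustrative sizes
word_embeddings = torch.randn(1, seq_len, d_model)          # (batch, seq_len, d_model)
pe = sinusoidal_positional_encoding(seq_len, d_model)
inputs_with_position = word_embeddings + pe.unsqueeze(0)    # broadcast over the batch
```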
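
For point 6, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` to show that self-attention processes every position of the sequence in a single batched computation, with no step-by-step recurrence. The sizes are again illustrative assumptions.

```python
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 10, 16, 4            # illustrative sizes
x = torch.randn(1, seq_len, d_model)             # (batch, seq_len, d_model)

# Self-attention sees the whole sequence at once: all positions are projected
# and attended to in one batched matrix multiplication, so the work over the
# sequence can be parallelized instead of unrolled over T timesteps.
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
out, attn_weights = mha(x, x, x)                 # query = key = value = x
print(out.shape)                                 # torch.Size([1, 10, 16])

# But the layer ignores token order: permuting the inputs just permutes the
# outputs. That is why the positional encoding above must be added to the
# embeddings before this layer.
```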