Commit b19c4ec

master: training NMT.
1 parent 8d9877d commit b19c4ec

File tree: 1 file changed (+13, -13 lines)

  • Chapter-wise code/Code - PyTorch/7. Attention Models/1. NMT

Chapter-wise code/Code - PyTorch/7. Attention Models/1. NMT/NMT SetUp.md

Lines changed: 13 additions & 13 deletions
@@ -30,43 +30,43 @@ Let us assume we want to train an image captioning model, and the ground truth c
## Training NMT

1. The initial `Select` makes two copies each of the input tokens (the English words), represented by 0, and of the target tokens (the German words), represented by 1.<br>
-<img src="./images/15. step - 1.png" width="30%"></img><br><br>
+<img src="./images/15. step - 1.png" width="40%"></img><br><br>

2. One copy of the input tokens is fed into the input encoder to be transformed into the key and value vectors (a sketch of the encoder and pre-attention decoder follows the diff).<br>
-<img src="./images/16. step - 2.png" width="30%"></img><br><br>
+<img src="./images/16. step - 2.png" width="40%"></img><br><br>

3. Meanwhile, a copy of the target tokens goes into the pre-attention decoder.<br>
-<img src="./images/17. step - 3.png" width="30%"></img><br><br>
+<img src="./images/17. step - 3.png" width="40%"></img><br><br>

4. The pre-attention decoder transforms the prediction targets into a different vector space, the query vectors, which will be used to calculate the relative weight to give each input. To do this, the pre-attention decoder takes the target tokens and shifts them one place to the right; this is where teacher forcing takes place. Every token is shifted one place to the right, and a start-of-sentence token is assigned to the beginning of each sequence (a small sketch of this shift follows the diff).<br>
-<img src="./images/18. step - 4.png" width="30%"></img><br><br>
+<img src="./images/18. step - 4.png" width="40%"></img><br><br>

5. Next, the inputs and targets are converted to embeddings, i.e. initial representations of the words.<br>
-<img src="./images/19. step - 5.png" width="30%"></img><br><br>
+<img src="./images/19. step - 5.png" width="40%"></img><br><br>

6. Now that you have your query, key, and value vectors, you can prepare them for the attention layer. The mask is applied after the computation of Q·Kᵀ but before the softmax: the `where` operator in your programming assignment converts the scores at zero-padding tokens to negative one billion (-1e9), which becomes approximately zero when computing the softmax. That's how padding is handled (see the masked-attention sketch after the diff).<br>
-<img src="./images/20. step - 6.png" width="30%"></img><br><br>
+<img src="./images/20. step - 6.png" width="40%"></img><br><br>

7. The residual block adds the queries generated in the pre-attention decoder to the results of the attention layer.<br>
-<img src="./images/21. step - 7.png" width="30%"></img><br><br>
+<img src="./images/21. step - 7.png" width="40%"></img><br><br>

8. The attention layer then outputs its activations along with the mask that was created earlier.<br>
-<img src="./images/22. step - 8.png" width="30%"></img><br><br>
+<img src="./images/22. step - 8.png" width="40%"></img><br><br>

9. It's time to drop the mask before running everything through the decoder, which is what the second `Select` does. It takes the activations from the attention layer (the 0) and the second copy of the target tokens (the 2), which you may remember from way back at the beginning. These are the true targets that the decoder needs to compare against the predictions.<br>
-<img src="./images/23. step - 9.png" width="30%"></img><br><br>
+<img src="./images/23. step - 9.png" width="40%"></img><br><br>

10. Then run everything through a dense layer, i.e. a simple linear layer with your target vocabulary size. This gives your output the right size (see the output-head sketch after the diff).<br>
-<img src="./images/24. step - 10.png" width="30%"></img><br><br>
+<img src="./images/24. step - 10.png" width="40%"></img><br><br>

11. Finally, you take the outputs and run them through LogSoftmax, which transforms the dense layer's scores into log-probabilities over the target vocabulary.<br>
-<img src="./images/25. step - 11.png" width="30%"></img><br><br>
+<img src="./images/25. step - 11.png" width="40%"></img><br><br>

12. Those last four steps comprise your decoder.<br>
-<img src="./images/26. step - 12.png" width="30%"></img><br><br>
+<img src="./images/26. step - 12.png" width="40%"></img><br><br>

13. The true target tokens are still hanging out here, and they are passed down along with the log probabilities to be matched against the predictions (a loss sketch follows the diff).<br>
-<img src="./images/27. step - 13.png" width="30%"></img><br><br>
+<img src="./images/27. step - 13.png" width="40%"></img><br><br>

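Step 4's shift-right can be sketched in PyTorch as below. This is a minimal illustration of how teacher forcing prepares the decoder input; `BOS_ID`, the toy token ids, and the helper name `shift_right` are assumptions, not taken from the repository.

```python
import torch

BOS_ID = 1  # assumed start-of-sentence token id

def shift_right(targets: torch.Tensor) -> torch.Tensor:
    """Prepend a start-of-sentence token and drop the last position,
    so the decoder sees token t-1 when predicting token t."""
    bos = torch.full((targets.size(0), 1), BOS_ID, dtype=targets.dtype)
    return torch.cat([bos, targets[:, :-1]], dim=1)

targets = torch.tensor([[5, 8, 3, 9]])     # toy German token ids
print(shift_right(targets))                # tensor([[1, 5, 8, 3]])
```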
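For steps 2, 3, and 5, here is a minimal sketch of how an input encoder can produce the activations used as keys and values while the pre-attention decoder produces the queries from the shifted targets. The embedding-plus-LSTM layout, the sizes (`d_model`, vocabulary sizes), and the toy token ids are assumptions for illustration, not the repository's actual model.

```python
import torch
import torch.nn as nn

d_model, in_vocab, out_vocab = 8, 1000, 1000    # assumed sizes

enc_emb = nn.Embedding(in_vocab, d_model)       # step 5: input embeddings
dec_emb = nn.Embedding(out_vocab, d_model)      # step 5: target embeddings
input_encoder = nn.LSTM(d_model, d_model, batch_first=True)
pre_attention_decoder = nn.LSTM(d_model, d_model, batch_first=True)

input_tokens = torch.tensor([[4, 7, 2, 0, 0]])    # toy English ids, 0 = padding
shifted_targets = torch.tensor([[1, 5, 8, 3]])    # toy German ids, shifted right (step 4)

keys_values, _ = input_encoder(enc_emb(input_tokens))           # step 2: used as K and V
queries, _ = pre_attention_decoder(dec_emb(shifted_targets))    # step 3: used as Q
```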
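The masking in step 6 and the residual add in step 7 can be sketched as below: scores at padded key positions are set to -1e9 before the softmax, so their attention weights come out approximately zero. The function name, shapes, and the scaling by the square root of the depth are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, pad_mask):
    # Dot-product attention scores; scaling by sqrt(d) is a standard detail assumed here.
    scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
    # Step 6: padded key positions get -1e9, so the softmax gives them ~0 weight.
    scores = torch.where(pad_mask, scores, torch.full_like(scores, -1e9))
    weights = F.softmax(scores, dim=-1)
    return weights @ v

q = torch.randn(1, 4, 8)              # (batch, target_len, d) queries
k = v = torch.randn(1, 6, 8)          # (batch, input_len, d) keys / values
pad_mask = torch.tensor([[[True, True, True, True, False, False]]])  # last two inputs are padding
attn_out = masked_attention(q, k, v, pad_mask)

# Step 7: the residual block simply adds the queries to the attention output.
residual_out = q + attn_out
```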
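Steps 10 and 11 amount to a linear projection to the target vocabulary size followed by LogSoftmax. The sizes below are assumed for illustration.

```python
import torch
import torch.nn as nn

d_model, target_vocab = 8, 1000     # assumed sizes

output_head = nn.Sequential(
    nn.Linear(d_model, target_vocab),   # step 10: project to the target vocab size
    nn.LogSoftmax(dim=-1),              # step 11: scores -> log-probabilities
)

decoder_activations = torch.randn(1, 4, d_model)   # (batch, target_len, d_model)
log_probs = output_head(decoder_activations)       # (batch, target_len, target_vocab)
```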
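Step 13's matching of the log probabilities against the true target tokens is sketched here as a padding-aware negative log-likelihood loss; `PAD_ID`, the shapes, and the exact loss setup are assumptions rather than the repository's code.

```python
import torch
import torch.nn.functional as F

PAD_ID = 0   # assumed padding token id

log_probs = torch.randn(1, 4, 1000).log_softmax(dim=-1)    # stand-in for the decoder's output
true_targets = torch.tensor([[5, 8, 3, PAD_ID]])           # the second copy of the target tokens

loss = F.nll_loss(
    log_probs.reshape(-1, log_probs.size(-1)),   # (batch * target_len, vocab)
    true_targets.reshape(-1),                    # (batch * target_len,)
    ignore_index=PAD_ID,                         # don't penalize padded positions
)
```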