
Commit 35c19b2

master: set up of NMT.
1 parent b68bac9 commit 35c19b2

5 files changed: +22 -1 lines changed
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Setup for Machine Translation

## Data in NMT

Below we have an English data sequence, *I'm hungry*, and on the right, its German equivalent.
Further down we have a second sequence, *I watch the soccer game*, and its German equivalent.
We are going to have a great many of these sentence pairs as inputs. One thing to note here is that the data set used is not entirely clean.

<img src="./images/10. data in NMT.png" width="50%"></img><br><br>

## Pre-requisites

1. *Input*: Take an English sentence as input.
2. *Tokenization*: Split the sentence into tokens and represent each token as a vector. State-of-the-art models use pre-trained word vectors; otherwise, represent words as one-hot vectors to create the input.
3. *Padding*: Pad the tokenized sequences so that all inputs have equal length.<br><br>
<img src="./images/11. NMT setup-english.png" width="50%"></img><br><br>
4. Repeat steps 1-3 for the German sentences as well.<br><br>
<img src="./images/11. NMT setup-german.png" width="50%"></img><br><br>
5. Keep track of the vocabulary indices with `word2index` and `index2word` mappings.
6. Use start-of-sentence `<SOS>` and end-of-sentence `<EOS>` tokens to mark the beginning and end of each sequence.
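
The steps above can be sketched in plain Python. This is a minimal illustration, not the code used in this repository: the helper names (`tokenize`, `build_vocab`, `encode`), the lowercased whitespace tokenizer, and the `<PAD>` token at index 0 are all assumptions made for the example. Only `word2index`, `index2word`, `<SOS>` and `<EOS>` come from the list above.

```python
# Special tokens. <PAD> is an assumed padding token; <SOS>/<EOS> are from the setup above.
PAD, SOS, EOS = "<PAD>", "<SOS>", "<EOS>"


def tokenize(sentence):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return sentence.lower().split()


def build_vocab(sentences):
    """Build word2index and index2word mappings, reserving ids 0-2 for special tokens."""
    word2index = {PAD: 0, SOS: 1, EOS: 2}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in word2index:
                word2index[token] = len(word2index)
    index2word = {index: word for word, index in word2index.items()}
    return word2index, index2word


def encode(sentence, word2index, max_len):
    """Wrap the sentence in <SOS>/<EOS>, map tokens to ids, and right-pad to max_len."""
    tokens = [SOS] + tokenize(sentence) + [EOS]
    ids = [word2index[token] for token in tokens]
    ids += [word2index[PAD]] * (max_len - len(ids))
    return ids


# The two English examples from the "Data in NMT" section.
english = ["I'm hungry", "I watch the soccer game"]
en_word2index, en_index2word = build_vocab(english)

# Longest sentence plus two slots for <SOS> and <EOS>.
max_len = max(len(tokenize(s)) for s in english) + 2
batch = [encode(s, en_word2index, max_len) for s in english]

print(batch[0])  # [1, 3, 4, 2, 0, 0, 0]
print(batch[1])  # [1, 5, 6, 7, 8, 9, 2]
```

The same three helpers would be applied to the German sentences with a separate vocabulary. In a real model, the integer ids would typically index an embedding layer (pre-trained or learned) rather than being expanded into one-hot vectors.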

Chapter-wise code/Code - PyTorch/7. Attention Models/1. NMT/Readme.md

Lines changed: 2 additions & 1 deletion
@@ -81,8 +81,9 @@ In a situation (as shown below) where the grammar of foreign language requires a
The first four tokens, *the agreements on the*, are pretty straightforward, but then the grammatical structure between French and English changes. Now, instead of looking at the corresponding fifth token to translate the French word *zone*, the attention knows to look further down at the eighth token, which corresponds to the English word *area*, glorious and necessary.

## Next Up

Next, we will learn about the setup required to build an NMT model and the kind of dataset used to build one. The readme file for the same is here.
