11"""
2- Language Translation with TorchText
2+ TorchText๋ก ์ธ์ด ๋ฒ์ญํ๊ธฐ
33===================================
44
5- This tutorial shows how to use several convenience classes of ``torchtext`` to preprocess
6- data from a well-known dataset containing sentences in both English and German and use it to
7- train a sequence-to-sequence model with attention that can translate German sentences
8- into English.
5+ ์ด ํํ ๋ฆฌ์ผ์์๋ ``torchtext`` ์ ์ ์ฉํ ์ฌ๋ฌ ํด๋์ค๋ค๊ณผ ์ํ์ค ํฌ ์ํ์ค(sequence-to-sequence, seq2seq)๋ชจ๋ธ์ ํตํด
6+ ์์ด์ ๋
์ผ์ด ๋ฌธ์ฅ๋ค์ด ํฌํจ๋ ์ ๋ช
ํ ๋ฐ์ดํฐ ์
์ ์ด์ฉํด์ ๋
์ผ์ด ๋ฌธ์ฅ์ ์์ด๋ก ๋ฒ์ญํด ๋ณผ ๊ฒ์
๋๋ค.
97
10- It is based off of
11- `this tutorial <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb >`__
12- from PyTorch community member `Ben Trevett <https://github.com/bentrevett>`__
13- and was created by `Seth Weidman <https://github.com/SethHWeidman/>`__ with Ben's permission .
8+ ์ด ํํ ๋ฆฌ์ผ์
9+ PyTorch ์ปค๋ฎค๋ํฐ ๋ฉค๋ฒ์ธ `Ben Trevett <https://github.com/bentrevett>`__ ์ด ์์ฑํ
10+ `ํํ ๋ฆฌ์ผ <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb >`__ ์ ๊ธฐ์ดํ๊ณ ์์ผ๋ฉฐ
11+ `Seth Weidman <https://github.com/SethHWeidman/>`__ ์ด Ben์ ํ๋ฝ์ ๋ฐ๊ณ ๋ง๋ค์์ต๋๋ค .
1412
15- By the end of this tutorial, you will be able to :
13+ ์ด ํํ ๋ฆฌ์ผ์ ํตํด ์ฌ๋ฌ๋ถ์ ๋ค์๊ณผ ๊ฐ์ ๊ฒ์ ํ ์ ์๊ฒ ๋ฉ๋๋ค :
1614
17- - Preprocess sentences into a commonly-used format for NLP modeling using the following ``torchtext`` convenience classes :
15+ - ``torchtext`` ์ ์๋์ ๊ฐ์ ์ ์ฉํ ํด๋์ค๋ค์ ํตํด ๋ฌธ์ฅ๋ค์ NLP๋ชจ๋ธ๋ง์ ์์ฃผ ์ฌ์ฉ๋๋ ํํ๋ก ์ ์ฒ๋ฆฌํ ์ ์๊ฒ ๋ฉ๋๋ค :
1816 - `TranslationDataset <https://torchtext.readthedocs.io/en/latest/datasets.html#torchtext.datasets.TranslationDataset>`__
1917 - `Field <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.Field>`__
2018 - `BucketIterator <https://torchtext.readthedocs.io/en/latest/data.html#torchtext.data.BucketIterator>`__
2119"""
######################################################################
# `Field` and `TranslationDataset`
# ----------------
# ``torchtext`` has utilities for creating datasets that can be easily
# iterated through for the purposes of creating a language translation
# model. One key class is a
# `Field <https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L64>`__,
# which specifies the way each sentence should be preprocessed, and another is the
# `TranslationDataset`; ``torchtext``
# has several such datasets; in this tutorial we'll use the
# `Multi30k dataset <https://github.com/multi30k/dataset>`__, which contains about
# 30,000 sentences (averaging about 13 words in length) in both English and German.
#
# Note: the tokenization in this tutorial requires `Spacy <https://spacy.io>`__.
# We use Spacy because it provides strong support for tokenization in languages
# other than English. ``torchtext`` provides a ``basic_english`` tokenizer
# and supports other tokenizers for English (e.g.
# `Moses <https://bitbucket.org/luismsgomes/mosestokenizer/src/default/>`__),
# but for language translation - where multiple languages are required -
# Spacy is your best bet.
#
# To run this tutorial, first install ``spacy`` using ``pip`` or ``conda``.
# Next, download the raw data for the English and German Spacy tokenizers:
#
# ::
#
#    python -m spacy download en
#    python -m spacy download de
#
# With Spacy installed, the following code will tokenize each of the sentences
# in the ``TranslationDataset`` based on the tokenizer defined in the ``Field``.
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator
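# The ``Field`` definitions themselves are elided from this excerpt. As a rough
# sketch (assuming the Spacy tokenizer models downloaded above), they would look
# something like the following - illustrative only, not the tutorial's verbatim code:

SRC = Field(tokenize = "spacy",
            tokenizer_language = "de",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language = "en",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)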
# split the Multi30k data into train / validation / test sets, preprocessing each
# sentence with the fields defined above
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
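# As a quick, illustrative sanity check (not part of the original tutorial code),
# you can inspect one tokenized example; the ``src``/``trg`` keys come from the
# ``fields`` tuple above:

print(len(train_data.examples))      # number of training pairs
print(vars(train_data.examples[0]))  # dict with a tokenized German 'src' and English 'trg'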
######################################################################
# Now that we've defined ``train_data``, we can see an extremely useful
# feature of ``torchtext``'s ``Field``: the ``build_vocab`` method
# now allows us to create the vocabulary associated with each language.
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
######################################################################
# Once these lines of code have been run, ``SRC.vocab.stoi`` will be a
# dictionary with the tokens in the vocabulary as keys and their
# corresponding indices as values; ``SRC.vocab.itos`` is the inverse
# mapping - a list whose ``i``-th entry is the token with index ``i``.
# We won't make extensive use of this fact in this tutorial, but it will
# likely be useful in other NLP tasks you'll encounter.
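#
# For example, a quick check you could run here (illustrative, not part of the
# original tutorial code):

print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")
print(SRC.vocab.stoi['<unk>'])   # index assigned to the unknown-word token
print(TRG.vocab.itos[:4])        # typically ['<unk>', '<pad>', '<sos>', '<eos>']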
######################################################################
# ``BucketIterator``
# ----------------
# The last ``torchtext`` specific feature we'll use is the ``BucketIterator``,
# which is easy to use since it takes a ``TranslationDataset`` as its
# first argument. Specifically, as the docs say, it "defines an iterator
# that batches examples of similar lengths together" and "minimizes
# amount of padding needed while producing freshly shuffled batches for
# each new epoch. See pool for the bucketing procedure used."
import torch

# the batch size and device setup were elided from this excerpt; a typical choice:
BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
######################################################################
# These iterators can be called just like ``DataLoader``s; below, in
# the ``train`` and ``evaluate`` functions, they are called simply with:
#
# ::
#
#    for i, batch in enumerate(iterator):
#
# Each ``batch`` then has ``src`` and ``trg`` attributes:
#
# ::
#
#    src = batch.src
#    trg = batch.trg
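#
# As a quick, illustrative check (not in the original tutorial): with the default
# ``batch_first=False``, these tensors are laid out as ``[sequence length, batch size]``.

batch = next(iter(train_iterator))
print(batch.src.shape)   # e.g. torch.Size([<longest src sentence in batch>, BATCH_SIZE])
print(batch.trg.shape)   # e.g. torch.Size([<longest trg sentence in batch>, BATCH_SIZE])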
######################################################################
# Defining our ``nn.Module`` and ``Optimizer``
# ----------------
# That's mostly it from a ``torchtext`` perspective: with the dataset built
# and the iterator defined, the rest of this tutorial simply defines our
# model as an ``nn.Module``, along with an ``Optimizer``, and then trains it.
#
# Our model specifically follows the architecture described
# `here <https://arxiv.org/abs/1409.0473>`__ (you can find a
# significantly more commented version
# `here <https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__).
#
# Note: this model is just an example model that can be used for language
# translation; we choose it because it is a standard model for the task,
# not because it is the recommended model to use for translation. As you're
# likely aware, state-of-the-art models are currently based on Transformers;
# you can see PyTorch's capabilities for implementing Transformer layers
# `here <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__; and
# in particular, the "attention" used in the model below is different from
# the multi-headed self-attention present in a transformer model.
import random
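# The full Encoder / Attention / Decoder / Seq2Seq definitions are omitted from
# this excerpt. As a rough, self-contained sketch of the additive (Bahdanau-style)
# attention described above - the dimension names here are hypothetical and the
# module is illustrative only, not the tutorial's exact code:

import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim: int, dec_dim: int, attn_dim: int):
        super().__init__()
        # scores each encoder state against the current decoder state
        self.attn = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias = False)

    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden:  [batch size, dec_dim]
        # encoder_outputs: [src len, batch size, enc_dim]
        src_len = encoder_outputs.shape[0]
        # repeat the decoder state once per source position
        hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
        enc = encoder_outputs.permute(1, 0, 2)
        energy = torch.tanh(self.attn(torch.cat((hidden, enc), dim = 2)))
        scores = self.v(energy).squeeze(2)
        # normalized weights over source positions: [batch size, src len]
        return F.softmax(scores, dim = 1)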
# ... (inside the ``Seq2Seq`` model's ``forward`` method)

        encoder_outputs, hidden = self.encoder(src)

        # first input to the decoder is the <sos> token
        output = trg[0,:]

        for t in range(1, max_len):
            # ... (decoder loop body omitted in this excerpt)
# ... (the model instantiation, weight initialization, and optimizer are omitted here)

def count_parameters(model: nn.Module):
    # count only the parameters that will be updated during training
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f'The model has {count_parameters(model):,} trainable parameters')
######################################################################
# Note: when scoring the performance of a language translation model in
# particular, we have to tell the ``nn.CrossEntropyLoss`` function to
# ignore the indices where the target is simply padding.
PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
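# A tiny illustration (not from the tutorial) of what ``ignore_index`` does: the
# padded position below contributes nothing to the loss.

example_logits = torch.randn(3, len(TRG.vocab))    # predictions for 3 target positions
example_targets = torch.tensor([5, 7, PAD_IDX])    # the last position is padding
print(criterion(example_logits, example_targets))  # same as the mean loss over the first two positions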
######################################################################
# Finally, we can train and evaluate this model:
import math
import time
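# The ``train`` and ``evaluate`` functions themselves are omitted from this excerpt.
# As a rough sketch of the shape of the training step (assuming the model, optimizer,
# criterion, and iterators defined above; illustrative, not the tutorial's verbatim code):

def train(model, iterator, optimizer, criterion, clip: float = 1.0):
    # NOTE: sketch of the omitted ``train`` function, not the exact implementation
    model.train()
    epoch_loss = 0
    for _, batch in enumerate(iterator):
        src, trg = batch.src, batch.trg
        optimizer.zero_grad()
        output = model(src, trg)                        # [trg len, batch size, output dim]
        output = output[1:].view(-1, output.shape[-1])  # drop the <sos> position
        trg = trg[1:].view(-1)
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)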
# ... (the ``evaluate`` function, the full training loop, and the ``epoch_time``
# helper are omitted here; after training, the model is scored on the test set)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')
######################################################################
# Next steps
# --------------
#
# - Check out the rest of Ben Trevett's tutorials using ``torchtext``
#   `here <https://github.com/bentrevett/>`__
# - Stay tuned for a tutorial using other ``torchtext`` features along
#   with ``nn.Transformer`` for language modeling via next word prediction!
#