
Commit fa050e1

Change recommended tokenizer (#30)
1 parent 1704b52 · commit fa050e1

File tree: 3 files changed (+3 / -1 lines)


week03_fast_pipelines/homework/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -86,7 +86,7 @@ Don't forget that you also need to build a correct attention mask to prevent cro
 
 For each of the implemented methods (and all variations of the third method), mock one training epoch and measure minimum, maximum, mean and median batch processing times.
 To mock a training epoch, you need to construct a small GPT-2-like model: use an `nn.Embedding` layer, the `PositionalEncoding` class from the `transformer.py` file and a single `nn.TransformerDecoder` layer with a hidden size of 1024 and 8 heads.
-For tokenization, use `torchtext.data.utils.get_tokenizer("basic_english")`.
+For tokenization, use the `.tokenize()` method of `AutoTokenizer.from_pretrained("bert-base-uncased")`.
 Run one epoch **without a backward pass**.
 Make sure you've [warmed up](https://forums.developer.nvidia.com/t/why-warm-up/48565) the GPU before computing the statistics and do not forget about asynchronous CUDA kernel execution.
 
```
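For reference, a minimal sketch of the newly recommended tokenization call, assuming the `transformers` version pinned below in `requirements.txt`; the sample sentence and printed output are illustrative only:

```python
from transformers import AutoTokenizer

# Download (or load from cache) the pretrained BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# .tokenize() returns subword strings, not token ids.
tokens = tokenizer.tokenize("Fast pipelines need fast tokenizers.")
print(tokens)  # e.g. ['fast', 'pipelines', 'need', 'fast', 'token', '##izer', '##s', '.']
```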

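The warm-up and synchronization requirement could look like the sketch below. It assumes `model` is the mock GPT-2-like model taking a single batch tensor and `loader` is one of the implemented dataloaders; the homework's actual interfaces may differ:

```python
import time
import torch

def measure_epoch(model, loader, device="cuda", warmup_steps=10):
    """Time forward passes only (no backward), reporting min/max/mean/median."""
    model = model.to(device).eval()
    times = []
    with torch.no_grad():
        for step, batch in enumerate(loader):
            batch = batch.to(device)
            torch.cuda.synchronize()  # drain pending async kernels before timing
            start = time.perf_counter()
            model(batch)
            torch.cuda.synchronize()  # wait for the forward pass to actually finish
            if step >= warmup_steps:  # discard warm-up iterations
                times.append(time.perf_counter() - start)
    stats = torch.tensor(times)
    return stats.min(), stats.max(), stats.mean(), stats.median()
```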
week03_fast_pipelines/homework/requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ torch==2.4.0
 torchtext
 torchvision==0.19.0
 tqdm==4.64.1
+transformers==4.48.2
 vit_pytorch==0.40.2
 gdown==4.7.3
 matplotlib==3.8.2
```

week03_fast_pipelines/homework/task2/dataset.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -3,6 +3,7 @@
 import torch
 from torch.utils.data.dataset import Dataset
 from torch.utils.data import Sampler, IterableDataset
+from transformers import AutoTokenizer
 
 
 MAX_LENGTH = 640
```

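The diff shows only the new import. As a purely hypothetical illustration (the class and helper names below are assumptions, not the homework's actual code), the dataset could use it to pre-tokenize raw text and truncate to `MAX_LENGTH`:

```python
import torch
from torch.utils.data.dataset import Dataset
from transformers import AutoTokenizer

MAX_LENGTH = 640


class TokenizedTextDataset(Dataset):
    """Hypothetical dataset that tokenizes raw strings with the BERT tokenizer."""

    def __init__(self, texts):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # .tokenize() yields subword strings; map them to ids and truncate.
        tokens = self.tokenizer.tokenize(self.texts[idx])[:MAX_LENGTH]
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        return torch.tensor(ids, dtype=torch.long)
```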