
Commit fa050e1

Change recommended tokenizer (#30)
1 parent 1704b52 · commit fa050e1

File tree: 3 files changed (+3 / -1 lines)


week03_fast_pipelines/homework/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -86,7 +86,7 @@ Don't forget that you also need to build a correct attention mask to prevent cro
 
 For each of the implemented methods (and all variations of the third method), mock one training epoch and measure minimum, maximum, mean and median batch processing times.
 To mock a training epoch, you need to construct a small GPT-2-like model: use an `nn.Embedding` layer, the `PositionalEncoding` class from the `transformer.py` file and a single `nn.TransformerDecoder` layer with a hidden size of 1024 and 8 heads.
-For tokenization, use `torchtext.data.utils.get_tokenizer("basic_english")`.
+For tokenization, use the `.tokenize()` method of `AutoTokenizer.from_pretrained("bert-base-uncased")`.
 Run one epoch **without a backward pass**.
 Make sure you've [warmed up](https://forums.developer.nvidia.com/t/why-warm-up/48565) the GPU before computing the statistics and do not forget about asynchronous CUDA kernel execution.
 
```
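For reference, a minimal sketch of the newly recommended tokenization call, assuming the `transformers` version pinned below in `requirements.txt`; the sample sentence and printed output are illustrative only:

```python
from transformers import AutoTokenizer

# Download (or load from cache) the pretrained BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# .tokenize() returns subword strings, not token ids.
tokens = tokenizer.tokenize("Fast pipelines need fast tokenizers.")
print(tokens)  # e.g. ['fast', 'pipelines', 'need', 'fast', 'token', '##izer', '##s', '.']
```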

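The warm-up and synchronization requirement could look like the sketch below. It assumes `model` is the mock GPT-2-like model taking a single batch tensor and `loader` is one of the implemented dataloaders; the homework's actual interfaces may differ:

```python
import time
import torch

def measure_epoch(model, loader, device="cuda", warmup_steps=10):
    """Time forward passes only (no backward), reporting min/max/mean/median."""
    model = model.to(device).eval()
    times = []
    with torch.no_grad():
        for step, batch in enumerate(loader):
            batch = batch.to(device)
            torch.cuda.synchronize()  # drain pending async kernels before timing
            start = time.perf_counter()
            model(batch)
            torch.cuda.synchronize()  # wait for the forward pass to actually finish
            if step >= warmup_steps:  # discard warm-up iterations
                times.append(time.perf_counter() - start)
    stats = torch.tensor(times)
    return stats.min(), stats.max(), stats.mean(), stats.median()
```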
week03_fast_pipelines/homework/requirements.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@ torch==2.4.0
 torchtext
 torchvision==0.19.0
 tqdm==4.64.1
+transformers==4.48.2
 vit_pytorch==0.40.2
 gdown==4.7.3
 matplotlib==3.8.2
```

week03_fast_pipelines/homework/task2/dataset.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -3,6 +3,7 @@
 import torch
 from torch.utils.data.dataset import Dataset
 from torch.utils.data import Sampler, IterableDataset
+from transformers import AutoTokenizer
 
 
 MAX_LENGTH = 640
```

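The diff shows only the new import. As a purely hypothetical illustration (the class and helper names below are assumptions, not the homework's actual code), the dataset could use it to pre-tokenize raw text and truncate to `MAX_LENGTH`:

```python
import torch
from torch.utils.data.dataset import Dataset
from transformers import AutoTokenizer

MAX_LENGTH = 640


class TokenizedTextDataset(Dataset):
    """Hypothetical dataset that tokenizes raw strings with the BERT tokenizer."""

    def __init__(self, texts):
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # .tokenize() yields subword strings; map them to ids and truncate.
        tokens = self.tokenizer.tokenize(self.texts[idx])[:MAX_LENGTH]
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        return torch.tensor(ids, dtype=torch.long)
```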