
Commit 1ab01bd

mryab and fdrose committed
Add week03 materials
Co-authored-by: fdrose <110309049+fdrose@users.noreply.github.com>
1 parent e69376b commit 1ab01bd

33 files changed

+4394 -1 lines changed

README.md

Lines changed: 4 additions & 1 deletion
@@ -10,7 +10,9 @@ __This branch corresponds to the ongoing 2025 course. If you want to see full ma
 - [__Week 2:__](./week02_management_and_testing) __Experiment tracking, model and data versioning, testing DL code in Python__
   - Lecture: Experiment management basics and pipeline versioning. Configuring Python applications. Intro to regular and property-based testing.
   - Seminar: Example DVC+Weights & Biases project walkthrough. Intro to testing with pytest.
-- __Week 3:__ __Training optimizations, profiling DL code__
+- [__Week 3:__ ](./week03_fast_pipelines) __Training optimizations, FP16/BF16/FP8 formats, profiling deep learning code__
+  - Lecture: Measuring performance of GPU-accelerated software. Mixed-precision training. Data storage and loading optimizations. Tools for profiling deep learning workloads.
+  - Seminar: Automatic Mixed Precision in PyTorch. Dynamic padding for sequence data and JPEG decoding benchmarks. Basics of profiling with py-spy, PyTorch Profiler, Memory Snapshot and Nsight Systems.
 - __Week 4:__ __Data-parallel training and All-Reduce__
 - __Week 5:__ __Sharded data-parallel training, distributed training optimizations__
 - __Week 6:__ __Training large models__
@@ -32,6 +34,7 @@ Please refer to the course page of your institution for details.
 - [Max Ryabinin](https://github.com/mryab)
 - [Just Heuristic](https://github.com/justheuristic)
 - [Yaroslav Zolotarev](https://github.com/Q-c7)
+- [Maksim Abraham](https://github.com/fdrose)
 - [Gregory Leleytner](https://github.com/RunFMe)
 - [Antony Frolov](https://github.com/antony-frolov)
 - [Anton Chigin](https://github.com/achigin)

week03_fast_pipelines/README.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Week 3: Training optimizations, profiling DL code

* Lecture: [slides](./lecture.pdf)
* Seminar: [folder](./seminar)
* Homework: see [homework/README.md](homework/README.md)

## Further reading
* [Blog post about reduced precision FP formats](https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407)
* NVIDIA blog posts about [mixed precision training with Tensor Cores](https://developer.nvidia.com/blog/video-mixed-precision-techniques-tensor-cores-deep-learning/), [Tensor Core performance tips](https://developer.nvidia.com/blog/optimizing-gpu-performance-tensor-cores/), [TF32 Tensor Cores](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/)
* Presentations about Tensor Cores: [one](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9926-tensor-core-performance-the-ultimate-guide.pdf), [two](https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21929-tensor-core-performance-on-nvidia-gpus-the-ultimate-guide.pdf), [three](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf)
* [Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) and [Mixed Precision Training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#mptrain) sections of the [NVIDIA DL performance guide](https://docs.nvidia.com/deeplearning/performance/index.html)
* [Automatic Mixed Precision in PyTorch](https://pytorch.org/docs/stable/amp.html)
* [TF32 section of PyTorch CUDA docs](https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
* [FP8 Formats for Deep Learning paper](https://arxiv.org/abs/2209.05433)
* [PyTorch Architecture Optimization](https://github.com/pytorch/ao) for FP8 training and other optimizations
* [Float8 in PyTorch discussion](https://dev-discuss.pytorch.org/t/float8-in-pytorch-1-x/1815)
* [AMP](https://www.deepspeed.ai/docs/config-json/#automatic-mixed-precision-amp-training-options), [FP16](https://www.deepspeed.ai/docs/config-json/#fp16-training-options) and [BF16](https://www.deepspeed.ai/docs/config-json/#bfloat16-training-options) in DeepSpeed
* [PyTorch Performance Tuning Guide](https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#)
* [Latency Numbers Every Programmer Should Know](https://colin-scott.github.io/personal_website/research/interactive_latency.html)
* [Pillow Performance benchmarks](https://python-pillow.org/pillow-perf/)
* [Faster Image Processing](https://fastai1.fast.ai/performance.html#faster-image-processing) tips from fastai docs
* [Rapid Data Pre-Processing with NVIDIA DALI](https://developer.nvidia.com/blog/rapid-data-pre-processing-with-nvidia-dali/)
* General-purpose Python profilers: [builtins (cProfile and profile)](https://docs.python.org/3/library/profile.html), [pyinstrument](https://github.com/joerick/pyinstrument), [memory_profiler](https://github.com/pythonprofilers/memory_profiler), [py-spy](https://github.com/benfred/py-spy), [Scalene](https://github.com/plasma-umass/scalene)
* [DLProf user guide](https://docs.nvidia.com/deeplearning/frameworks/dlprof-user-guide/index.html)
* [How to profile with DLProf](https://tigress-web.princeton.edu/~jdh4/how_to_profile_with_dlprof_may_2021.pdf)
* [Profiling and Optimizing Deep Neural Networks with DLProf and PyProf](https://developer.nvidia.com/blog/profiling-and-optimizing-deep-neural-networks-with-dlprof-and-pyprof/)
* NVIDIA presentations on [profiling DL networks](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9339-profiling-deep-learning-networks.pdf), [profiling for DL and mixed precision](https://on-demand.gputechconf.com/gtc-cn/2019/pdf/CN9620/presentation.pdf)
* [Profiling Deep Learning Workloads](https://extremecomputingtraining.anl.gov/files/2020/08/ATPESC-2020-Track-8-Talk-7-Emani-ProfilingDLWorkloads.pdf)
* [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) and [PyTorch Profiler with TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html) tutorial
* [torch.utils.bottleneck quick guide](https://pytorch.org/docs/stable/bottleneck.html)
* [PyTorch Autograd profiler tutorial](https://pytorch.org/tutorials/beginner/profiler.html)
* [Nsight Systems](https://docs.nvidia.com/nsight-systems/UserGuide/index.html) and [Nsight Compute](https://docs.nvidia.com/nsight-compute/2022.1/index.html) user guides
* [Video tutorial about speeding up and profiling neural networks](https://www.youtube.com/watch?v=ySGIaOb_RDY)
* [Solving Machine Learning Performance Anti-Patterns: a Systematic Approach](https://paulbridger.com/posts/nsight-systems-systematic-optimization/)
week03_fast_pipelines/homework/README.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# Week 3 home assignment

The assignment for this week consists of three parts: all parts are obligatory, there are no bonus tasks, but you can earn more than 10 points in total.
Implement your solutions in the folders for the corresponding tasks.
Create a report for your homework: briefly describe the structure of your solution for each section, include benchmark results in the tables, and provide explanations of the observed results.
Poorly written reports will result in a reduced grade for the assignment!

Make sure to install the necessary packages from `requirements.txt` in the week's folder.

## Submission format
- For the report, you need to create an `.ipynb` or a `.pdf` file.
- Create a `.zip` archive that contains:
  - Folders with your solutions for each task
  - The report file with instructions on how to run each part, results of running the code and (when necessary) your analysis
- Upload this archive when submitting the assignment

## Task 1: DIY loss scaling (2 points)
Implement [loss scaling](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#lossscaling) for the AMP training mode.
Use the provided semantic segmentation pipeline in [`task1`](./task1).
Your task is to train the model in the AMP mode with a loss scaler implemented by you.
You **can use** `torch.cuda.amp.autocast`, but you **cannot use** `torch.cuda.amp.GradScaler()` (you may use it only for checking your solution).

Let us recall what loss scaling is.
Loss scaling is used to avoid the gradient underflow problem when computing gradients in FP16 precision.
The issue is that during training we might get rather small values in the gradients, and these vanish when the tensors are cast to half precision.
To fix the problem, we use the following procedure:

- Make a forward pass through the model and compute the loss
- Multiply the loss value by some factor
- Call `.backward()`
- Update the model's master weights with **unscaled** FP32 gradients (i.e., divide the gradients by the same factor before the update)

Loss scaling can be done in two different ways: static and dynamic.
In the static mode, you choose the scaling factor once and use it for the whole training procedure.
In the dynamic mode, you recompute the factor each time you scale the loss.
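
Below is a minimal sketch of the static variant with a toy model and random data (these are placeholders, not the provided segmentation pipeline; a dynamic scaler would additionally grow the factor after a streak of successful steps and shrink it whenever infs/NaNs appear in the gradients):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real pipeline; assumes a CUDA device is available.
model = nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
SCALE = 2.0 ** 16  # static scaling factor, chosen once for the whole run

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)
    (loss * SCALE).backward()        # scale the loss so small FP16 gradients don't underflow
    for p in model.parameters():     # unscale the gradients before the optimizer step
        if p.grad is not None:
            p.grad.div_(SCALE)
    optimizer.step()                 # the step sees ordinary, unscaled FP32 gradients
```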

### Task
- Implement static loss scaling (**1 point**)
- Implement dynamic loss scaling (**1 point**)

The task is done if you manage to consistently achieve high accuracy values (0.985+) within 5 training epochs.
Note that you need to implement and successfully train with **both** scaling modes to get the full grade for this task.
As a starting point, you can run the training in full precision, then try to run it in the AMP mode with and without the PyTorch loss scaler.
You will observe that adding a scaler gives you additional accuracy points.

**Hint:** To make sure that you're doing everything right, you might want to examine the values of gradients: (almost) no zeros should be present there.
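
For example, a tiny helper like the one below (a hypothetical utility, not part of the provided code) reports the fraction of exactly-zero gradient entries right after `backward()`; with a well-chosen scale it should stay close to what you see in full-precision training:

```python
import torch

def zero_grad_fraction(model: torch.nn.Module) -> float:
    """Fraction of gradient entries that are exactly zero (call right after backward())."""
    total, zeros = 0, 0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.numel()
            zeros += (p.grad == 0).sum().item()
    return zeros / max(total, 1)
```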

### Report instructions
When you are done with the code, you can either:
- Run the training function with the implemented scaling modes in an `.ipynb` report
- Include training logs AND instructions on how to run your code in a `.pdf` report

## Task 2: Efficient batching for language modeling (4 points)
In this part, you need to examine the efficiency of the four batching approaches we discussed during the seminar.
Let us briefly recall them:

**BRAIN**: pad everything to a fixed `max_length`

**BIG BRAIN**: pad only in the `collate_fn`

**ULTRA BIG BRAIN**: group examples of similar length into buckets, and sample examples for every batch from a single bucket

**ULTRA DUPER BIG BRAIN**: pack all sequences into one long sequence and generate metadata that indicates where each original sequence starts and ends
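
As a reference point for the second approach, a dynamic-padding `collate_fn` can be as small as the sketch below (the pad token id and the "list of token id lists" batch format are assumptions, not the seminar code). Passing it via `DataLoader(dataset, batch_size=..., collate_fn=collate_fn)` is enough to switch from fixed-length to per-batch padding.

```python
import torch

PAD_ID = 0  # placeholder pad token id

def collate_fn(batch):
    """Pad each sample only up to the longest sequence in this batch ("BIG BRAIN")."""
    max_len = max(len(ids) for ids in batch)  # batch is a list of token id lists
    padded = torch.full((len(batch), max_len), PAD_ID, dtype=torch.long)
    for i, ids in enumerate(batch):
        padded[i, : len(ids)] = torch.tensor(ids, dtype=torch.long)
    return padded
```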

### Task
More formally, you need to download the [WikiText-103 dataset](https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip) and implement all the mentioned approaches.
Use only the training subset for all the task's subproblems.

- For naive batching, implement a PyTorch `Dataset` class that parses training data from the source files of the dataset and pads every sample to a fixed `max_length=640`. **(0.5 points)**
- For the second approach, reimplement the `collate_fn` demo from the seminar for this dataset. **(0.5 points)**
More specifically, you need to pad sequences only up to the maximum sample length in the current batch.
- For the third approach, implement the `UltraBigBrainDataset` and the `UltraBigBrainBatchSampler` classes. **(1.5 points)**
Objects of the `BatchSampler` class are iterables that yield lists of indices corresponding to the dataset objects placed in a batch.
You can pass this batch sampler to a `DataLoader` (a simplified usage sketch is shown after this list).
For more information, refer to the PyTorch [docs](https://pytorch.org/docs/stable/data.html#automatic-batching-default).
Objects in each batch should have the same or similar length.
Sample batches randomly, but ensure that the length difference between the longest and shortest samples is less than or equal to k (try different values of k: 1, 5, 10, 20, 50).
Note that some batches may be shorter than the specified batch size.
The `__init__` method must work in O(n) time, where n is the length of the dataset.
The `__iter__` call must work in O(1) time with respect to the size of the dataset (and obviously, in O(batch_size)).
While processing the dataset, put all possible lengths of the samples into a hash table, where keys are lengths and values are containers with the indices of samples of this length.
- For the fourth approach, we recommend using `IterableDataset`, which is a good choice when we don't know in advance how many samples are needed to create a batch. **(1.5 points)**
If the last sample is too long, you can either truncate it or drop it from the dataset.
Don't forget that you also need to build a correct attention mask to prevent cross-contamination of training examples and pass it to the model!
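
To illustrate how a custom batch sampler plugs into the data loading API, here is a deliberately simplified sketch that buckets samples by exact length only (it does not satisfy the k-tolerance or complexity requirements above, which are left to you):

```python
import random
from collections import defaultdict

class LengthBucketBatchSampler:
    """Simplified sketch: groups sample indices by exact length and yields batches from one bucket."""

    def __init__(self, lengths, batch_size):
        self.batch_size = batch_size
        self.buckets = defaultdict(list)  # length -> indices of samples with that length, built in O(n)
        for idx, length in enumerate(lengths):
            self.buckets[length].append(idx)

    def __iter__(self):
        for indices in self.buckets.values():
            random.shuffle(indices)
            for start in range(0, len(indices), self.batch_size):
                yield indices[start : start + self.batch_size]

    def __len__(self):
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())

# Usage: torch.utils.data.DataLoader(dataset, batch_sampler=LengthBucketBatchSampler(lengths, 64), collate_fn=collate_fn)
```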

For each of the implemented methods (and all variations of the third method), mock one training epoch and measure the minimum, maximum, mean, and median batch processing times.
To mock a training epoch, you need to construct a small GPT-2-like model: use an `nn.Embedding` layer, the `PositionalEncoding` class from the `transformer.py` file, and a single `nn.TransformerDecoder` layer with a hidden size of 1024 and 8 heads.
For tokenization, use `torchtext.data.utils.get_tokenizer("basic_english")`.
Run one epoch **without a backward pass**.
Make sure you've [warmed up](https://forums.developer.nvidia.com/t/why-warm-up/48565) the GPU before computing the statistics and do not forget about asynchronous CUDA kernel execution.
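
One way to do this correctly (a sketch with placeholder `model` and `loader` objects, assuming a CUDA device) is to skip a few warm-up batches and synchronize before reading the clock; the min/max/mean/median statistics can then be computed from the returned list:

```python
import time
import torch

def measure_batch_times(model, loader, n_warmup=10):
    """Forward-only timing of each batch; synchronizes so asynchronous CUDA kernels are counted."""
    model.eval()
    times = []
    with torch.no_grad():
        for i, batch in enumerate(loader):
            batch = batch.cuda(non_blocking=True)
            if i < n_warmup:          # warm-up iterations are not measured
                model(batch)
                continue
            torch.cuda.synchronize()  # start from an idle GPU
            start = time.perf_counter()
            model(batch)
            torch.cuda.synchronize()  # wait for all launched kernels before stopping the timer
            times.append(time.perf_counter() - start)
    return times
```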

Keep in mind that all padding in this task must be **implemented by you**: unlike in the seminar, PyTorch’s default collate padding is not allowed.
In every subproblem, for sequences longer than 640 tokens, just truncate the overflowing part.
Feel free to modify the keyword arguments of functions.

**Hint:** In the third subtask, you might want to use a hash table multiple times.
**Hint 2:** In the third subtask, when `k=640`, you should get the same results as in the second subtask.

### Report instructions
When you are done with the code, you can either:
- Display the benchmark results in a `pandas.DataFrame` in your `.ipynb` report
- Display the benchmark results in a table in your `.pdf` report

## Task 3 (5 points)
You are given a training script for a [Vision Transformer model](https://huggingface.co/docs/transformers/model_doc/vit) on the [Clothing dataset](https://www.kaggle.com/datasets/agrigorev/clothing-dataset-full).
In this task, you need to implement a custom profiler to measure the performance of PyTorch models at the layer level.
The profiler should track the execution time of each layer during the forward and backward passes and output the results in the trace event format.
You also need to examine the bottlenecks of the training pipeline, including the model and the training loop (you can use any profilers you want here).
The implementation of the model is based on the [`lucidrains/vit-pytorch`](https://github.com/lucidrains/vit-pytorch) repository.

### Task
- Implement a basic profiler: (**2.5 points**)
  - Implement a [context manager](https://book.pythontips.com/en/latest/context_managers.html) to collect execution times for each layer. You have a skeleton of the `Profile` class; feel free to modify or extend it. We are only doing **layer-level** profiling here (not kernel-level).
  - Support **profiling schedule phases** (e.g., wait, warmup, active), similar to the [PyTorch profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-profiler-to-analyze-long-running-jobs).
  - Implement a `to_perfetto` method that exports data in the [trace event format](https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview?tab=t.0#heading=h.yr4qxyxotyw), which is compatible with [Perfetto](https://ui.perfetto.dev/) (a minimal example of this format is shown after this list).
  - Profile a ViT model for several training iterations using your custom profiler. Visualize the results in the Perfetto UI. Compare your profiler's layer timings with those from the native PyTorch profiler (don’t forget a warm-up phase!). **Report** any differences you observe in the measured times.

- Profile CUDA kernels now: (**1 point**)
  - Update your profiler: insert **NVTX markers** via `torch.cuda.nvtx`. This will let you see **individual CUDA kernels** in the timeline when using Nsight Systems. **Remove any explicit synchronization**, because Nsight Systems can capture kernel timings directly from the GPU.
  - Run your script with **Nsight Systems**:
    ```bash
    nsys profile --env-var CUDA_VISIBLE_DEVICES="YOUR_GPU_ID" -o trace python3 main.py
    ```
  - Open the resulting **`.nsys-rep`** file in Nsight Systems. Examine kernel-level details in the GPU timeline. **Report** whether you see any timing differences compared to your earlier runs. If you see any differences, can you explain the reasons?

- Profile model performance during training, find the deliberate inefficiencies we've left in the code, and fix them: (**1.5 points**)
  - There are 6 inefficiencies in total; you will get 0.25 points for each one you find
  - We expect that in your analysis, you will not only examine the time and memory consumption, but also provide explanations of whether the obtained results are reasonable.
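
For reference, the trace event format that Perfetto consumes is plain JSON; a minimal sketch of what `to_perfetto` might emit (event names and timings here are made up) looks like this:

```python
import json

# Each layer measurement becomes one "complete" event ("ph": "X");
# timestamps ("ts") and durations ("dur") are in microseconds.
trace = {
    "traceEvents": [
        {"name": "Attention.forward", "ph": "X", "ts": 1000, "dur": 250, "pid": 0, "tid": 0},
        {"name": "Attention.backward", "ph": "X", "ts": 4000, "dur": 480, "pid": 0, "tid": 0},
    ]
}

with open("trace.json", "w") as f:
    json.dump(trace, f)  # this file can be opened in https://ui.perfetto.dev/
```

For the kernel-level part, NVTX ranges are simply `torch.cuda.nvtx.range_push("Attention.forward")` / `torch.cuda.nvtx.range_pop()` pairs around the regions you want to see in the Nsight Systems timeline.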

**Hints:**
- Use PyTorch's forward and backward hooks to collect execution times for each module in the model (a minimal sketch is shown after these hints).
- Use `torch.cuda.synchronize()` and `torch.cuda.Event()` correctly to ensure GPU kernels complete before recording events, since all GPU operations are asynchronous ([Asynchronous Execution](https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution)).
- Inefficiencies could be anywhere in the code: in data processing, in the model itself, in the training loop, and so on.
- You might want to look at the trace of operations instead of just per-operation profiling, as there is a lot of useful information.
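
A minimal sketch of the hook-plus-events idea from the hints (the naming and bookkeeping are illustrative, not the required `Profile` design):

```python
import torch

def attach_timing_hooks(model, records):
    """Record per-module forward times with CUDA events; `records` maps module name -> list of ms."""
    handles = []
    for name, module in model.named_modules():
        start_ev = torch.cuda.Event(enable_timing=True)
        end_ev = torch.cuda.Event(enable_timing=True)

        def pre_hook(mod, inputs, ev=start_ev):
            ev.record()  # enqueue a start marker on the current CUDA stream

        def post_hook(mod, inputs, output, ev_s=start_ev, ev_e=end_ev, name=name):
            ev_e.record()
            torch.cuda.synchronize()  # make sure both events have completed (this itself adds overhead!)
            records.setdefault(name, []).append(ev_s.elapsed_time(ev_e))  # milliseconds

        handles.append(module.register_forward_pre_hook(pre_hook))
        handles.append(module.register_forward_hook(post_hook))
    return handles  # call .remove() on these handles to detach the hooks

# Backward timings can be collected the same way with register_full_backward_pre_hook / register_full_backward_hook.
```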

### Report instructions
When you are done with your investigation and fixes, you can either:
- Report the profiler output AND a meaningful analysis of it in your `.ipynb` report file.
List the fixes you made to the code. Be sure to describe how you found them, why the code was inefficient (with profiler screenshots/outputs), and why the suggested fixes help.
- The same applies to the `.pdf` file, if you decide to submit your report in that format.
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
datasets==2.9.0
imageio==2.25.0
jpeg4py==0.1.4
nvprof==0.2
opencv-python==4.7.0.68
scikit-image==0.19.3
pandas==1.5.3
py-spy==0.3.14
einops==0.7.0
torch==2.4.0
torchtext
torchvision==0.17.0
tqdm==4.64.1
vit_pytorch==0.40.2
gdown==4.7.3
matplotlib==3.8.2
