# Week 4 home assignment

The assignment for this week consists of four parts: the first three are obligatory, and the fourth is a bonus one.
Include all the files with implemented functions/classes and the report for Tasks 2 and 4 in your submission.

## Task 1 (1 point)

Implement the function for deterministic sequential printing of N numbers for N processes,
using [sequential_print.py](./sequential_print.py) as a template.
You should be able to test it with `torchrun --nproc_per_node N sequential_print.py`.
Pay attention to the output format!
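
If you have not worked with `torch.distributed` before, the general shape of such a script is sketched below. This is only an illustration under the assumption of a `gloo` backend and barrier-based ordering; the printed string is a placeholder, so follow the format required by the template.

```python
import torch.distributed as dist

if __name__ == "__main__":
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for every process,
    # so init_process_group can pick them up from the environment.
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # One (not necessarily optimal) way to enforce a deterministic order:
    # ranks take turns printing while everyone else waits at a barrier.
    for current in range(world_size):
        if rank == current:
            print(f"Process {rank}")  # placeholder format, check the template!
        dist.barrier()

    dist.destroy_process_group()
```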

## Task 2 (7 points)

The pipeline you saw in the seminar shows only the basic building blocks of distributed training. Now, let's train
something actually interesting!

### SyncBatchNorm implementation
For this task, let's take the [CIFAR-100](https://pytorch.org/vision/0.8/datasets.html#torchvision.datasets.CIFAR100)
dataset and train a model with **synchronized** Batch Normalization: this version of the layer aggregates
the statistics **across all workers** during each forward pass.

Importantly, you have to call a communication primitive **only once** during each forward or backward pass;
if you use it more than once, you can earn at most 4 points for this task.
Additionally, you are **not allowed** to use internal PyTorch functions that compute the backward pass
of batch normalization: please implement it manually.
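
One way to satisfy the single-call constraint (a sketch of the idea, not a reference implementation) is to pack the per-worker sum, sum of squares, and sample count into one tensor and issue a single `all_reduce` per forward pass; the backward pass can pack its two reductions the same way. The helper below only covers the statistics/normalization part; the `torch.autograd.Function` wiring, affine parameters, and running statistics are left to you.

```python
import torch
import torch.distributed as dist


def sync_batch_stats(x: torch.Tensor, eps: float = 1e-5):
    """Compute global batch statistics with a single all_reduce.

    x has shape (batch, features); this is only a forward-pass sketch.
    """
    num_features = x.shape[1]
    local = torch.cat([
        x.sum(dim=0),                 # per-feature sum
        (x * x).sum(dim=0),           # per-feature sum of squares
        torch.tensor([x.shape[0]], dtype=x.dtype, device=x.device),  # local batch size
    ])
    dist.all_reduce(local, op=dist.ReduceOp.SUM)  # the only collective call

    n = local[-1]
    sum_, sq_sum = local[:num_features], local[num_features : 2 * num_features]
    mean = sum_ / n
    var = sq_sum / n - mean * mean    # biased variance, as in training-mode BN
    return (x - mean) / torch.sqrt(var + eps), mean, var
```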

### Reducing gradient synchronization
Also, implement a version of distributed training that is aware of **gradient accumulation**:
for every batch that does not end with an `optimizer.step` call, you do not need to run All-Reduce on the gradients at all.
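
A possible shape of the accumulation-aware update is sketched below; `model`, `loss_fn`, and the manual per-parameter All-Reduce are assumptions based on the seminar pipeline, not required names.

```python
import torch.distributed as dist


def training_step(model, loss_fn, batch, step_idx, accum_steps, optimizer):
    # Sketch: gradients are synchronized manually, so the all-reduce can simply
    # be skipped on micro-batches that do not call optimizer.step().
    inputs, targets = batch
    loss = loss_fn(model(inputs), targets) / accum_steps
    loss.backward()

    is_sync_step = (step_idx + 1) % accum_steps == 0
    if is_sync_step:
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size
        optimizer.step()
        optimizer.zero_grad()
```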

### Benchmarking the training pipeline
Compare the performance (in terms of speed, memory footprint, and final quality) of your distributed training
pipeline with the one that uses primitives from PyTorch (i.e., [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) **and** [torch.nn.SyncBatchNorm](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html)).
Compare the implementations by training with **at least two** processes, and use **at least two** gradient
accumulation steps in your pipeline.
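
For the PyTorch baseline, the usual recipe looks roughly like this; ResNet-18 and `inputs` are hypothetical stand-ins for your model and micro-batch, and `no_sync` is DDP's built-in counterpart of the accumulation-aware logic above.

```python
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank()          # equal to the local rank on a single node
torch.cuda.set_device(local_rank)

# Any model with BatchNorm layers works here; ResNet-18 is just an example.
model = torchvision.models.resnet18(num_classes=100).cuda()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = DDP(model, device_ids=[local_rank])

# DDP skips gradient synchronization inside this context manager.
with model.no_sync():
    loss = model(inputs).mean()       # `inputs` is a placeholder micro-batch
    loss.backward()
```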

### Tests for SyncBatchNorm
In addition, **test the SyncBN layer itself** by comparing its results with the standard **BatchNorm1d** while varying
the number of workers (1 and 4), the size of activations (128, 256, 512, 1024), and the batch size (32, 64).

Compare the results of the forward/backward passes in the following setup:
* FP32 inputs come from the standard Gaussian distribution;
* The loss function takes the outputs of batch normalization and computes the sum over all dimensions
for the first B/2 samples (B is the total batch size).

A working implementation of SyncBN should match the reference within a reasonably low `atol` (at most 1e-3) and with `rtol` equal to 0.

This test needs to be implemented via `pytest` in [test_syncbn.py](./test_syncbn.py): in particular, all the above
parameters (including the number of workers) need to be the inputs of that test.
Therefore, you will need to **start worker processes** within the test as well: `test_batchnorm` contains helpful
comments to get you started.
The test can be implemented to work only on the CPU for simplicity.
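
One possible layout for such a test is sketched below; the `gloo` backend, `torch.multiprocessing.spawn`, and the fixed master address/port are assumptions, and the comparison logic itself is elided.

```python
import os

import pytest
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size, hid_dim, batch_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)
    full_input = torch.randn(batch_size, hid_dim)        # same data on every worker
    shard = full_input.chunk(world_size, dim=0)[rank]     # each worker gets its slice
    # ... run your SyncBatchNorm on `shard` and a reference BatchNorm1d on `full_input`,
    # take the loss as the sum of outputs over the first batch_size // 2 samples,
    # and compare outputs/gradients with torch.allclose(..., atol=1e-3, rtol=0).
    dist.destroy_process_group()


@pytest.mark.parametrize("world_size", [1, 4])
@pytest.mark.parametrize("hid_dim", [128, 256, 512, 1024])
@pytest.mark.parametrize("batch_size", [32, 64])
def test_batchnorm(world_size, hid_dim, batch_size):
    mp.spawn(_worker, args=(world_size, hid_dim, batch_size), nprocs=world_size)
```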

### Performance benchmarks
Finally, measure the GPU time (with 2+ workers) and the memory footprint of the standard **SyncBatchNorm**
and of your implementation in the setup above: in total, you should have 8 speed/memory benchmarks for each implementation.
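
For the measurements themselves, CUDA events and the allocator statistics are usually sufficient; in the sketch below, `bn_layer` and `inputs` are placeholders for the benchmarked layer and its input batch.

```python
import torch

torch.cuda.reset_peak_memory_stats()
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
out = bn_layer(inputs)          # forward of the benchmarked SyncBN variant
out.sum().backward()            # or the B/2-sum loss from the test setup
end.record()
torch.cuda.synchronize()        # make sure all queued kernels have finished

elapsed_ms = start.elapsed_time(end)
peak_mem_mb = torch.cuda.max_memory_allocated() / 2**20
```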

### Submission format
Provide the results of your experiments in a `.ipynb`/`.pdf` report and attach it to your code
when submitting the homework.
Your report should include a brief experimental setup (if changed), results of all experiments **with the commands/code
to reproduce them**, and the infrastructure description (version of PyTorch, number of processes, type of GPUs, etc.).

Use [syncbn.py](./syncbn.py) and [ddp_cifar100.py](./ddp_cifar100.py) as a template.

## Task 3 (2 points)

Until now, we have only aggregated the gradients across different workers during training. But what if we want to run
distributed validation on a large dataset as well? In this assignment, you have to implement distributed metric
aggregation: shard the dataset across the workers (with [scatter](https://pytorch.org/docs/stable/distributed.html#torch.distributed.scatter)), compute the accuracy for each subset on
its respective worker, and then average the metric values on the master process.

Also, make one more quality-of-life improvement to the pipeline by logging the loss (and accuracy!)
only from the rank-0 process to avoid flooding the standard output of your training command.
Submit the training code that includes all enhancements from Tasks 2 and 3.
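
A sketch of how this aggregation might look is given below; it assumes the validation tensors are available on every rank (only rank 0's copy is actually scattered), that the dataset size is divisible by the number of workers, and that plain averaging of per-shard accuracies is acceptable.

```python
import torch
import torch.distributed as dist


def distributed_accuracy(model, val_inputs, val_targets):
    """Sketch: rank 0 scatters equal shards, each worker scores its shard,
    and the per-worker accuracies are averaged on the master process."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard = val_inputs.shape[0] // world_size  # assume divisibility for simplicity

    my_inputs = torch.empty(shard, *val_inputs.shape[1:], dtype=val_inputs.dtype)
    my_targets = torch.empty(shard, dtype=val_targets.dtype)
    inputs_list = list(val_inputs[: shard * world_size].chunk(world_size)) if rank == 0 else None
    targets_list = list(val_targets[: shard * world_size].chunk(world_size)) if rank == 0 else None
    dist.scatter(my_inputs, inputs_list, src=0)
    dist.scatter(my_targets, targets_list, src=0)

    with torch.no_grad():
        preds = model(my_inputs).argmax(dim=-1)
    local_acc = (preds == my_targets).float().mean().reshape(1)

    # Sum the per-worker accuracies on the master and average them there.
    dist.reduce(local_acc, dst=0, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(f"val accuracy: {local_acc.item() / world_size:.4f}")  # log from rank 0 only
```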

## Task 4 (bonus, 3 points)

Using [allreduce.py](./allreduce.py) as a template, implement the Ring All-Reduce algorithm
using only point-to-point communication primitives from `torch.distributed`.
Compare it with the provided implementation of Butterfly All-Reduce
and with `torch.distributed.all_reduce` in terms of CPU speed, memory usage, and the accuracy of averaging.
Specifically, compare the custom implementations of All-Reduce with 1–32 workers, and compare your implementation of
Ring All-Reduce with `torch.distributed.all_reduce` on 1–16 processes and vectors of 1,000–100,000 elements.
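
As a reminder of the communication pattern, a reduce-scatter phase followed by an all-gather phase over the ring looks roughly like the sketch below; it uses blocking `recv` with non-blocking `isend`, assumes the vector length is divisible by the number of processes, and its signature may differ from the one in the template.

```python
import torch
import torch.distributed as dist


def ring_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    """In-place ring all-reduce (average) sketch using only point-to-point ops."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    left, right = (rank - 1) % world_size, (rank + 1) % world_size
    chunks = list(tensor.chunk(world_size))
    recv_buf = torch.empty_like(chunks[0])

    # Phase 1: reduce-scatter. After world_size - 1 steps, every rank owns
    # one chunk that holds the sum over all processes.
    for step in range(world_size - 1):
        send_idx = (rank - step) % world_size
        recv_idx = (rank - step - 1) % world_size
        send_req = dist.isend(chunks[send_idx], dst=right)
        dist.recv(recv_buf, src=left)
        chunks[recv_idx] += recv_buf
        send_req.wait()

    # Phase 2: all-gather. The reduced chunks travel around the ring so that
    # in the end every rank holds the complete result.
    for step in range(world_size - 1):
        send_idx = (rank - step + 1) % world_size
        recv_idx = (rank - step) % world_size
        send_req = dist.isend(chunks[send_idx], dst=right)
        dist.recv(recv_buf, src=left)
        chunks[recv_idx].copy_(recv_buf)
        send_req.wait()

    tensor /= world_size  # the comparison is against averaging, not plain summation
    return tensor
```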