Commit 5c4f245

Add week04 materials
1 parent fa050e1 commit 5c4f245

File tree

12 files changed: +1226 -1 lines


README.md

Lines changed: 3 additions & 1 deletion
@@ -13,7 +13,9 @@ __This branch corresponds to the ongoing 2025 course. If you want to see full ma
  - [__Week 3:__](./week03_fast_pipelines) __Training optimizations, FP16/BF16/FP8 formats, profiling deep learning code__
    - Lecture: Measuring performance of GPU-accelerated software. Mixed-precision training. Data storage and loading optimizations. Tools for profiling deep learning workloads.
    - Seminar: Automatic Mixed Precision in PyTorch. Dynamic padding for sequence data and JPEG decoding benchmarks. Basics of profiling with py-spy, PyTorch Profiler, Memory Snapshot and Nsight Systems.
- - __Week 4:__ __Data-parallel training and All-Reduce__
+ - [__Week 4:__](./week04_data_parallel) __Data-parallel training and All-Reduce__
+   - Lecture: Introduction to distributed training. Data-parallel training of neural networks. All-Reduce and its efficient implementations.
+   - Seminar: Introduction to PyTorch Distributed. Data-parallel training primitives.
  - __Week 5:__ __Sharded data-parallel training, distributed training optimizations__
  - __Week 6:__ __Training large models__
  - __Week 7:__ __Python web application deployment__

week04_data_parallel/README.md

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
# Week 4: Data-parallel training and All-Reduce

* Lecture: [link](./lecture.pdf)
* Seminar: [link](./practice.ipynb)
* Homework: see the [homework](./homework) folder

## Further reading
* [Numba parallel](https://numba.pydata.org/numba-doc/dev/user/parallel.html) - a way to write threaded parallel code in Python that is not limited by the GIL (see the short sketch after this list)
* [joblib](https://joblib.readthedocs.io/) - a library of multiprocessing primitives similar to `mp.Pool`, but with some extra conveniences
* [BytePS paper](https://www.usenix.org/system/files/osdi20-jiang.pdf)
* Alternative lecture: [Parameter servers](https://www.youtube.com/watch?v=N241lmq5mqk) from CMU 10-605
* Alternative seminar: Python multiprocessing - [playlist](https://www.youtube.com/watch?v=RR4SoktDQAw&list=PL5tcWHG-UPH3SX16DI6EP1FlEibgxkg_6)
* [Python multiprocessing docs](https://docs.python.org/3/library/multiprocessing.html) (pay attention to `fork` vs `spawn`!)
* [PyTorch Distributed tutorial](https://pytorch.org/tutorials/intermediate/dist_tuto.html)
* [Collective communication protocols in NCCL](https://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf)
* There are many more links on the slides; please check the PDF.
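As a quick illustration of the Numba item above, here is a minimal parallel-loop sketch (it assumes `numba` and `numpy` are installed; the function name is made up):

```python
import numpy as np
from numba import njit, prange


@njit(parallel=True)  # the loop body is compiled and runs on multiple threads, outside the GIL
def parallel_sum_of_squares(x):
    total = 0.0
    for i in prange(x.shape[0]):  # prange marks the loop as parallel
        total += x[i] * x[i]      # scalar reductions are handled automatically
    return total


print(parallel_sum_of_squares(np.random.rand(1_000_000)))
```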
Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Week 4 home assignment

The assignment for this week consists of four parts: the first three are obligatory, and the fourth is a bonus.
Include all files with the implemented functions/classes and the report for Tasks 2 and 4 in your submission.

## Task 1 (1 point)

Implement a function for deterministic sequential printing of N numbers by N processes,
using [sequential_print.py](./sequential_print.py) as a template.
You should be able to test it with `torchrun --nproc_per_node N sequential_print.py`.
Pay attention to the output format!

## Task 2 (7 points)

The pipeline you saw in the seminar shows only the basic building blocks of distributed training. Now, let's train
something actually interesting!

### SyncBatchNorm implementation
For this task, let's take the [CIFAR-100](https://pytorch.org/vision/0.8/datasets.html#torchvision.datasets.CIFAR100)
dataset and train a model with **synchronized** Batch Normalization: this version of the layer aggregates
the statistics **across all workers** during each forward pass.

Importantly, you may call a communication primitive **only once** during each forward or backward pass;
if you use it more than once, you will earn at most 4 points for this task.
Additionally, you are **not allowed** to use internal PyTorch functions that compute the backward pass
of batch normalization: implement it manually.
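
One possible shape of the forward-pass statistics aggregation that respects the single-call constraint, as a minimal sketch: the helper name is made up, it assumes a 2D input of shape `[batch, features]` and an already initialized process group, and it omits running statistics, affine parameters, and the backward pass.

```python
import torch
import torch.distributed as dist


def syncbn_forward(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Pack per-worker sum, sum of squares, and sample count into one tensor,
    # so a single all_reduce yields the global batch statistics.
    count = torch.tensor([x.shape[0]], dtype=x.dtype)
    stats = torch.cat([x.sum(dim=0), (x ** 2).sum(dim=0), count])
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # the only communication call

    num_features = x.shape[1]
    total_count = stats[-1]
    mean = stats[:num_features] / total_count
    sq_mean = stats[num_features : 2 * num_features] / total_count
    var = sq_mean - mean ** 2  # biased variance, as in training-mode BatchNorm

    return (x - mean) / torch.sqrt(var + eps)
```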

### Reducing gradient synchronization
Also, implement a version of distributed training that is aware of **gradient accumulation**:
for every batch that does not call `optimizer.step`, you do not need to run All-Reduce on the gradients at all.
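
A minimal sketch of what this can look like in the manual pipeline (the loop details are placeholders, not a required interface); the DDP baseline achieves the same effect with its `no_sync()` context manager.

```python
import torch
import torch.distributed as dist


def train_with_accumulation(model, loader, optimizer, criterion, accum_steps: int):
    """Run gradient All-Reduce only on the steps that actually call optimizer.step()."""
    model.train()
    optimizer.zero_grad()
    for step, (data, target) in enumerate(loader):
        loss = criterion(model(data), target) / accum_steps
        loss.backward()  # gradients accumulate locally in .grad

        if (step + 1) % accum_steps == 0:
            # The only communication per effective batch: average .grad across workers.
            for param in model.parameters():
                if param.grad is not None:
                    dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                    param.grad /= dist.get_world_size()
            optimizer.step()
            optimizer.zero_grad()
```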

### Benchmarking the training pipeline
Compare the performance (in terms of speed, memory footprint, and final quality) of your distributed training
pipeline with the one that uses primitives from PyTorch (i.e., [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel) **and** [torch.nn.SyncBatchNorm](https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html)).
You need to compare the implementations by training with **at least two** processes, and your pipeline needs to have
at least 2 gradient accumulation steps.
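
For the PyTorch side of this comparison, the two primitives are combined roughly as follows (a sketch assuming one GPU per process and an already initialized NCCL process group, since `SyncBatchNorm` under DDP expects CUDA modules):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def build_baseline(model: nn.Module, local_rank: int) -> nn.Module:
    # Replace every BatchNorm*d module with SyncBatchNorm...
    model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.to(torch.device("cuda", local_rank))
    # ...and let DDP handle gradient averaging (one bucketed All-Reduce per backward pass).
    return DistributedDataParallel(model, device_ids=[local_rank])
```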

### Tests for SyncBatchNorm
In addition, **test the SyncBN layer itself** by comparing its results with those of standard **BatchNorm1d** while varying
the number of workers (1 and 4), the size of activations (128, 256, 512, 1024), and the batch size (32, 64).

Compare the results of the forward/backward passes in the following setup:
* FP32 inputs come from the standard Gaussian distribution;
* The loss function takes the outputs of batch normalization and computes the sum over all dimensions
for the first B/2 samples (B is the total batch size).

A working implementation of SyncBN should pass the comparison with `rtol` equal to 0 and a reasonably low `atol` (1e-3 or tighter).

This test needs to be implemented via `pytest` in [test_syncbn.py](./test_syncbn.py): in particular, all the above
parameters (including the number of workers) need to be inputs of that test.
Therefore, you will need to **start worker processes** within the test as well: `test_batchnorm` contains helpful
comments to get you started.
For simplicity, the test may work only on the CPU.
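
A possible skeleton for such a test, as a sketch only: the worker body and the exact comparison are up to you, the port is fixed for brevity, and the `gloo` backend keeps everything on the CPU.

```python
import pytest
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size, hid_dim, batch_size, port):
    dist.init_process_group(
        "gloo", init_method=f"tcp://127.0.0.1:{port}", rank=rank, world_size=world_size
    )
    torch.manual_seed(0)
    x = torch.randn(batch_size, hid_dim)
    # ... run SyncBN on this worker's shard of x, BatchNorm1d on the full batch,
    # and compare outputs/gradients, e.g. with torch.testing.assert_close(atol=1e-3, rtol=0) ...
    dist.destroy_process_group()


@pytest.mark.parametrize("world_size", [1, 4])
@pytest.mark.parametrize("hid_dim", [128, 256, 512, 1024])
@pytest.mark.parametrize("batch_size", [32, 64])
def test_batchnorm(world_size, hid_dim, batch_size):
    # Spawn worker processes inside the test; each one joins the same process group.
    mp.spawn(_worker, args=(world_size, hid_dim, batch_size, 29500),
             nprocs=world_size, join=True)
```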

### Performance benchmarks
Finally, measure the GPU time (with 2+ workers) and the memory footprint of the standard **SyncBatchNorm**
and of your implementation in the setup above: in total, you should have 8 speed/memory benchmarks for each implementation.
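
One way to collect these numbers (a sketch; `run_forward_backward` stands for a hypothetical callable that performs one forward/backward pass of the layer under test):

```python
import torch


def benchmark_gpu(run_forward_backward, num_iters: int = 100):
    """Returns (average milliseconds per iteration, peak GPU memory in bytes)."""
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    for _ in range(10):  # warm-up iterations are excluded from timing
        run_forward_backward()
    torch.cuda.synchronize()

    start.record()
    for _ in range(num_iters):
        run_forward_backward()
    end.record()
    torch.cuda.synchronize()

    return start.elapsed_time(end) / num_iters, torch.cuda.max_memory_allocated()
```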

### Submission format
Provide the results of your experiments in an `.ipynb`/`.pdf` report and attach it to your code
when submitting the homework.
Your report should include a brief experimental setup (if changed), the results of all experiments **with the commands/code
to reproduce them**, and a description of the infrastructure (version of PyTorch, number of processes, type of GPUs, etc.).

Use [syncbn.py](./syncbn.py) and [ddp_cifar100.py](./ddp_cifar100.py) as templates.

## Task 3 (2 points)

Until now, we have only aggregated the gradients across different workers during training. But what if we want to run
distributed validation on a large dataset as well? In this task, you have to implement distributed metric
aggregation: shard the dataset across the workers (with [scatter](https://pytorch.org/docs/stable/distributed.html#torch.distributed.scatter)), compute accuracy for each subset on
its respective worker, and then average the metric values on the master process.

Also, make one more quality-of-life improvement to the pipeline by logging the loss (and accuracy!)
only from the rank-0 process, to avoid flooding the standard output of your training command.
Submit training code that includes all the enhancements from Tasks 2 and 3.
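
A sketch of the final aggregation step (it assumes every worker has already computed accuracy on its own shard; the helper name is made up):

```python
import torch
import torch.distributed as dist


def report_accuracy(local_accuracy: float) -> None:
    # Average per-worker accuracies on the master process (rank 0) and log only there.
    acc = torch.tensor([local_accuracy])
    dist.reduce(acc, dst=0, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        print(f"Validation accuracy: {(acc / dist.get_world_size()).item():.4f}")
```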

## Task 4 (bonus, 3 points)

Using [allreduce.py](./allreduce.py) as a template, implement the Ring All-Reduce algorithm
using only point-to-point communication primitives from `torch.distributed`.
Compare it with the provided implementation of Butterfly All-Reduce
and with `torch.distributed.all_reduce` in terms of CPU speed, memory usage, and the accuracy of averaging.
Specifically, compare the custom implementations of All-Reduce with 1–32 workers, and compare your implementation of
Ring All-Reduce with `torch.distributed.all_reduce` on 1–16 processes and vectors of 1,000–100,000 elements.
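
The point-to-point primitives to build on are `dist.send`/`dist.recv` and their asynchronous counterparts; below is a minimal sketch of a single chunk exchange between ring neighbours, not the full algorithm.

```python
import torch
import torch.distributed as dist


def exchange_with_neighbours(chunk: torch.Tensor, rank: int, size: int) -> torch.Tensor:
    """Send `chunk` to the next rank in the ring and receive one from the previous rank."""
    received = torch.empty_like(chunk)
    send_req = dist.isend(chunk, dst=(rank + 1) % size)  # non-blocking send avoids deadlock
    dist.recv(received, src=(rank - 1) % size)           # blocking receive from the left neighbour
    send_req.wait()
    return received
```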
Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
import os
import random

import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def init_process(rank, size, fn, master_port, backend="gloo"):
    """Initialize the distributed environment."""
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


def butterfly_allreduce(send, rank, size):
    """
    Performs Butterfly All-Reduce over the process group. Modifies the input tensor in place.
    Args:
        send: torch.Tensor to be averaged with other processes.
        rank: Current process rank (in a range from 0 to size - 1)
        size: Number of workers
    """

    buffer_for_chunk = torch.empty((size,), dtype=torch.float)

    send_futures = []

    for i, elem in enumerate(send):
        if i != rank:
            send_futures.append(dist.isend(elem, i))

    recv_futures = []

    for i, elem in enumerate(buffer_for_chunk):
        if i != rank:
            recv_futures.append(dist.irecv(elem, i))
        else:
            elem.copy_(send[i])

    for future in recv_futures:
        future.wait()

    # compute the average
    torch.mean(buffer_for_chunk, dim=0, out=send[rank])

    for i in range(size):
        if i != rank:
            send_futures.append(dist.isend(send[rank], i))

    recv_futures = []

    for i, elem in enumerate(send):
        if i != rank:
            recv_futures.append(dist.irecv(elem, i))

    for future in recv_futures:
        future.wait()
    for future in send_futures:
        future.wait()


def ring_allreduce(send, rank, size):
    """
    Performs Ring All-Reduce over the process group. Modifies the input tensor in place.
    Args:
        send: torch.Tensor to be averaged with other processes.
        rank: Current process rank (in a range from 0 to size - 1)
        size: Number of workers
    """
    pass


def run_butterfly_allreduce(rank, size):
    """Runs Butterfly All-Reduce on a random tensor and prints the result."""
    torch.manual_seed(rank)
    tensor = torch.randn((size,), dtype=torch.float)
    print("Rank ", rank, " has data ", tensor)
    butterfly_allreduce(tensor, rank, size)
    print("Rank ", rank, " has data ", tensor)


if __name__ == "__main__":
    size = 5
    processes = []
    port = random.randint(25000, 30000)
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run_butterfly_allreduce, port))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision.datasets import CIFAR100

torch.set_num_threads(1)


def init_process(local_rank, fn, backend="nccl"):
    """Initialize the distributed environment."""
    dist.init_process_group(backend, rank=local_rank)
    size = dist.get_world_size()
    fn(local_rank, size)


class Net(nn.Module):
    """
    A very simple model with minimal changes from the tutorial, used for the sake of simplicity.
    Feel free to replace it with EffNetV2-XL once you get comfortable injecting SyncBN into models programmatically.
    """

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 32, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(6272, 128)
        self.fc2 = nn.Linear(128, 100)
        self.bn1 = nn.BatchNorm1d(128, affine=False)  # to be replaced with SyncBatchNorm

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)

        x = self.conv2(x)
        x = F.relu(x)

        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)

        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = self.bn1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        output = self.fc2(x)
        return output


def average_gradients(model):
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size


def run_training(rank, size):
    torch.manual_seed(0)

    dataset = CIFAR100(
        "./cifar",
        transform=transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.5071, 0.4867, 0.4408), (0.2675, 0.2565, 0.2761)),
            ]
        ),
        download=True,
    )
    # where's the validation dataset?
    loader = DataLoader(dataset, sampler=DistributedSampler(dataset, size, rank), batch_size=64)

    model = Net()
    device = torch.device("cpu")  # replace with "cuda" afterwards
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    num_batches = len(loader)

    for _ in range(10):
        epoch_loss = torch.zeros((1,), device=device)

        for data, target in loader:
            data = data.to(device)
            target = target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = torch.nn.functional.cross_entropy(output, target)
            epoch_loss += loss.detach()
            loss.backward()
            average_gradients(model)
            optimizer.step()

            acc = (output.argmax(dim=1) == target).float().mean()

            print(f"Rank {dist.get_rank()}, loss: {epoch_loss / num_batches}, acc: {acc}")
            epoch_loss = 0
    # where's the validation loop?


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    init_process(local_rank, fn=run_training, backend="gloo")  # replace with "nccl" when testing on several GPUs
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
pytest==8.3.4
torch==2.4.0
torchvision==0.19.0
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
import os

import torch.distributed as dist


def run_sequential(rank, size, num_iter=10):
    """
    Prints the process rank sequentially according to its number over `num_iter` iterations,
    separating the output for each iteration by `---`
    Example (3 processes, num_iter=2):
    ```
    Process 0
    Process 1
    Process 2
    ---
    Process 0
    Process 1
    Process 2
    ```
    """

    pass


if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(rank=local_rank, backend="gloo")

    run_sequential(local_rank, dist.get_world_size())
