- [__Week 8:__](./week08_inference_software) __LLM inference optimizations and software__
  - Lecture: Inference speed metrics. KV caching, batch inference, continuous batching. FlashAttention with its modifications and PagedAttention. Overview of popular LLM serving frameworks.
  - Seminar: Basics of the Triton language. Layer fusion in PyTorch and Triton. Implementation of KV caching, FlashAttention in practice.
- [__Week 9:__](./week09_compression) __Efficient model inference__
  - Lecture: Hardware utilization metrics for deep learning. Knowledge distillation, quantization, LLM.int8(), SmoothQuant, GPTQ. Efficient model architectures. Speculative decoding.
  - Seminar: Measuring Memory Bandwidth Utilization in practice. Data-free quantization, GPTQ, and SmoothQuant in PyTorch.
  - Homework: see [homework/README.md](homework/README.md)

### Setup for the seminar notebook

You can use [conda](https://docs.anaconda.com/free/miniconda/), [mamba](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html) or [micromamba](https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html) to create the environment.

Implement models, training procedures, and benchmarks in `.py` files, run all the code in a Jupyter notebook, and convert it to PDF format.
Include your implementations and the report file in a `.zip` archive and submit it.

## Task 1: knowledge distillation for image classification (6 points)
0. Finetune ResNet101 on CIFAR10: change only the classification linear layer [*] and don't freeze other weights (**0 points**)

Then take an untrained ResNet101 model, remove the `layer3` block from it (except one conv block that produces the correct number of channels for `layer4`), and implement 3 training setups (a sketch of this model surgery is given after the footnote below):
1. Train the model on the input data only (**1 point**)
2. Train the model on the data and add a soft cross-entropy loss between the student (truncated ResNet101) and the teacher (finetuned full ResNet101) (**2 points**)
3. Train the model as in the previous subtask, but also add the MSE loss between the corresponding `layer1`, `layer2` and `layer4` features of the student and the teacher (**3 points**)
4. Report the test accuracy for each of the models

[\*] Vanilla ResNet is not very well suited for CIFAR: it downsamples the image by 32x, while CIFAR images are only 32x32 pixels. So you can either:
- upsample the images (easiest to implement, but you will perform more computations), or
- slightly change the first layers (e.g. make `model.conv1` a 3x3 convolution with stride 1 and remove `model.maxpool`); see the sketch below.
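
Below is a minimal sketch of the model surgery for subtask 0 and the truncated student, assuming torchvision's ResNet101 and the stem-modification option from the footnote (the helper names `adapt_for_cifar`, `build_teacher`, and `build_student` are illustrative, not part of the assignment; upsampling the images instead is equally valid):

```python
import torch.nn as nn
from torchvision.models import resnet101, ResNet101_Weights

NUM_CLASSES = 10  # CIFAR-10


def adapt_for_cifar(model: nn.Module) -> nn.Module:
    # CIFAR-friendly stem: 3x3 stride-1 convolution and no max-pooling,
    # so 32x32 inputs are not downsampled away before the residual stages.
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()
    # The change required by subtask 0: a 10-class classification head.
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
    return model


def build_teacher() -> nn.Module:
    # Subtask 0: start from pretrained weights and finetune everything (nothing is frozen).
    return adapt_for_cifar(resnet101(weights=ResNet101_Weights.IMAGENET1K_V1))


def build_student() -> nn.Module:
    # Untrained ResNet101 with `layer3` reduced to its first block only:
    # that block performs the 512 -> 1024 channel projection that `layer4` expects.
    model = adapt_for_cifar(resnet101(weights=None))
    model.layer3 = model.layer3[0]
    return model
```

With this stem the student and the teacher produce `layer1`/`layer2`/`layer4` feature maps of matching shapes, which subtask 3 relies on.
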
Feel free to use the dataset and model implementations from PyTorch.
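
For example, CIFAR-10 can be loaded straight from `torchvision` (a minimal sketch; the normalization constants, batch size, and the optional `Resize` for the upsampling route are my own choices, not prescribed by the assignment):

```python
import torch
from torchvision import datasets, transforms

# Add transforms.Resize(224) at the start of this pipeline if you picked the
# "upsample images" option from the footnote instead of changing the first layers.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

train_set = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)
```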

For the losses in the 2nd and 3rd subtasks, use the simple average of all inputs (i.e., weight all loss terms equally).
For the 3rd subtask, you will need to return not only the model's outputs but also the intermediate feature maps.
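
A minimal sketch of the combined objective, assuming the `build_teacher`/`build_student` helpers from the sketch above; forward hooks are just one way to expose the intermediate feature maps (returning them from a custom `forward()` works too):

```python
import torch
import torch.nn.functional as F

student, teacher = build_student(), build_teacher()
teacher.load_state_dict(torch.load("teacher_finetuned.pt"))  # checkpoint from subtask 0 (illustrative path)
teacher.eval()


def soft_cross_entropy(student_logits, teacher_logits):
    # Cross-entropy with the teacher's softmax outputs used as soft targets
    return -(F.softmax(teacher_logits, dim=1)
             * F.log_softmax(student_logits, dim=1)).sum(dim=1).mean()


def attach_feature_hooks(model, layer_names=("layer1", "layer2", "layer4")):
    # Stash the outputs of the requested layers on every forward pass
    feats = {}
    for name in layer_names:
        getattr(model, name).register_forward_hook(
            lambda _module, _inputs, output, name=name: feats.update({name: output})
        )
    return feats


student_feats, teacher_feats = attach_feature_hooks(student), attach_feature_hooks(teacher)


def training_step(images, labels):
    with torch.no_grad():
        teacher_logits = teacher(images)
    student_logits = student(images)
    terms = [
        F.cross_entropy(student_logits, labels),             # hard labels (subtasks 1-3)
        soft_cross_entropy(student_logits, teacher_logits),  # subtasks 2-3
    ]
    # Subtask 3 only: feature matching on layer1/layer2/layer4
    terms += [F.mse_loss(student_feats[n], teacher_feats[n])
              for n in ("layer1", "layer2", "layer4")]
    return sum(terms) / len(terms)  # simple average of all loss terms
```

For subtasks 1 and 2, simply drop the terms that do not apply (e.g. keep only the hard-label cross-entropy for subtask 1).
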
### Training setup
- Use the standard Adam optimizer without a scheduler.
- Use any suitable batch size from 128 to 512.
- Training stopping criterion: the test-set accuracy (measured from 0 to 1) stabilizes in the second digit after the decimal point for at least 2 epochs.
That means that you must satisfy the condition `torch.abs(acc - acc_prev) < 0.01` for at least two epochs in a row.
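
A sketch of a training loop that follows this setup, assuming the model/data/loss pieces from the sketches above; `evaluate` is a placeholder for your own test-accuracy computation (returning a value in [0, 1]), and the epoch cap of 100 is an arbitrary safety bound:

```python
import torch

optimizer = torch.optim.Adam(student.parameters())  # standard Adam, no scheduler

prev_acc, stable_epochs = None, 0
for epoch in range(100):  # safety bound; the criterion below usually stops training earlier
    student.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = training_step(images, labels)  # from the loss sketch above
        loss.backward()
        optimizer.step()

    student.eval()
    acc = evaluate(student, test_loader)  # placeholder: your test-accuracy function, in [0, 1]
    if prev_acc is not None and abs(acc - prev_acc) < 0.01:
        stable_epochs += 1  # accuracy is stable in the second decimal digit
    else:
        stable_epochs = 0
    prev_acc = acc
    if stable_epochs >= 2:  # stable for two epochs in a row -> stop
        break
```
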
## Task 2: use `deepsparse` to prune & quantize your model (4 points)
0. Please read the whole task description before starting it.
1. Install `deepsparse==1.7.0` and `sparseml==1.7.0`. Note: they might not work smoothly with the latest PyTorch versions; if so, you can downgrade to `torch==1.12.1`.
2. Take your best trained model from subtasks 1.1-1.3 and run pruning + quantization-aware training, adapting the following [example](./example_train_sparse_and_quantize.py). You will need to change/implement what is marked by `#TODO` and report the test accuracy of both models. (**3 points**)
3. Take the `onnx` baseline (the best trained model from subtasks 1.1-1.3) and the pruned-quantized version, and benchmark both models on the CPU using `deepsparse.benchmark` at batch sizes 1 and 32. (**1 point**)

For task 2.3, you may find [this page](https://web.archive.org/web/20240319095504/https://docs.neuralmagic.com/user-guides/deepsparse-engine/benchmarking/) helpful.

You should not use the training stopping criterion in this part, since the sparsification recipe relies on running for a certain number of epochs.

### Tips
- Debug your code with resnet18 to iterate faster
- Don't forget `model.eval()` before the ONNX export (see the sketch below)
- Don't forget `convert_qat=True` in `sparseml.pytorch.utils.export_onnx` after you have trained the model with quantization
- To visualize ONNX models, you can use [netron](https://netron.app/)
- Explicitly set the number of cores in `deepsparse.benchmark`
- If you are desperate and don't have time to train bigger models, submit this part with resnet18
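
A minimal sketch of the baseline ONNX export for task 2.3, assuming the trained student from the sketches above (or load your saved checkpoint); the file name and input size are illustrative, and the pruned-quantized model should instead be exported with `sparseml.pytorch.utils.export_onnx` and `convert_qat=True`, as noted in the tips:

```python
import torch

student.eval()  # export in inference mode
dummy_input = torch.randn(1, 3, 32, 32)  # CIFAR-sized input; adjust if you upsample images
torch.onnx.export(
    student,
    dummy_input,
    "baseline.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allows benchmarking at batch sizes 1 and 32
)
```

Both `.onnx` files can then be benchmarked on the CPU with the `deepsparse.benchmark` command-line tool; run `deepsparse.benchmark --help` to see how to set the batch size and the number of cores.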