README.md
@@ -6,7 +6,7 @@
 ### PyTorch-Native Training-to-Serving Model Optimization
 - Pre-train Llama-3.1-70B **1.5x faster** with float8 training
-- Recover **77% of quantized perplexity degradation** on Llama-3.2-3B with QAT
+- Recover **67% of quantized accuracy degradation** on Gemma3-4B with QAT
 - Quantize Llama-3-8B to int4 for **1.89x faster** inference with **58% less memory**

 <div align="center">
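The int4 serving claim above maps onto torchao's post-training, weight-only quantization API. A minimal sketch, assuming a recent torchao release where `Int4WeightOnlyConfig` is exposed (older releases spell this `int4_weight_only()`); the toy model is only a placeholder:

```python
# Minimal sketch: post-training int4 weight-only quantization with torchao.
# Assumes a recent torchao build where `Int4WeightOnlyConfig` exists; older
# releases expose the equivalent `int4_weight_only()` helper instead.
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

# Stand-in for a real checkpoint: any model with nn.Linear layers on CUDA.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# Replace linear weights with grouped int4 quantized tensors in place.
quantize_(model, Int4WeightOnlyConfig(group_size=128))

# Optional: compile for the fastest inference path.
model = torch.compile(model, mode="max-autotune")
```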
@@ -106,6 +106,7 @@ Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/
 TorchAO is integrated into some of the leading open-source libraries, including:

+* Unsloth for QAT, blog post coming soon!
 * HuggingFace transformers with a [built-in inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
 * HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
 * vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
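For the HuggingFace transformers integration listed above, quantization is requested at load time via `TorchAoConfig`. A minimal sketch, assuming the string-based config form; newer transformers/torchao versions also accept a torchao config object, and the checkpoint name is only a placeholder:

```python
# Minimal sketch: loading a causal LM with torchao quantization through the
# transformers integration. Argument forms vary across transformers/torchao
# versions; the checkpoint name below is only a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

quant_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
```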
 <summary><h3>Quantizer API (legacy)</h3></summary>

 Alternatively, torchao provides a few hardcoded quantization settings through
 the following Quantizers, but these may be removed soon:
@@ -191,8 +192,51 @@ model = qat_quantizer.prepare(model)
 train_loop(model)
 model = qat_quantizer.convert(model)
 ```
+
 </details>
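The `prepare`/`convert` calls visible in this hunk are the tail end of the legacy Quantizer example. For readability, a self-contained sketch of that flow, assuming the quantizer lives at `torchao.quantization.qat` in your release (older releases kept it under a `prototype` namespace):

```python
# Self-contained sketch of the legacy Quantizer QAT flow (prepare -> train ->
# convert). The import path and class name are assumptions based on recent
# torchao releases; check your installed version.
import torch
from torchao.quantization.qat import Int8DynActInt4WeightQATQuantizer

model = torch.nn.Sequential(torch.nn.Linear(512, 512))
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert fake-quantize ops so training sees int8/int4 quantization error.
model = qat_quantizer.prepare(model)

# train_loop(model)  # your usual fine-tuning loop goes here

# Swap fake-quantized modules for actually quantized ones for inference.
model = qat_quantizer.convert(model)
```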
-## torchtune integration
+## Axolotl integration
+
+[Axolotl](https://github.com/axolotl-ai-cloud) uses TorchAO to support quantization-aware fine-tuning. You can use the following commands to fine-tune and then quantize a Llama-3.2-3B model:
+
+```bash
+axolotl train examples/llama-3/3b-qat-fsdp2.yaml
+# once training is complete, perform the quantization step
+axolotl quantize examples/llama-3/3b-qat-fsdp2.yaml
+# you should now have a quantized model saved in ./outputs/qat_out/quantized
+```
+
+Please see the [QAT documentation](https://docs.axolotl.ai/docs/qat.html) in axolotl for more details.
+
+## Unsloth integration
+
+[Unsloth](https://github.com/unslothai/unsloth) also leverages TorchAO for quantization-aware fine-tuning. Unsloth's QAT support can be used with both full and LoRA fine-tuning. For example:
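As a rough illustration of what a QAT fine-tuning setup with Unsloth can look like, here is a minimal sketch; the `qat_scheme` argument (and its value) is an assumption about Unsloth's QAT API, and the checkpoint name is a placeholder, so treat the linked notebook as the authoritative recipe:

```python
# Rough sketch of QAT fine-tuning with Unsloth. The `qat_scheme` argument and
# its value are assumptions about Unsloth's QAT API and may differ in your
# installed version; the checkpoint name is a placeholder.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-14B",  # placeholder checkpoint
    max_seq_length=2048,
    load_in_4bit=False,       # QAT starts from full-precision weights
    full_finetuning=True,     # or wrap with get_peft_model(...) for LoRA
    qat_scheme="int4",        # assumed name/value for the QAT recipe
)

# Hand `model` to your usual SFT/trainer loop, as in the Unsloth notebooks.
```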
 For a full notebook example, see: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(14B)-Reasoning-Conversational.ipynb. A QAT-specific notebook is coming soon.
 For more detail, please refer to [this QAT tutorial](https://pytorch.org/torchtune/main/tutorials/qat_finetune.html).
+
 </details>

-## Axolotl integration
+## Evaluation Results
-[Axolotl](https://github.com/axolotl-ai-cloud) uses torchao to support quantized-aware fine-tuning. You can use the following commands to fine-tune, and then quantize a Llama-3.2-3B model:
+Int4 weight-only QAT + LoRA using a group size of 128, fine-tuned using Unsloth.
+Both fine-tuning and evaluation were done on a single H100 GPU using the
-Results for int4 per group weights, using a learning rate of 2e-6. For this quantization scheme, the quantized path uses the more efficient [int4 tinygemm kernel](https://github.com/pytorch/pytorch/blob/a672f6c84e318bbf455f13dfdd3fd7c68a388bf5/aten/src/ATen/native/cuda/int4mm.cu#L1097).