# HF -> torchao -> vLLM convenience scripts

## hf -> torchao -> vLLM

```bash
# save a quantized model to data/torchao/nvfp4-Qwen1.5-MoE-A2.7B
# models tested: Qwen1.5-MoE-A2.7B (small MoE), facebook/opt-125m (tiny dense)
# quant_type tested: fp8 (which means fp8 rowwise here), nvfp4 (without any calibration)
python quantize_hf_model_with_torchao.py --model_name "Qwen/Qwen1.5-MoE-A2.7B" --experts_only_qwen_1_5_moe_a_2_7b True --save_model_to_disk True --quant_type nvfp4

# run the model from above in vLLM
# requires https://github.com/vllm-project/vllm/pull/26095 (currently not landed)
python run_quantized_model_in_vllm.py --model_name "data/torchao/nvfp4-Qwen1.5-MoE-A2.7B" --compile False
```
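
For orientation, here is a minimal Python sketch of one plausible shape for the quantize step above, using transformers' `TorchAoConfig` integration with a torchao fp8 rowwise config. The model name, output path, and config details are illustrative assumptions, not the script's actual internals; the real script also handles options such as `--experts_only_qwen_1_5_moe_a_2_7b`.

```python
# Hypothetical sketch (not the actual script): quantize an HF model with
# torchao via transformers, then save the checkpoint to disk.
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# fp8 rowwise (dynamic activation + rowwise weight) quantization,
# roughly matching --quant_type fp8 above
quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",          # assumed model for illustration
    torch_dtype="bfloat16",
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# torchao checkpoints hold tensor subclasses, so plain torch serialization
# (safe_serialization=False) is the safe default here
model.save_pretrained("data/torchao/fp8-opt-125m", safe_serialization=False)
tokenizer.save_pretrained("data/torchao/fp8-opt-125m")
```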

## hf -> torchao -> compressed_tensors checkpoint -> vLLM

```bash
# save a quantized model to data/torchao/fp8-opt-125m
# models tested: Qwen1.5-MoE-A2.7B (small MoE), facebook/opt-125m (tiny dense)
# quant_type tested: fp8 (which means fp8 rowwise here), nvfp4 (without any calibration). Note that nvfp4 on the MoE model leads to an error in vLLM.
python quantize_hf_model_with_torchao.py --model_name "facebook/opt-125m" --experts_only_qwen_1_5_moe_a_2_7b False --save_model_to_disk True --quant_type fp8

# (optional) save a quantized model with llm-compressor to data/llmcompressor/fp8-opt-125m
python quantize_hf_model_with_llm_compressor.py --model_name facebook/opt-125m --quant_type fp8

# (optional) inspect the torchao and compressed-tensors checkpoints
python inspect_torchao_output.py --dir_name data/torchao/fp8-opt-125m
python inspect_llm_compressor_output.py --dir_name data/llmcompressor/fp8-opt-125m

# convert the torchao checkpoint to compressed-tensors format
python convert_torchao_checkpoint_to_compressed_tensors.py --dir_source data/torchao/fp8-opt-125m --dir_target data/torchao_compressed_tensors/fp8-opt-125m --dir_validation data/llmcompressor/fp8-opt-125m

# run the converted model in vLLM
python run_quantized_model_in_vllm.py --model_name "data/torchao_compressed_tensors/fp8-opt-125m" --compile False
```
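
When a conversion misbehaves, it can help to look at the raw checkpoint directly rather than through the inspect scripts. A minimal sketch, assuming the converted checkpoint stores its weights in a single `model.safetensors` file (the filename is an assumption for a model this small):

```python
# Hypothetical sketch: list tensor names, shapes, and dtypes in a
# compressed-tensors checkpoint to verify the conversion (e.g. that
# fp8 weights and their scales are present).
from safetensors import safe_open

path = "data/torchao_compressed_tensors/fp8-opt-125m/model.safetensors"  # assumed filename
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}")
```
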
## Code Quality & Linting

```bash
