# HF -> torchao -> vLLM convenience scripts

## HF -> torchao -> vLLM

```bash
# save a quantized model to data/torchao/nvfp4-Qwen1.5-MoE-A2.7B
# models tested: Qwen1.5-MoE-A2.7B (small MoE), facebook/opt-125m (tiny dense)
# quant_type values tested: fp8 (meaning fp8 rowwise here) and nvfp4 (without any calibration)
python quantize_hf_model_with_torchao.py --model_name "Qwen/Qwen1.5-MoE-A2.7B" --experts_only_qwen_1_5_moe_a_2_7b True --save_model_to_disk True --quant_type nvfp4

# run the model from above in vLLM
# requires https://github.com/vllm-project/vllm/pull/26095 (currently not landed)
python run_quantized_model_in_vllm.py --model_name "data/torchao/nvfp4-Qwen1.5-MoE-A2.7B" --compile False
```
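
For reference, here is a minimal Python sketch of the quantize-and-save step, assuming recent `transformers` and `torchao`. It shows the `fp8` (fp8 rowwise) path rather than `nvfp4`, and it mirrors what the convenience script does conceptually, not its exact code; the model name and save path are illustrative.

```python
import torch
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_name = "facebook/opt-125m"  # tiny dense model from the tested list
# fp8 rowwise: dynamic fp8 activations + fp8 weights, with per-row scales
quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# torchao quantized weights are tensor subclasses, which may not
# serialize to safetensors; save with safe_serialization=False
save_dir = "data/torchao/fp8-opt-125m"
model.save_pretrained(save_dir, safe_serialization=False)
tokenizer.save_pretrained(save_dir)
```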

## HF -> torchao -> compressed_tensors checkpoint -> vLLM

```bash
# save a quantized model to data/torchao/fp8-opt-125m
# models tested: Qwen1.5-MoE-A2.7B (small MoE), facebook/opt-125m (tiny dense)
# quant_type values tested: fp8 (meaning fp8 rowwise here) and nvfp4 (without any calibration);
# note: nvfp4 on the MoE model currently leads to an error in vLLM
python quantize_hf_model_with_torchao.py --model_name "facebook/opt-125m" --experts_only_qwen_1_5_moe_a_2_7b False --save_model_to_disk True --quant_type fp8

# (optional) save a quantized model with llm-compressor to data/llmcompressor/fp8-opt-125m
python quantize_hf_model_with_llm_compressor.py --model_name facebook/opt-125m --quant_type fp8

# (optional) inspect the torchao and compressed-tensors checkpoints
python inspect_torchao_output.py --dir_name data/torchao/fp8-opt-125m
python inspect_llm_compressor_output.py --dir_name data/llmcompressor/fp8-opt-125m

# convert the torchao checkpoint to compressed-tensors format,
# validating the result against the llm-compressor checkpoint
python convert_torchao_checkpoint_to_compressed_tensors.py --dir_source data/torchao/fp8-opt-125m --dir_target data/torchao_compressed_tensors/fp8-opt-125m --dir_validation data/llmcompressor/fp8-opt-125m

# run the converted model in vLLM
python run_quantized_model_in_vllm.py --model_name "data/torchao_compressed_tensors/fp8-opt-125m" --compile False
```
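
If you prefer driving vLLM from Python rather than the helper script, here is a minimal sketch of loading the converted checkpoint with vLLM's offline API; `enforce_eager=True` is assumed to play the role of `--compile False`, and the prompt is illustrative only.

```python
from vllm import LLM, SamplingParams

# load the compressed-tensors checkpoint produced by the conversion step
llm = LLM(model="data/torchao_compressed_tensors/fp8-opt-125m", enforce_eager=True)
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```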

## Code Quality & Linting

```bash