Commit 7e5d907

Update TorchAO README inference section before PTC (#3206)
Summary: att
Test Plan: visual inspection
1 parent d17d446 commit 7e5d907

File tree: 4 files changed (+57 −98 lines)

- README.md
- docs/source/api_ref_quantization.rst
- docs/source/quick_start.rst
- docs/source/serving.rst
README.md

Lines changed: 38 additions & 46 deletions
@@ -26,6 +26,7 @@

 - [Oct 20] MXFP8 MoE training prototype achieved **~1.45x speedup** for MoE layer in Llama4 Scout, and **~1.25x** speedup for MoE layer in DeepSeekV3 671b - with comparable numerics to bfloat16! Check out the [docs](./torchao/prototype/moe_training/) to try it out.
 - [Sept 25] MXFP8 training achieved [1.28x speedup on Crusoe B200 cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) with virtually identical loss curve to bfloat16!
+- [Sept 19] [TorchAO Quantized Model and Quantization Recipes Now Available on Huggingface Hub](https://pytorch.org/blog/torchao-quantized-models-and-quantization-recipes-now-available-on-huggingface-hub/)!
 - [Jun 25] Our [TorchAO paper](https://openreview.net/attachment?id=HpqH0JakHf&name=pdf) was accepted to CodeML @ ICML 2025!
 - [May 25] QAT is now integrated into [Axolotl](https://github.com/axolotl-ai-cloud/axolotl) for fine-tuning ([docs](https://docs.axolotl.ai/docs/qat.html))!
 - [Apr 25] Float8 rowwise training yielded [1.34-1.43x training speedup](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/) at 2k H100 GPU scale
@@ -59,13 +60,6 @@ TorchAO is an easy to use quantization library for native PyTorch. TorchAO works

 Check out our [docs](https://docs.pytorch.org/ao/main/) for more details!

-From the team that brought you the fast series:
-* 9.5x inference speedups for Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
-* 10x inference speedups for Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
-* 3x inference speedup for Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3) (new: [flux-fast](https://pytorch.org/blog/presenting-flux-fast-making-flux-go-brrr-on-h100s/))
-* 2.7x inference speedup for FAIR’s Seamless M4T-v2 model with [seamlessv2-fast](https://pytorch.org/blog/accelerating-generative-ai-4/)
-
-
 ## 🚀 Quick Start

 First, install TorchAO. We recommend installing the latest stable version:
@@ -76,20 +70,9 @@ pip install torchao
 Quantize your model weights to int4!
 ```python
 from torchao.quantization import Int4WeightOnlyConfig, quantize_
-quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
-```
-Compared to a `torch.compiled` bf16 baseline, your quantized model should be significantly smaller and faster on a single A100 GPU:
-```bash
-int4 model size: 1.25 MB
-bfloat16 model size: 4.00 MB
-compression ratio: 3.2
-
-bf16 mean time: 30.393 ms
-int4 mean time: 4.410 ms
-speedup: 6.9x
+quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))
 ```
-See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details. Alternatively, try quantizing your favorite model using our [HuggingFace space](https://huggingface.co/spaces/pytorch/torchao-my-repo)!
-
+See our [quick start guide](https://docs.pytorch.org/ao/stable/quick_start.html) for more details.
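For context on the new arguments, here is a minimal end-to-end sketch of how a call like this is typically exercised; the toy two-layer model, the tensor shapes, and the CUDA/bfloat16 setup are illustrative assumptions, not part of the README:

```python
# Minimal sketch (toy model and shapes assumed): int4 weight-only quantization end to end.
import torch
from torch import nn
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# A toy network standing in for a real model; only nn.Linear layers are quantized.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
model = model.to(torch.bfloat16).cuda()

# In-place weight quantization with the config shown above.
quantize_(
    model,
    Int4WeightOnlyConfig(
        group_size=32,
        int4_packing_format="tile_packed_to_4d",
        int4_choose_qparams_algorithm="hqq",
    ),
)

# Compile and run inference as usual.
model = torch.compile(model)
x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    y = model(x)
```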

 ## 🛠 Installation

@@ -103,16 +86,18 @@ pip install torchao

 ```
 # Nightly
-pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128

 # Different CUDA versions
 pip install torchao --index-url https://download.pytorch.org/whl/cu126 # CUDA 12.6
+pip install torchao --index-url https://download.pytorch.org/whl/cu129 # CUDA 12.9
 pip install torchao --index-url https://download.pytorch.org/whl/cpu # CPU only

 # For developers
 USE_CUDA=1 python setup.py develop
 USE_CPP=0 python setup.py develop
 ```
+
 </details>

 Please see the [torchao compatibility table](https://github.com/pytorch/ao/issues/2919) for version requirements for dependencies.
@@ -123,57 +108,64 @@ TorchAO is integrated into some of the leading open-source libraries including:

 * HuggingFace transformers with a [builtin inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao) and [low bit optimizers](https://github.com/huggingface/transformers/pull/31865)
 * HuggingFace diffusers best practices with `torch.compile` and TorchAO in a standalone repo [diffusers-torchao](https://github.com/huggingface/diffusers/blob/main/docs/source/en/quantization/torchao.md)
+* vLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
+* Integration with [FBGEMM](https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/gen_ai) for SOTA kernels on server GPUs
+* Integration with [ExecuTorch](https://github.com/pytorch/executorch/) for edge device deployment
+* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
+* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
 * HuggingFace PEFT for LoRA using TorchAO as their [quantization backend](https://huggingface.co/docs/peft/en/developer_guides/quantization#torchao-pytorch-architecture-optimization)
-* Mobius HQQ backend leveraged our int4 kernels to get [195 tok/s on a 4090](https://github.com/mobiusml/hqq#faster-inference)
 * TorchTune for our NF4 [QLoRA](https://docs.pytorch.org/torchtune/main/tutorials/qlora_finetune.html), [QAT](https://docs.pytorch.org/torchtune/main/recipes/qat_distributed.html), and [float8 quantized fine-tuning](https://github.com/pytorch/torchtune/pull/2546) recipes
-* TorchTitan for [float8 pre-training](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* VLLM for LLM serving: [usage](https://docs.vllm.ai/en/latest/features/quantization/torchao.html), [detailed docs](https://docs.pytorch.org/ao/main/torchao_vllm_integration.html)
-* SGLang for LLM serving: [usage](https://docs.sglang.ai/backend/server_arguments.html#server-arguments) and the major [PR](https://github.com/sgl-project/sglang/pull/1341).
-* Axolotl for [QAT](https://docs.axolotl.ai/docs/qat.html) and [PTQ](https://docs.axolotl.ai/docs/quantize.html)
-
+* SGLang for LLM serving: [usage](https://docs.sglang.ai/advanced_features/quantization.html#online-quantization)

 ## 🔎 Inference

 TorchAO delivers substantial performance gains with minimal code changes:

-- **Int4 weight-only**: [1.89x throughput with 58.1% less memory](torchao/quantization/README.md) on Llama-3-8B
-- **Float8 dynamic quantization**: [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int4 weight-only**: [1.73x speedup with 65% less memory](https://huggingface.co/pytorch/gemma-3-12b-it-INT4) for Gemma3-12b-it on H100 with slight impact on accuracy
+- **Float8 dynamic quantization**: [1.5-1.6x speedup on gemma-3-27b-it](https://huggingface.co/pytorch/gemma-3-27b-it-FP8/blob/main/README.md#results-h100-machine) and [1.54x and 1.27x speedup on Flux.1-Dev* and CogVideoX-5b respectively](https://github.com/sayakpaul/diffusers-torchao) on H100 with preserved quality
+- **Int8 activation quantization and int4 weight quantization**: Quantized Qwen3-4B running at 14.8 tokens/s with 3379 MB memory usage on iPhone 15 Pro through [ExecuTorch](https://huggingface.co/pytorch/Qwen3-4B-INT8-INT4#running-in-a-mobile-app)
 - **Int4 + 2:4 Sparsity**: [2.37x throughput with 67.7% memory reduction](torchao/sparsity/README.md) on Llama-3-8B

-Quantize any model with `nn.Linear` layers in just one line (Option 1), or load the quantized model directly from HuggingFace using our integration with HuggingFace transformers (Option 2):
-
-#### Option 1: Direct TorchAO API
-
-```python
-from torchao.quantization.quant_api import quantize_, Int4WeightOnlyConfig
-quantize_(model, Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
-```
-
-#### Option 2: HuggingFace Integration
-
+The following is our recommended flow for quantization and deployment:
 ```python
 from transformers import TorchAoConfig, AutoModelForCausalLM
-from torchao.quantization.quant_api import Int4WeightOnlyConfig
+from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

 # Create quantization configuration
-quantization_config = TorchAoConfig(quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True, version=1))
+quantization_config = TorchAoConfig(quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

 # Load and automatically quantize
 quantized_model = AutoModelForCausalLM.from_pretrained(
-    "microsoft/Phi-4-mini-instruct",
+    "Qwen/Qwen3-32B",
     dtype="auto",
     device_map="auto",
     quantization_config=quantization_config
 )
 ```
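As a usage illustration for the checkpoint loaded above, a short generation sketch; the tokenizer pairing, prompt, and generation settings here are assumptions rather than part of the README:

```python
# Illustrative usage of the quantized model loaded above (assumed Qwen3-32B tokenizer pairing).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(quantized_model.device)

# Generate and print only the newly produced tokens.
output = quantized_model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```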

-#### Deploy quantized models in vLLM with one command:
+An alternative quantization API, for cases where the flow above doesn't work, is the `quantize_` API described in the [quick start guide](https://docs.pytorch.org/ao/main/quick_start.html).
+
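As a sketch of that alternative path (the `model` here is assumed to be any already-loaded `nn.Module`, for example a transformers model):

```python
# Hedged sketch: apply the same float8 dynamic quantization directly with quantize_,
# instead of going through TorchAoConfig. `model` is an assumed, already-loaded nn.Module.
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
    quantize_,
)

quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))
```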
+Serving with vLLM on a 1xH100 machine:
+```shell
+# Server
+VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Qwen3-32B-FP8 --tokenizer Qwen/Qwen3-32B -O3
+```

 ```shell
-vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
+# Client
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "pytorch/Qwen3-32B-FP8",
+  "messages": [
+    {"role": "user", "content": "Give me a short introduction to large language models."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "max_tokens": 32768
+}'
 ```

-With this quantization flow, we achieve **67% VRAM reduction and 12-20% speedup** on A100 GPUs while maintaining model quality. For more detail, see this [step-by-step quantization guide](https://huggingface.co/pytorch/Phi-4-mini-instruct-int4wo-hqq#quantization-recipe). We also release some pre-quantized models [here](https://huggingface.co/pytorch).
+We also support deployment to edge devices through ExecuTorch; for more detail, see the [quantization and serving guide](https://docs.pytorch.org/ao/main/serving.html). We also release pre-quantized models [here](https://huggingface.co/pytorch).

 ## 🚅 Training

docs/source/api_ref_quantization.rst

Lines changed: 0 additions & 20 deletions
@@ -14,7 +14,6 @@ Main Quantization APIs
 :nosignatures:

 quantize_
-autoquant

 Inference APIs for quantize\_
 -------------------------------
@@ -27,13 +26,9 @@ Inference APIs for quantize\_
 Float8DynamicActivationInt4WeightConfig
 Float8DynamicActivationFloat8WeightConfig
 Float8WeightOnlyConfig
-Float8StaticActivationFloat8WeightConfig
 Int8DynamicActivationInt4WeightConfig
-GemliteUIntXWeightOnlyConfig
 Int8WeightOnlyConfig
 Int8DynamicActivationInt8WeightConfig
-UIntXWeightOnlyConfig
-FPXWeightOnlyConfig

 .. currentmodule:: torchao.quantization

@@ -51,19 +46,4 @@ Quantization Primitives
 safe_int_mm
 int_scaled_matmul
 MappingType
-ZeroPointDomain
 TorchAODType
-
-..
-TODO: delete these?
-
-Other
------
-
-.. autosummary::
-:toctree: generated/
-:nosignatures:
-
-to_linear_activation_quantized
-swap_linear_with_smooth_fq_linear
-smooth_fq_linear_to_inference

docs/source/quick_start.rst

Lines changed: 2 additions & 15 deletions
@@ -2,20 +2,8 @@ Quick Start Guide
 -----------------

 In this quick start guide, we will explore how to perform basic quantization using torchao.
-First, install the latest stable torchao release::
-
-    pip install torchao
-
-If you prefer to use the nightly release, you can install torchao using the following
-command instead::
-
-    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
-
-torchao is compatible with the latest 3 major versions of PyTorch, which you will also
-need to install (`detailed instructions <https://pytorch.org/get-started/locally/>`__)::
-
-    pip install torch

+Follow the `torchao installation and compatibility guide <https://github.com/pytorch/ao#-installation>`__ to install torchao and a compatible PyTorch.

 First Quantization Example
 ==========================
@@ -55,9 +43,8 @@ for efficient mixed dtype matrix multiplication:

 .. code:: py

-    # torch 2.4+ only
     from torchao.quantization import Int4WeightOnlyConfig, quantize_
-    quantize_(model, Int4WeightOnlyConfig(group_size=32, version=1))
+    quantize_(model, Int4WeightOnlyConfig(group_size=32, int4_packing_format="tile_packed_to_4d", int4_choose_qparams_algorithm="hqq"))

 The quantized model is now ready to use! Note that the quantization
 logic is inserted through tensor subclasses, so there is no change
docs/source/serving.rst

Lines changed: 17 additions & 17 deletions
@@ -15,7 +15,7 @@ Post-training Quantization with HuggingFace
 -------------------------------------------

 HuggingFace Transformers provides seamless integration with torchao quantization. The ``TorchAoConfig`` automatically applies torchao's optimized quantization algorithms during model loading.
-Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples on how to use quantization and sparsity in Transformers and Diffusers.
+Please check out our `HF Integration Docs <torchao_hf_integration.html>`_ for examples of how to use quantization and sparsity in Transformers and Diffusers, and the `TorchAOConfig Reference <api_ref_quantization.html#inference-apis-for-quantize>`_ for all available torchao configs.

 Serving and Inference
 --------------------
@@ -29,19 +29,19 @@ First, install vLLM with torchao support:

 .. code-block:: bash

-    pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
+    pip install vllm --pre --extra-index-url https://download.pytorch.org/whl/nightly/vllm/
+    pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu128

 To serve in vLLM, we're using the model we quantized and pushed to the Hugging Face hub in the previous step, :ref:`Post-training Quantization with HuggingFace`.

 .. code-block:: bash

     # Server
-    vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+    vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3

     # Client
     curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
-      "model": "pytorch/Phi-4-mini-instruct-float8dq",
+      "model": "pytorch/Phi-4-mini-instruct-FP8",
       "messages": [
         {"role": "user", "content": "Give me a short introduction to large language models."}
       ],
@@ -271,8 +271,8 @@ Evaluate quantized models using lm-evaluation-harness:
     # Evaluate baseline model
     lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8

-    # Evaluate torchao-quantized model (float8dq)
-    lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+    # Evaluate torchao-quantized model (FP8)
+    lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-FP8 --tasks hellaswag --device cuda:0 --batch_size 8

 Memory Benchmarking
 ^^^^^^^^^^^^^^^^^
@@ -283,8 +283,8 @@ For Phi-4-mini-instruct, when quantized with float8 dynamic quant, we can reduce
     import torch
     from transformers import AutoModelForCausalLM, AutoTokenizer

-    # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-float8dq"
-    model_id = "pytorch/Phi-4-mini-instruct-float8dq"
+    # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-FP8"
+    model_id = "pytorch/Phi-4-mini-instruct-FP8"
     quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16)
     tokenizer = AutoTokenizer.from_pretrained(model_id)

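The hunk above only shows the model-loading portion of the memory benchmark; as an illustrative continuation (not the file's actual code, and assuming a CUDA device), peak memory after one generation could be read like this:

```python
# Illustrative continuation (assumed, not taken from serving.rst): run one generation
# with the quantized model loaded above and report the CUDA peak-memory counter.
torch.cuda.reset_peak_memory_stats()

prompt = "Give me a short introduction to large language models."
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device)

with torch.no_grad():
    quantized_model.generate(**inputs, max_new_tokens=64)

print(f"Peak Memory Usage: {torch.cuda.max_memory_reserved() / 1024**3:.2f} GB")
```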
@@ -328,7 +328,7 @@ Output:
     Peak Memory Usage: 5.70 GB

 +-------------------+---------------------+------------------------------+
-| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-float8dq |
+| Benchmark         | Phi-4 mini-instruct | Phi-4-mini-instruct-FP8      |
 +===================+=====================+==============================+
 | Peak Memory (GB)  | 8.91                | 5.70 (36% reduction)         |
 +-------------------+---------------------+------------------------------+
@@ -342,10 +342,10 @@ Latency Benchmarking
 .. code-block:: bash

     # baseline
-    python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
+    vllm bench latency --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1

-    # float8dq
-    VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-float8dq --batch-size 1
+    # FP8
+    VLLM_DISABLE_COMPILE_CACHE=1 vllm bench latency --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-FP8 --batch-size 1

 Serving Benchmarking
 """""""""""""""""""""
@@ -372,13 +372,13 @@ We benchmarked the throughput in a serving environment.
     # Server:
     vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
     # Client:
-    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
+    vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1

-    # For float8dq
+    # For FP8
     # Server:
-    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-float8dq --tokenizer microsoft/Phi-4-mini-instruct -O3
+    VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-FP8 --tokenizer microsoft/Phi-4-mini-instruct -O3
     # Client:
-    python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
+    vllm bench serve --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-FP8 --num-prompts 1

 Results (H100 machine)
 """""""""""""""""""""
