
Commit f3a8e78

[Kimi K2 Thinking] Add model_free_ptq example (#2021)
SUMMARY: Add a short README and a simple FP8 Block example with Kimi K2 Thinking
1 parent 66ba262 commit f3a8e78

2 files changed: +37 −0
examples/model_free_ptq/README.md (+15 −0)
# Quantizing models without a model definition
`model_free_ptq` provides a PTQ pathway for data-free schemes (such as FP8 Dynamic Per Token or FP8 Block). Specifically, this pathway removes the requirement for a model definition or the need to load the model through transformers. If you are interested in applying a data-free scheme, there are two key scenarios in which this pathway may make sense for your model:
1. The model does not have a model definition available through transformers. This may be the case for a brand new model which has not landed in transformers.
2. The model is very large (such as Kimi K2 Thinking) and runs into issues with `oneshot`.
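For instance, a data-free FP8 Dynamic Per Token run might look like the following minimal sketch. The argument names mirror the Kimi K2 Thinking example later in this commit; the model stub and output directory are placeholders:

```python
from llmcompressor import model_free_ptq

# Minimal sketch: quantize a checkpoint without loading a model definition.
# "org/some-model" and the save directory are placeholders, not real artifacts.
model_free_ptq(
    model_stub="org/some-model",             # Hugging Face stub or local checkpoint path
    save_directory="some-model-FP8-Dynamic", # where the compressed checkpoint is written
    scheme="FP8_DYNAMIC",                    # data-free scheme; FP8_BLOCK also works
    ignore=["lm_head"],                      # layers to leave unquantized
)
```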
`model_free_ptq` works directly on the safetensors files in the checkpoint, applying observers to each weight tensor, thereby removing the requirement for a model definition or transformers.
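To give a sense of what the observers compute, the rough sketch below (illustrative only, not the library's implementation) loads a single weight tensor straight from a safetensors shard and derives per-128x128-block absmax scales for FP8; the shard filename and tensor key are hypothetical:

```python
import torch
from safetensors.torch import load_file

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# Hypothetical shard and tensor key; real checkpoints span many shards.
weights = load_file("model-00001-of-00062.safetensors")
w = weights["model.layers.0.mlp.down_proj.weight"].float()

# One scale per 128x128 block, as in an FP8 Block scheme with block_size 128.
block = 128
rows, cols = w.shape[0] // block, w.shape[1] // block
blocks = w.reshape(rows, block, cols, block)
scales = blocks.abs().amax(dim=(1, 3)) / FP8_MAX  # shape: (rows, cols)
```

Layers whose dimensions are not divisible by the block size cannot be handled this way, which is why some layers are ignored in the Kimi K2 Thinking example below.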
# Quantizing Kimi K2 Thinking to FP8 Block
In `kimi_k2_thinking_fp8_block.py`, we call `model_free_ptq` by providing a `scheme` and an `ignore` list, similar to how we provide recipes to `oneshot` calls. In the case of Kimi K2 Thinking, we apply the `FP8_BLOCK` scheme and ignore layers that are incompatible with a block_size of 128 (specifically, `kv_a_proj_with_mqa` and `q_a_proj`).
In contrast to `oneshot`, the model stub or path string is passed in directly, rather than the model first being loaded through transformers. Once complete, the model is compressed using compressed-tensors and saved to `SAVE_DIR`.

examples/model_free_ptq/kimi_k2_thinking_fp8_block.py (+22 −0)
from llmcompressor import model_free_ptq

MODEL_ID = "unsloth/Kimi-K2-Thinking-BF16"
SAVE_DIR = "Kimi-K2-Thinking-FP8-Block"

# Apply FP8-Block to the model
# Once quantized, the model is saved
# using compressed-tensors to the SAVE_DIR.
model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        # incompatible with a block_size of 128
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
