
Commit f3a8e78

[Kimi K2 Thinking] Add model_free_ptq example (#2021)
SUMMARY: Add a short README and a simple FP8 Block example with Kimi K2 Thinking
1 parent 66ba262 commit f3a8e78

2 files changed: +37 −0
examples/model_free_ptq/README.md (+15 −0)
# Quantizing models without a model definition
`model_free_ptq` provides a PTQ pathway for data-free schemes (such as FP8 Dynamic Per Token or FP8 Block). Specifically, this pathway removes the requirement for a model definition or the need to load the model through transformers. If you are interested in applying a data-free scheme, there are two key scenarios in which this pathway may make sense for your model:
1. The model does not have a model definition available through transformers. This may be the case for a brand new model which has not landed in transformers.
2. The model is very large (such as Kimi K2 Thinking) and runs into issues with `oneshot`.
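For instance, a data-free FP8 Dynamic Per Token run might look like the following minimal sketch. The argument names mirror the Kimi K2 Thinking example later in this commit; the model stub and output directory are placeholders:

```python
from llmcompressor import model_free_ptq

# Minimal sketch: quantize a checkpoint without loading a model definition.
# "org/some-model" and the save directory are placeholders, not real artifacts.
model_free_ptq(
    model_stub="org/some-model",             # Hugging Face stub or local checkpoint path
    save_directory="some-model-FP8-Dynamic", # where the compressed checkpoint is written
    scheme="FP8_DYNAMIC",                    # data-free scheme; FP8_BLOCK also works
    ignore=["lm_head"],                      # layers to leave unquantized
)
```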
`model_free_ptq` works directly on the safetensors files in the checkpoint, applying observers to each weight tensor, thereby removing the requirement for a model definition or transformers.
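To give a sense of what the observers compute, the rough sketch below (illustrative only, not the library's implementation) loads a single weight tensor straight from a safetensors shard and derives per-128x128-block absmax scales for FP8; the shard filename and tensor key are hypothetical:

```python
import torch
from safetensors.torch import load_file

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# Hypothetical shard and tensor key; real checkpoints span many shards.
weights = load_file("model-00001-of-00062.safetensors")
w = weights["model.layers.0.mlp.down_proj.weight"].float()

# One scale per 128x128 block, as in an FP8 Block scheme with block_size 128.
block = 128
rows, cols = w.shape[0] // block, w.shape[1] // block
blocks = w.reshape(rows, block, cols, block)
scales = blocks.abs().amax(dim=(1, 3)) / FP8_MAX  # shape: (rows, cols)
```

Layers whose dimensions are not divisible by the block size cannot be handled this way, which is why some layers are ignored in the Kimi K2 Thinking example below.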
# Quantizing Kimi K2 Thinking to FP8 Block
In `kimi_k2_thinking_fp8_block.py`, we call `model_free_ptq` by providing a `scheme` and an `ignore` list, similar to how we provide recipes to `oneshot` calls. In the case of Kimi K2 Thinking, we apply the `FP8_BLOCK` scheme and ignore layers that are incompatible with a block_size of 128 (specifically, `kv_a_proj_with_mqa` and `q_a_proj`).
In contrast to `oneshot`, the model stub or path string is passed in directly, rather than the model first being loaded through transformers. Once complete, the model is compressed using compressed-tensors and saved to `SAVE_DIR`.

examples/model_free_ptq/kimi_k2_thinking_fp8_block.py (+22 −0)
from llmcompressor import model_free_ptq

MODEL_ID = "unsloth/Kimi-K2-Thinking-BF16"
SAVE_DIR = "Kimi-K2-Thinking-FP8-Block"

# Apply FP8-Block to the model
# Once quantized, the model is saved
# using compressed-tensors to the SAVE_DIR.
model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        # incompatible with a block_size of 128
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
