@ArthurZucker ArthurZucker commented Oct 14, 2025

CORE REFACTORING, loading, converting, logging

More helpful debugging report when loading weights
[screenshot of the load report]

If you just want to fuse qkv:
[screenshot of the qkv fusion example]

It can. You just need to make sure you change the model code, and poof!

            WeightConverter(
                ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
                "self_attn.qkv_proj",
                operations=[Concatenate(dim=0)],  # more like stack?
            ),
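
For intuition, here is roughly what Concatenate(dim=0) amounts to for this mapping, written in plain torch with assumed, illustrative shapes (a sketch, not the library implementation):

import torch

# Illustrative shapes only (assumed, not taken from a specific model)
hidden_size, kv_size = 4096, 1024

q = torch.randn(hidden_size, hidden_size)  # self_attn.q_proj.weight
k = torch.randn(kv_size, hidden_size)      # self_attn.k_proj.weight (GQA: fewer kv heads)
v = torch.randn(kv_size, hidden_size)      # self_attn.v_proj.weight

# Concatenate(dim=0): fuse the three projections along the output dimension
qkv = torch.cat([q, k, v], dim=0)          # self_attn.qkv_proj.weight
assert qkv.shape == (hidden_size + 2 * kv_size, hidden_size)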

For DeepSeek we will embed the RoPE permute:

            WeightConverter(
                ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],
                operations=[RopePermute()],
            ),
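
RopePermute itself is not shown in this PR description. As a hedge, the sketch below uses the permutation familiar from Llama-style conversion scripts (interleaved rotary pairs reordered into the half-split layout the Hugging Face attention code expects), which is the kind of transform such an operation would embed; the actual implementation may differ:

import torch

def rope_permute(w: torch.Tensor, n_heads: int, dim1: int, dim2: int) -> torch.Tensor:
    # Reorder rotary dimensions from interleaved pairs to the half-split layout
    return w.view(n_heads, dim1 // n_heads // 2, 2, dim2).transpose(1, 2).reshape(dim1, dim2)

# Assumed shapes: 32 heads, hidden size 4096
q_proj = torch.randn(4096, 4096)
q_proj_permuted = rope_permute(q_proj, n_heads=32, dim1=4096, dim2=4096)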

WeightConverter API:

The API allows you to define a mapping using WeightConverter. You can define many-to-one source/target keys, quantization operations, and distributed operations along with normal operations. For now MergeModulelist and Concatenate; the RopePermute one will be added soon.

_checkpoint_conversion_mapping = {
    "mixtral": [
        WeightConverter(
            source_keys=[
                "mlp.experts.*.w1.weight",
                "mlp.experts.*.w3.weight",
            ],
            target_keys="mlp.experts.gate_up_proj",
            operations=[MergeModulelist(dim=0), Concatenate(dim=1)],
        ),
        WeightConverter(
            source_keys=["mlp.experts.*.w2.weight"],
            target_keys="mlp.experts.down_proj",
            operations=[MergeModulelist(dim=0)],
        ),
    ],
}
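
Interpreting the listed operations literally, the first converter amounts to something like the following (a sketch with assumed Mixtral-8x7B shapes, not the library code):

import torch

num_experts, d_ff, d_model = 8, 14336, 4096
w1 = [torch.randn(d_ff, d_model) for _ in range(num_experts)]  # mlp.experts.*.w1.weight (gate)
w3 = [torch.randn(d_ff, d_model) for _ in range(num_experts)]  # mlp.experts.*.w3.weight (up)

# MergeModulelist(dim=0): stack the per-expert weights into one 3D tensor
merged_w1 = torch.stack(w1, dim=0)  # [num_experts, d_ff, d_model]
merged_w3 = torch.stack(w3, dim=0)  # [num_experts, d_ff, d_model]

# Concatenate(dim=1): fuse gate (w1) and up (w3) into a single gate_up_proj
gate_up_proj = torch.cat([merged_w1, merged_w3], dim=1)  # [num_experts, 2 * d_ff, d_model]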

We used to have this:

https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L4545-L4568

But now it's just explicit:

        "legacy": [
            WeightConverter(
                source_keys="LayerNorm.gamma",
                target_keys="LayerNorm.weight",
            ),
            WeightConverter(
                source_keys="LayerNorm.beta",
                target_keys="LayerNorm.bias",
            ),
        ],
    }
    if hasattr(torch.nn.utils.parametrizations, "weight_norm"):
        mapping["legacy"] += [
            WeightConverter(
                source_keys="weight_g",
                target_keys="parametrizations.weight.original0",
            ),
            WeightConverter(
                source_keys="weight_v",
                target_keys="parametrizations.weight.original1",
            ),
        ]
    else:
        mapping["legacy"] += [
            WeightConverter(
                source_keys="parametrizations.weight.original0",
                target_keys="weight_g",
            ),
            WeightConverter(
                source_keys="parametrizations.weight.original1",
                target_keys="weight_v",
            ),
        ]

and it's faster because we don't iterate over the whole checkpoint

The core logic is:
Iterate over all of the dict keys:

  1. Collect the keys that match the glob patterns from all source keys; patterns from the same weight converter are piped into one alternation, e.g. (mlp.experts.*.gate_proj.weight|mlp.experts.*.up_proj.weight), and the matches are gathered into a dict keyed by target key (see the sketch after this list).

This produces:

{ 
"mlp.experts.gate_up_proj" : 
    {"mlp.experts.*.w1.weight":
        { "mlp.experts.0.w1.weight": [t0, t1, t2, etc], "mlp.experts.1.w1.weight": [t0, t1, t2, etc]},
     "mlp.experts.*.w3.weight":
        { "mlp.experts.0.w3.weight": [t0, t1, t2, etc], "mlp.experts.1.w3.weight": [t0, t1, t2, etc]},
    }
  ....
}

We need to keep track of which layers were collected, and from which source pattern.

1bis. Schedule tensor materialization without blocking on the GIL, as this takes the most time. We distribute the tensors at this stage, before any operations. This is the trickiest part. We do it during collection so no time is wasted.

  2. We collect the results of materialization, and we apply the operations on all the collected values (at this point { "mlp.experts.0.w1.weight": [t0, t1, t2, etc], "mlp.experts.1.w1.weight": [t0, t1, t2, etc]}.values() gives a list of lists).
  3. We create a dict with the target_key and the output values. We pass this to the quantizer.
  4. We quantize the input tensors, outputting the final dict.
  5. We set the param into the model.
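
A minimal sketch of steps 1 and 1bis, illustrative only and not the transformers implementation; get_tensor is a hypothetical callable that reads one tensor from the checkpoint, and for simplicity each key maps to a single tensor here rather than the per-key list of collected values shown above:

import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def glob_to_regex(pattern: str) -> re.Pattern:
    # "mlp.experts.*.w1.weight" matches e.g. "model.layers.3.mlp.experts.7.w1.weight"
    return re.compile(re.escape(pattern).replace(r"\*", r"[^.]+") + r"$")

def collect_and_materialize(checkpoint_keys, converters, get_tensor, max_workers=8):
    # converters: iterable of (source_patterns, target_key) pairs
    compiled = [
        (target, [(src, glob_to_regex(src)) for src in sources])
        for sources, target in converters
    ]
    collected = defaultdict(lambda: defaultdict(dict))
    futures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Single pass over the checkpoint keys: bucket by (target_key, source_pattern)
        # and schedule materialization right away so reads overlap with collection
        for key in checkpoint_keys:
            for target, patterns in compiled:
                for src, rx in patterns:
                    if rx.search(key):
                        futures[(target, src, key)] = pool.submit(get_tensor, key)
        for (target, src, key), fut in futures.items():
            collected[target][src][key] = fut.result()
    return collected  # {target_key: {source_pattern: {checkpoint_key: tensor}}}

The remaining steps then run each converter's operations over collected[target_key].values(), hand the result to the quantizer, and set the resulting params on the model.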

Keys are handled a lot better!

Enable MoE quantization for FP8

This script does not work on main

import torch
from transformers import MixtralForCausalLM, AutoTokenizer, FineGrainedFP8Config
import time 
quantization_config = FineGrainedFP8Config(modules_to_not_convert=["model.layers.*.mlp.gate"])
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", quantization_config=quantization_config, tp_plan="auto")

Enable TP + MoE without OOM

This script does not work on main

model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", tp_plan="auto")

Enable device_map="auto" + MoE + FP8

This script does not work on main

quantization_config = FineGrainedFP8Config(modules_to_not_convert=["model.layers.*.mlp.gate"])
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", quantization_config=quantization_config, device_map="auto")

Refactor the way we load weights: faster, more flexible, and better overall

Uses staging buffers per conversion op

  • 4x speedup with device_map="auto"
  • Full MoE quantization with FP8

TODOS:

  • Test with TP / EP
  • Add TQDM!
  • Test with deepspeed
  • Test with loras and peft
  • Test with vllm backend
  • Test with fsdp
  • Add saving

Script:

import torch
from torch import nn
from transformers import MixtralForCausalLM, AutoTokenizer

import time 
start = time.time()
model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", device_map="auto")
end = time.time() 
print("loading took ", end-start)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
inputs = tokenizer("hey how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.batch_decode(out))
loading took  14.271092891693115
['<s> hey how are you?\n\nI am a 20 year old male and I have been having']

⬆️ is with: merge modulelist, concat gate_up
⬇️ is naive loading.

loading took  54.271092891693115
['<s> hey how are you?\n\nI am a 20 year old male and I have been having']

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@LysandreJik LysandreJik left a comment


Impressive effort

@github-actions

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, albert, align

@ArthurZucker ArthurZucker merged commit 6f6095e into main Nov 13, 2025
21 of 24 checks passed
@ArthurZucker ArthurZucker deleted the refactor-weight-loading branch November 13, 2025 16:12
@xenova

xenova commented Nov 14, 2025

Very cool! 🔥 After pulling latest changes from main and trying to load gpt-oss-20b, I get this error:

>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'truncate'}
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'truncate'}
Loading weights: 100%|████████████████████████████████████████████████████████████████████████████████████████| 459/459 [00:00<00:00, 574.18it/s, Materializing param=lm_head.weight]
GptOssForCausalLM LOAD REPORT from: openai/gpt-oss-20b
Key                                                   | Status     | 
------------------------------------------------------+------------+-
model.layers.{0...23}.mlp.experts.down_proj_blocks    | UNEXPECTED | 
model.layers.{0...23}.mlp.experts.gate_up_proj_scales | UNEXPECTED | 
model.layers.{0...23}.mlp.experts.gate_up_proj_blocks | UNEXPECTED | 
model.layers.{0...23}.mlp.experts.down_proj_scales    | UNEXPECTED | 
model.layers.{0...23}.mlp.experts.gate_up_proj        | MISSING    | 
model.layers.{0...23}.mlp.experts.down_proj           | MISSING    | 

Notes:
- UNEXPECTED    :can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING       :those params were newly initialized because missing form the checkpoint. Consider training on your downstream task.

Just flagging as it seems to break backwards compatibility. I can also confirm that checking out the 2nd last commit (i.e., without this change) does not result in the error.

@ArthurZucker

It won't break, @MekkCyber and @SunMarc are working on MXFp4 support!

@MekkCyber

MekkCyber commented Nov 14, 2025

Yes @xenova we are taking care of that here: #42070, we just need to fix some issues and it will be good to go

Comment on lines -300 to -301
for _ in range(config.num_experts):
self.append(Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size))

@fxmarty-amd fxmarty-amd Nov 17, 2025


This change is not straightforward and breaks downstream libraries expecting Qwen2MoeExperts experts to be nn.Linear. Is there an easy workaround?

Comment on lines -220 to -221
for _ in range(self.num_experts):
self.append(Qwen3MoeMLP(config, intermediate_size=config.moe_intermediate_size))

same comment

@fxmarty-amd

[screenshot]

🫠

@fxmarty-amd

Just for my understanding - is this expected to land in 4.58?
