
Conversation

@zucchini-nlp
Member

@zucchini-nlp zucchini-nlp commented Nov 7, 2025

What does this PR do?

Fixes #41863 and fixes #40910

We have always had an imperfect way to infer whether we're in the prefill or the decoding stage, which has caused us many bugs in the past. The most reliable way is to check the cache position values, but that is not compile-compatible and also has an edge case.

Recently Manuel merged a PR that splits prefill into its own function, so we can now benefit from it and know with 100% certainty which stage we're in. This PR adds an is_first_iteration flag to generation input preparation and replaces the existing logic with the flag.

Note that in some models we have to keep checking whether cache_position[0] == 0, because the first iteration is not necessarily the first tokens overall. We might get a cached system prompt, and we don't want to call some methods a second time (e.g. Qwen mRoPE).

It also adds a test case for the issue linked above.
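To make the intended usage concrete, here is a minimal, self-contained sketch (not the actual transformers implementation; the function and its defaults are illustrative) of how a multimodal model's input preparation might consume the new flag while still checking cache_position[0] == 0 for work that must only run when the sequence truly starts:

import torch

def prepare_inputs_for_generation(input_ids, cache_position, pixel_values=None,
                                  is_first_iteration=False, use_cache=True, **kwargs):
    # Illustrative sketch only: roughly how a multimodal model could branch on the flag.
    model_inputs = {"cache_position": cache_position, "use_cache": use_cache}

    # With a cache, feed only the positions that are not cached yet; on a fresh
    # prefill, cache_position covers the whole prompt, so this is a no-op there.
    if use_cache:
        input_ids = input_ids[:, cache_position]
    model_inputs["input_ids"] = input_ids

    # Multimodal inputs are only needed while the prompt is being prefilled. The extra
    # cache_position check matters because "first iteration" is not the same as
    # "first tokens ever" when generation continues from a cached prompt.
    if is_first_iteration and cache_position[0] == 0:
        model_inputs["pixel_values"] = pixel_values

    return model_inputs

The real per-model implementations differ; the point is only that the flag and the cache position answer two different questions.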

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp
Member Author

Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

@manueldeprada
Contributor

Another worm of cans, assisted decoding has no prefill separated out and is causing issues now 😢

worm of cans?? 🤣 haha love it

Sooo this already arose on my PR. The main gist is that assisted generate does not prefill with the prompt tokens, but waits for the first batch of candidates and then prefills. Thus, we could not apply the standard prefill. But surely assisted_gen can pass the prefill flag on the first call, or we can also maybe call _prefill with the first batch of candidates.

@zucchini-nlp
Member Author

assisted_gen can pass the prefill flag on the first call

yeah, this seemed to be the easiest option. The only issue with VLMs is that we should not pass certain inputs (pixel values, etc.) after the prefill phase. But with the assistant model calling generate() many times internally, we end up with several "prefill" phases

@zucchini-nlp
Member Author

Support for continue_generation_from_past_cache is making this hard 😭 The concepts of is_prefill and is_first_iteration can differ in some models and require different input preparation.

I don't want us to multiply the number of input args for prepare_inputs_for_generation, so I might leave cache_position[0] == 0 as the ultimate source of truth for prefill, and pass only is_first_iteration as an argument.
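As a toy illustration of why the two concepts can disagree (plain PyTorch, no transformers needed; the numbers are made up): when a generation call continues from a cache that already holds tokens, its first iteration does not start at position 0.

import torch

cached_len = 12      # e.g. a system prompt already sitting in the cache
new_prompt_len = 5   # freshly appended user tokens

# cache_position for the first iteration of the new generation call
cache_position = torch.arange(cached_len, cached_len + new_prompt_len)

is_first_iteration = True                              # first step of this generation loop
is_prefill_from_scratch = bool(cache_position[0] == 0) # False here

print(is_first_iteration, is_prefill_from_scratch)     # True False -> the two flags diverge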

@zucchini-nlp zucchini-nlp changed the title from "[WIP] Prefill-related logic in input preparation for generation" to "Prefill-related logic in input preparation for generation" on Nov 19, 2025
@zucchini-nlp
Member Author

@bot /style

@github-actions
Contributor

github-actions bot commented Nov 19, 2025

Style fix runs successfully without any file modified.

Comment on lines 318 to 328
if is_first_iteration is None:
    generation_args = self.assistant_model._get_initial_cache_position(
        input_ids.shape[1], input_ids.device, self.assistant_kwargs
    )
    generation_args = self.assistant_model.prepare_inputs_for_generation(
        input_ids, is_first_iteration=True, **generation_args
    )
    generation_args[self.input_ids_key] = input_ids
    for model_input_name in ["position_ids", "token_type_ids", "decoder_position_ids"]:
        generation_args.pop(model_input_name, None)
else:
Member Author

This is needed for specific models that prepare inputs differently depending on first vs subsequent iterations. For example, in multimodal models we pass multimodal data only in the first iteration and then rely on cached inputs.

Assisted generation, however, calls generate() internally many times and would technically trigger first_iteration many times. This way we can call prefill only once per assistant model.

Contributor

Can we also add this to the comments? This is nice to know. Possibly into the docstring directly; I think the scope is worth covering properly.

Contributor

Seeing this is explained in utils directly, maybe my order of reviewing was just bad then... Can keep it this way

Comment on lines +3363 to +3370
# Assisted generation completes the prefill stage in candidate generator so that
# we don't have several `prefill` calls in one generation loop. Skip `_prefill` for assistants
if not generation_config.is_assistant:
    model_outputs = self._prefill(input_ids, generation_config, model_kwargs)
    prefill_consumed = False
else:
    model_kwargs = self._get_initial_cache_position(input_ids.shape[1], input_ids.device, model_kwargs)
    prefill_consumed = True
Member Author

@zucchini-nlp zucchini-nlp Nov 19, 2025

Same as above: since we already called prefill on the assistant, we should not call it a second time

@zucchini-nlp zucchini-nlp requested a review from vasqu November 19, 2025 12:16
Contributor

@vasqu vasqu left a comment

LGTM overall, the logic is good

I mainly left some smaller comments and things that might've been missed. We should be a bit careful here and run-slow on a few models, e.g. gemma3, mamba2, etc

# It is safe to assume that `length!=1` means we're in pre-fill because compiled
# models currently cannot do assisted decoding
if cache_position[0] == 0 or self.model.rope_deltas is None:
if (cache_position[0] == 0 or not use_cache) or self.model.rope_deltas is None:
Contributor

Can we not simplify here? Same for the other related models

Member Author

I wanted to use is_first_iteration for all models at first, and then realized the concepts of prefill and first iteration can differ.
Specifically, in mRoPE the deltas are computed once from the first prompt and should not be computed again if the user wants to re-use the cache and continue generation from where it left off.

This caused me so much headache tbh 🙃

Contributor

Makes sense, mRoPE strikes again 😢 can't wait until we have a standardized way here outside of modeling.py
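For readers skimming the thread, a loose, self-contained sketch of the caching pattern described above (names and shapes are illustrative, not the actual Qwen mRoPE code): the deltas derived from the first multimodal prompt are stored on the model and reused, which is why the guard cannot be replaced by is_first_iteration alone.

import torch

class MRopeDeltaHolder:
    # Illustrative stand-in for a model that caches mRoPE deltas across calls.
    def __init__(self):
        self.rope_deltas = None  # mirrors `self.model.rope_deltas` in the diff above

    def get_position_ids(self, input_ids, cache_position, use_cache=True):
        # Recompute only at the true start of a sequence, when caching is off, or when
        # the deltas were never computed -- the same shape as the guard
        # `(cache_position[0] == 0 or not use_cache) or rope_deltas is None` quoted above.
        if cache_position[0] == 0 or not use_cache or self.rope_deltas is None:
            # Stand-in for the real delta computation over the full multimodal prompt.
            self.rope_deltas = torch.zeros(input_ids.shape[0], 1, dtype=torch.long)
        return cache_position.unsqueeze(0) + self.rope_deltas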

Comment on lines 499 to 527
expected_slice = torch.tensor([-0.8805, -0.8803, -0.8799], device=torch_device)
expected_slice = torch.tensor([-0.8433, -0.8432, -0.8429], device=torch_device)
Contributor

Was this failing before?

Member Author

Nope, my small refactor changed the logits slightly due to a difference in how caching is done. I will trigger slow tests and check how big a diff I caused.

Member Author

OK, I see why there is a difference. Previously we would always compute attention over the image embeddings (around 900 tokens), even though they were cached. After this PR, caching works as expected and we feed a single token at each decoding step.

IMO the current version is more correct, and it's expected that caching results in tiny numerical differences that can add up.
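To picture the difference described above, a toy shape check (the numbers are illustrative; this is not the GIT attention code): with a working KV cache, each decoding step runs attention with a single-token query against the cached keys, instead of re-feeding the roughly 900 image-embedding tokens every step.

import torch

batch, n_cached, head_dim = 1, 900, 64
cached_keys = torch.randn(batch, n_cached, head_dim)   # cached image + prompt tokens
query = torch.randn(batch, 1, head_dim)                # one new token per decoding step

scores = query @ cached_keys.transpose(1, 2)           # attention scores of shape (1, 1, 900)
print(scores.shape)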

@zucchini-nlp
Member Author

run-slow: git

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/git"]
quantizations: []

@github-actions
Contributor

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • git:
    tests/models/git/test_modeling_git.py::GitModelIntegrationTest::test_batched_generation
    tests/models/git/test_modeling_git.py::GitModelIntegrationTest::test_inference_image_captioning

@zucchini-nlp zucchini-nlp requested a review from vasqu December 5, 2025 11:38
@zucchini-nlp
Member Author

@vasqu requesting one last review. One tiny thing left to do is to make sure the slow GIT tests pass; some expected values are hardware-dependent.

Otherwise it should be ready: I addressed a few comments and answered the questions above.

@zucchini-nlp
Member Author

run-slow: git

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

This comment contains run-slow, running the specified jobs:

models: ["models/git"]
quantizations: []

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

CI Results

Workflow Run ⚙️

Model CI Report

❌ Failed tests

  • git:
    tests/models/git/test_modeling_git.py::GitModelIntegrationTest::test_batched_generation

@github-actions
Contributor

github-actions bot commented Dec 5, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: aria, aya_vision, bamba, bloom, chameleon, clvp, cohere2_vision, csm, ctrl, deepseek_vl, deepseek_vl_hybrid, emu3, falcon_h1, falcon_mamba, fast_vlm, florence2

Contributor

@vasqu vasqu left a comment

LGTM overall, left smaller comments/nits but I think this looks pretty ready

Trusting you on fixing up git and other CI stuff 👁️ lots of potential for follow-up PRs to clean up more, but this is already big enough as is and solves the biggest issue(s)

# Generate candidates. Run prefill-specific logic in first generation and prepare model kwargs.
# Some models prepare inputs differently depending on first vs subsequent iterations.(e.g. VLMs)
# Assisted generation however calls internally `self.generate()` many times and technically will
# lead to many `ifirst_iteration's`. This way we can call prefill only once per assistant model
Contributor

Suggested change
# lead to many `ifirst_iteration's`. This way we can call prefill only once per assistant model
# lead to many `first_iteration's`. This way we can call prefill only once per assistant model

typo

# lead to many `ifirst_iteration's`. This way we can call prefill only once per assistant model
if is_first_iteration:
    generation_args = self.assistant_model._get_initial_cache_position(
        input_ids.shape[1], input_ids.device, self.assistant_kwargs.copy()
Contributor

Is there a specific reason we copy the kwargs here? Any risk this could be None?

I suspect some inplace ops but just checking

def _generate_candidates(self, generation_args: dict) -> tuple[torch.LongTensor, torch.FloatTensor | None]:
    """Generate candidate sequences using the assistant model."""
    assistant_output = self.assistant_model.generate(**generation_args, **self.assistant_kwargs)
    assistant_output = self.assistant_model.generate(**generation_args)
Contributor

So we now directly write into the generation args instead (from the prep)

attention_mask: torch.LongTensor | None = None,
inputs_embeds: torch.FloatTensor | None = None,
cache_position: torch.LongTensor | None = None,
is_first_iteration: Optional[bool] = False,
Contributor

Suggested change
is_first_iteration: Optional[bool] = False,
is_first_iteration: bool | None = False,

also often forgetting this, but we really should move to the new typing

Contributor

Important model, so just double check with run-slow or similar

Comment on lines -1169 to +1179
pixel_values=kwargs.get("pixel_values"),
is_first_iteration,
Contributor

Should we not keep pixel values here? I just fear that the args order won't match anymore, so is_first_iteration would be used as pixel_values.


model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)

if cache_position is not None and cache_position[0] == 0:
if is_first_iteration or not kwargs.get("use_cache", True):
Contributor

Is this `or` intentional? Not sure if this happened elsewhere

torch.testing.assert_close(transition_scores_sum, outputs.sequences_scores, rtol=1e-3, atol=1e-3)

@slow
def test_generate_inputs_embeds_one_token(self):
Contributor

Nice 🙏
