[Bugfix] Fix mamba2 prefill chunking #23279
Conversation
Code Review
This pull request addresses a bug in the prefill chunking mechanism for Mamba2 kernels, specifically correcting the calculation of dA_cumsum across chunk boundaries. The solution involves passing the complete dA_cumsum tensor and chunk offsets to the state passing kernel, allowing for accurate adjustments based on sequence boundaries. Additionally, the PR introduces a new unit test to validate this fix and enhances the documentation for related functions.
My review identified a critical issue within the bug fix implementation in _state_passing_fwd_kernel. The mask for loading chunk offsets incorrectly excludes the last logical chunk, potentially causing incorrect calculations. I have provided a code suggestion to rectify this.
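A toy model of the off-by-one the review describes, assuming the kernel loads one offset per logical chunk under a lane mask (the offset values and the `masked_load` helper here are hypothetical, standing in for Triton's `tl.load`):

```python
# Toy model (not the actual Triton kernel) of the review's point: a lane
# mask bounded by num_chunks - 1 replaces the last logical chunk's
# offset with the fill value, dropping it from the computation.
chunk_offsets = [0, 5, 6]   # hypothetical per-logical-chunk offsets
num_chunks = len(chunk_offsets)
lanes = range(4)            # padded lanes in a GPU block

def masked_load(values, bound, other=0):
    # mimics tl.load(ptr + lanes, mask=lanes < bound, other=other)
    return [values[i] if i < bound and i < len(values) else other
            for i in lanes]

print(masked_load(chunk_offsets, num_chunks - 1))  # [0, 5, 0, 0] - last chunk lost
print(masked_load(chunk_offsets, num_chunks))      # [0, 5, 6, 0]
```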
tdoublep left a comment:
Thanks for finding and fixing this. Especially appreciate the effort to write better docstrings.
@fabianlim Could you also PTAL at this PR?
```
query_start_loc = [0, 5, 10]
chunk_size = 8
total_seqlens = 10
-> chunk_indices = [0, 1, 0]
```
I'm a bit confused by this docstring. Shouldn't the second logical chunk in this example belong in the first physical chunk? E.g., shouldn't chunk_indices=[0,0,1] ?
Good catch, thanks! Fixed.
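For concreteness, here is a simplified sketch of the mapping the (now corrected) docstring describes; this is an illustration of the idea, not the actual `_query_start_loc_to_chunk_indices_offsets` implementation:

```python
def chunk_indices_offsets(query_start_loc, chunk_size, total_seqlens):
    # a logical chunk starts at every physical chunk boundary and at
    # every sequence start that falls strictly inside a physical chunk
    starts = sorted(set(range(0, total_seqlens, chunk_size))
                    | set(query_start_loc[:-1]))
    chunk_indices = [s // chunk_size for s in starts]  # physical chunk id
    chunk_offsets = [s % chunk_size for s in starts]   # offset inside it
    return chunk_indices, chunk_offsets

# docstring example: the sequence boundary at token 5 splits physical
# chunk 0 into two logical chunks, so chunk_indices = [0, 0, 1]
print(chunk_indices_offsets([0, 5, 10], chunk_size=8, total_seqlens=10))
# -> ([0, 0, 1], [0, 5, 0])
```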
```
     itype,
-    device='cuda'):
+    device='cuda',
+    return_ref=True):
```
Is there a specific reason why we need to add return_ref here? Couldn't we just ignore the output when we don't need it? Or does ssd_minimal_discrete update some of its inputs in-place?
That was my initial approach, but the problem is that ssd_minimal_discrete asserts the max sequence length is a multiple of the mamba chunk (aka block) size:
```
assert X.shape[1] % block_len == 0
```
This is an assumption I wanted to break in the new unit test, but I wanted to reuse the code that generates random inputs given a tuple of sequence lengths. Since I didn't really need the reference (PyTorch) outputs, I decided to add this return_ref flag.
I see.. how about changing return_ref to return_naive_ref and then documenting the behavior of that flag:
- if return_naive_ref=True, we will use the naive implementation ssd_minimal_discrete to compute and return the reference
Sounds good. Done
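A minimal sketch of how the renamed flag could look in the input-generation helper; the function name, tensor shapes, and the cumsum stand-in are all illustrative, with only the multiple-of-block_len restriction of ssd_minimal_discrete taken from the discussion above:

```python
import torch

def generate_inputs(seqlens, n_heads=4, d_head=8, block_len=8,
                    itype=torch.float32, device='cpu',
                    return_naive_ref=True):
    total = sum(seqlens)
    X = torch.randn(1, total, n_heads, d_head, dtype=itype, device=device)
    dt = torch.rand(1, total, n_heads, dtype=itype, device=device)
    if not return_naive_ref:
        # skip the naive reference entirely, so ragged totals
        # (total % block_len != 0) are allowed
        return X, dt, None
    # the naive reference only supports lengths that tile exactly into
    # blocks, mirroring ssd_minimal_discrete's assertion
    assert total % block_len == 0, "naive ref needs total % block_len == 0"
    Y_ref = X.cumsum(dim=1)  # stand-in for the real ssd_minimal_discrete call
    return X, dt, Y_ref

# ragged lengths are fine when the reference is skipped
X, dt, _ = generate_inputs((5, 7), return_naive_ref=False)
```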
```
exhausted[i] = False

@pytest.mark.parametrize("chunk_size", [8, 256])
```
this test looks like it's replicating test_mamba_chunk_scan_cont_batch, what is the key difference?
The key difference is that this test makes sure prefill chunking works as expected without using the PyTorch reference implementation. Instead, it compares the kernel output without prefill chunking to the concatenated outputs with prefill chunking. This is the most straightforward way to verify that prefill chunking works as expected.
Another crucial difference from test_mamba_chunk_scan_cont_batch is that this test covers cases where the sequence length is not a multiple of the mamba chunk size - in other words, cases where a sequence changes in the middle of a mamba chunk. These are the cases that currently fail on main and require the fixes in this PR. These cases are also not supported in the PyTorch implementation (see the other discussion), so they can't easily be added to test_mamba_chunk_scan_cont_batch, which compares kernel results with the reference PyTorch implementation.
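A toy model of that strategy, with a simple exponential-moving-average scan standing in for the mamba2 kernel (illustrative only, not the actual test code):

```python
import torch

def toy_scan(x, a=0.9, initial_state=None):
    # sequential stand-in for the SSM scan: state = a * state + x_t
    state = torch.zeros(x.shape[0]) if initial_state is None else initial_state
    ys = []
    for t in range(x.shape[1]):
        state = a * state + x[:, t]
        ys.append(state.clone())
    return torch.stack(ys, dim=1), state

x = torch.randn(2, 10)
y_full, state_full = toy_scan(x)

# chunked prefill: split at token 5, which need not align with any
# mamba chunk boundary - exactly the misaligned case this PR fixes
y1, s1 = toy_scan(x[:, :5])
y2, s2 = toy_scan(x[:, 5:], initial_state=s1)

torch.testing.assert_close(torch.cat([y1, y2], dim=1), y_full)
torch.testing.assert_close(s2, state_full)
```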
I see, thanks for the explanation. In this case I suggest adding some documentation to explain how test_mamba_chunk_scan_cont_batch_prefill_chunking differs from the previous test, since this test is a little long and it's hard to understand at a quick glance.
makes sense. Done
fabianlim left a comment:
my comments are addressed
LGTM - Thanks!
Purpose

Fix a few bugs with prefill chunking for the mamba2 kernels.

Prefill chunking with varlen batching support for the Mamba2 block was added in #10909, but it contained a few bugs relating to the handling of initial states across mamba chunk boundaries. Some of these bugs were fixed recently in #21783, specifically regarding the decay factor of the initial states. Yet, in cases where chunk boundaries and sequence boundaries don't align (a sequence changes in the middle of a chunk), the state passing kernel with initial states was still buggy. Namely, dA_cumsum was computed from the start of the mamba chunk instead of from the start of the current sequence. This PR fixes this.
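As an illustration of the fix (a sketch of the idea, not the actual Triton kernel): with the full dA_cumsum tensor and the chunk offsets available, the kernel can re-base the cumulative sum at the sequence start instead of decaying the initial state over the whole chunk:

```python
import torch

def initial_state_decay(dA_cs: torch.Tensor, seq_start: int) -> torch.Tensor:
    # dA_cs: within-chunk cumulative sum of dA; seq_start: in-chunk
    # offset where the current sequence begins (0 when aligned).
    # buggy: torch.exp(dA_cs[-1]) decays from the chunk start, i.e.
    # over dA terms that belong to the previous sequence.
    # fixed: subtract the cumsum at the sequence start, so the initial
    # state only decays over the current sequence's tokens.
    base = dA_cs[seq_start - 1] if seq_start > 0 else dA_cs.new_zeros(())
    return torch.exp(dA_cs[-1] - base)
```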
Other changes:
- Add a new unit test for mamba_chunk_scan_combined with prefill chunking and varlen batching, comparing chunked results to those of the full sequence.
- Add documentation for _query_start_loc_to_chunk_indices_offsets, which is a somewhat cryptic function.

Test Plan

Make sure all cases in tests/kernels/mamba/test_mamba_ssm_ssd.py::test_mamba_chunk_scan_cont_batch_prefill_chunking pass. These cases fail on main.

Test Result

Tests pass