[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 #21783

RishiAstra · 2025-07-28T18:17:08Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR fixes a correctness bug for varlen Mamba2 SSD (prefill).

Some of the existing tests required a high tolerance such as 50% rtol (see #21379), and a larger test of 768 tokens and 128 chunk_size failed (other tests seem to be only 64 tokens). These issues are fixed in this PR.

Here is an instinctive explanation of the bug:

There was an unnecessary offset in reading dA_cs_m_boundary, causing a "later" (more decay) value to be read if BLOCK_SIZE_M < chunk_size
The effective dA which was offset (subtract) by dA_cs_m_boundary and was therefore "earlier" (less decay)
The scale_m was therefore off, representing less decay
The prev_states or initstates would have too much influence on the current token due to scale_m

Note: we require that chunk_size is a power of 2. The chunk_size does not affect correctness and can be tuned for performance, so it's not useful to have a non-power-of-2 chunk_size since Triton tensors would anyway need to be padded to powers of 2. One of the test cases was removed, and one was adjusted to use chunk_size=16 instead of 17.

After these changes, all tests pass. The test_mamba_chunk_scan_cont_batch tests that used to have 50% rtol in some cases now all use atol, rtol = 5e-3, 5e-3 and the corresponding if and TODO were removed.

Test Plan

Add new larger test size, remove a bad test, adjust a bad test, fix tolerances, and run:
pytest tests/kernels/mamba/test_mamba_ssm_ssd.py

Test Result

All tests pass
342 passed in 933.16s (0:15:33)

(Optional) Documentation Update

None, but an assert will trigger if chunk_size is not a power of 2. This probably does not need to be documented, especially given the assertion message chunk_size must be integer power of 2.

github-actions · 2025-07-28T18:17:17Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This pull request provides a well-reasoned and correctly implemented fix for a correctness bug in the Mamba2 SSD implementation, specifically addressing an issue with initial state decay in variable-length sequences. The core change in the Triton kernel logic appears sound and directly targets the described bug.

The pull request also improves the overall robustness and performance characteristics of the code by enforcing that chunk_size must be a power of two, accompanied by a clear assertion. The test suite has been thoughtfully updated to reflect these changes, with adjusted test cases, tighter tolerances, and a new test to validate the bug fix under scale. The changes are consistent and of high quality.

tdoublep

LGTM

mergify · 2025-08-03T07:59:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @RishiAstra.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

RishiAstra · 2025-08-04T17:39:58Z

I rebased it and it still passes all tests

vllm/model_executor/layers/mamba/ops/ssd_combined.py

tlrmchlsmth

Thank you!

RishiAstra · 2025-08-05T14:43:53Z

Mamba2 SSD tests failed on GitHub but worked locally on an A100 and H100. I increased the tolerance slightly based on the GitHub test output. The tolerances are still less than the test_flash_attn.py tolerances.

cyang49 · 2025-08-11T14:45:45Z

@RishiAstra Could you merge from main and retry? Some failures might have been fixed in main

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

RishiAstra · 2025-08-12T01:32:46Z

@cyang49 done, also had rebased 3 days ago but it didn't fix the test failures then. Is it possible to merge with some test failures not related to the bug fix?

DarkLight1337 · 2025-08-12T09:04:31Z

Should be fine to merge

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

gemini-code-assist bot reviewed Jul 28, 2025

View reviewed changes

RishiAstra marked this pull request as ready for review July 28, 2025 18:59

RishiAstra requested review from WoosukKwon and tlrmchlsmth as code owners July 28, 2025 18:59

tdoublep approved these changes Jul 29, 2025

View reviewed changes

mergify bot added the needs-rebase label Aug 3, 2025

RishiAstra force-pushed the main branch from 406d9c4 to 5cef3cf Compare August 4, 2025 13:50

mergify bot removed the needs-rebase label Aug 4, 2025

RishiAstra force-pushed the main branch 2 times, most recently from 959185a to 3526dbe Compare August 4, 2025 15:49

tlrmchlsmth reviewed Aug 4, 2025

View reviewed changes

vllm/model_executor/layers/mamba/ops/ssd_combined.py Outdated Show resolved Hide resolved

tlrmchlsmth approved these changes Aug 4, 2025

View reviewed changes

tlrmchlsmth enabled auto-merge (squash) August 4, 2025 19:05

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 4, 2025

cyang49 mentioned this pull request Aug 4, 2025

[Model] Mamba2 varlen and metadata refactor #21467

Merged

4 tasks

auto-merge was automatically disabled August 5, 2025 14:42
Head branch was pushed to by a user without write access

RishiAstra force-pushed the main branch from e5180ce to 4c92955 Compare August 8, 2025 14:56

RishiAstra added 3 commits August 11, 2025 20:47

fix mamba2 ssd bug initstates decay, improve test, assert chunk pwr 2

4470f77

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

small refactor for clarity

d80c9cc

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

higher test tolerance

c49faaf

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

RishiAstra force-pushed the main branch from 4c92955 to c49faaf Compare August 12, 2025 00:48

vllm-bot merged commit 46ae7f6 into vllm-project:main Aug 12, 2025
38 of 42 checks passed

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, as…

0d29b19

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

tomeras91 mentioned this pull request Aug 20, 2025

[Bugfix] Fix mamba2 prefill chunking #23279

Merged

4 tasks

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, as…

01b7ce4

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, as…

9481322

…sert chunk pwr 2 (vllm-project#21783) Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 #21783

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 #21783

Uh oh!

RishiAstra commented Jul 28, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Jul 28, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

tdoublep left a comment

Uh oh!

mergify bot commented Aug 3, 2025

Uh oh!

RishiAstra commented Aug 4, 2025

Uh oh!

Uh oh!

tlrmchlsmth left a comment

Uh oh!

RishiAstra commented Aug 5, 2025

Uh oh!

cyang49 commented Aug 11, 2025

Uh oh!

RishiAstra commented Aug 12, 2025

Uh oh!

DarkLight1337 commented Aug 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Uh oh!

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 #21783

[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 #21783

Uh oh!

Conversation

RishiAstra commented Jul 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Essential Elements of an Effective PR Description Checklist

Purpose

Test Plan

Test Result

(Optional) Documentation Update

Uh oh!

github-actions bot commented Jul 28, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

tdoublep left a comment

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Aug 3, 2025

Uh oh!

RishiAstra commented Aug 4, 2025

Uh oh!

Uh oh!

tlrmchlsmth left a comment

Choose a reason for hiding this comment

Uh oh!

RishiAstra commented Aug 5, 2025

Uh oh!

cyang49 commented Aug 11, 2025

Uh oh!

RishiAstra commented Aug 12, 2025

Uh oh!

DarkLight1337 commented Aug 12, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

RishiAstra commented Jul 28, 2025 •

edited

Loading