Skip to content

Conversation

@RishiAstra
Copy link
Contributor

@RishiAstra RishiAstra commented Jul 28, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR fixes a correctness bug for varlen Mamba2 SSD (prefill).

Some of the existing tests required a high tolerance such as 50% rtol (see #21379), and a larger test of 768 tokens and 128 chunk_size failed (other tests seem to be only 64 tokens). These issues are fixed in this PR.

Here is an instinctive explanation of the bug:

  1. There was an unnecessary offset in reading dA_cs_m_boundary, causing a "later" (more decay) value to be read if BLOCK_SIZE_M < chunk_size
  2. The effective dA which was offset (subtract) by dA_cs_m_boundary and was therefore "earlier" (less decay)
  3. The scale_m was therefore off, representing less decay
  4. The prev_states or initstates would have too much influence on the current token due to scale_m

Note: we require that chunk_size is a power of 2. The chunk_size does not affect correctness and can be tuned for performance, so it's not useful to have a non-power-of-2 chunk_size since Triton tensors would anyway need to be padded to powers of 2. One of the test cases was removed, and one was adjusted to use chunk_size=16 instead of 17.

After these changes, all tests pass. The test_mamba_chunk_scan_cont_batch tests that used to have 50% rtol in some cases now all use atol, rtol = 5e-3, 5e-3 and the corresponding if and TODO were removed.

Test Plan

Add new larger test size, remove a bad test, adjust a bad test, fix tolerances, and run:
pytest tests/kernels/mamba/test_mamba_ssm_ssd.py

Test Result

All tests pass
342 passed in 933.16s (0:15:33)

(Optional) Documentation Update

None, but an assert will trigger if chunk_size is not a power of 2. This probably does not need to be documented, especially given the assertion message chunk_size must be integer power of 2.

@github-actions
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a well-reasoned and correctly implemented fix for a correctness bug in the Mamba2 SSD implementation, specifically addressing an issue with initial state decay in variable-length sequences. The core change in the Triton kernel logic appears sound and directly targets the described bug.

The pull request also improves the overall robustness and performance characteristics of the code by enforcing that chunk_size must be a power of two, accompanied by a clear assertion. The test suite has been thoughtfully updated to reflect these changes, with adjusted test cases, tighter tolerances, and a new test to validate the bug fix under scale. The changes are consistent and of high quality.

@RishiAstra RishiAstra marked this pull request as ready for review July 28, 2025 18:59
Copy link
Member

@tdoublep tdoublep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mergify
Copy link

mergify bot commented Aug 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @RishiAstra.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@RishiAstra
Copy link
Contributor Author

I rebased it and it still passes all tests

Copy link
Member

@tlrmchlsmth tlrmchlsmth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) August 4, 2025 19:05
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 4, 2025
auto-merge was automatically disabled August 5, 2025 14:42

Head branch was pushed to by a user without write access

@RishiAstra
Copy link
Contributor Author

Mamba2 SSD tests failed on GitHub but worked locally on an A100 and H100. I increased the tolerance slightly based on the GitHub test output. The tolerances are still less than the test_flash_attn.py tolerances.

@cyang49
Copy link
Contributor

cyang49 commented Aug 11, 2025

@RishiAstra Could you merge from main and retry? Some failures might have been fixed in main

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
@RishiAstra
Copy link
Contributor Author

@cyang49 done, also had rebased 3 days ago but it didn't fix the test failures then. Is it possible to merge with some test failures not related to the bug fix?

@DarkLight1337
Copy link
Member

Should be fine to merge

@vllm-bot vllm-bot merged commit 46ae7f6 into vllm-project:main Aug 12, 2025
38 of 42 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
…sert chunk pwr 2 (vllm-project#21783)

Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants