
Conversation

@Flink-ddd
Contributor

@Flink-ddd Flink-ddd commented Nov 14, 2025

Purpose

This PR fixes a potential out-of-bounds (OOB) memory access in the gather_and_maybe_dequant_cache CUDA kernel, as originally reported in Issue #27909.

The bug was identified by static analysis. The root cause was that the offset (calculated as seq_starts[bid] / block_size) was not validated against block_table_stride (the bound of each batch's row in the block table) before being used to index batch_block_table.

  1. For the full-block loop: the fix computes a safe upper bound (loop_end) before the loop starts, using std::min((int64_t)full_blocks_end, block_table_stride - offset), so the loop never iterates past the safe boundary.
  2. For the partial block: a standard bounds check (if (offset + full_blocks_end < block_table_stride)) is added so the final, possibly partial block is only accessed when its table entry is in range. A minimal host-side sketch of both checks follows this list.
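
For illustration, here is a small, self-contained host-side C++ model of the same bounds logic. It is a sketch only, not the gather_and_maybe_dequant_cache CUDA kernel itself: the sizes, the seq_start value, and the simplified one-dimensional block_table are made-up values chosen to hit the boundary condition, and the real kernel applies the same clamp per CUDA block starting at split_start.

  // Host-side sketch of the bounds logic; all values below are illustrative.
  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <vector>

  int main() {
    const int64_t block_size = 16;
    const int64_t block_table_stride = 8;        // entries in one batch row (assumed)
    std::vector<int32_t> block_table(block_table_stride, 0);

    // A seq_start near the end of the row pushes the offset close to the bound.
    const int64_t seq_start = 6 * block_size;    // -> offset = 6
    const int64_t offset = seq_start / block_size;
    const int64_t full_blocks_end = 4;           // unclamped, would index up to 6 + 3 = 9

    // 1) Full blocks: clamp the loop bound so offset + pid stays below the stride.
    const int64_t loop_end =
        std::min(full_blocks_end, block_table_stride - offset);
    for (int64_t pid = 0; pid < loop_end; ++pid) {
      const int32_t block_id = block_table[offset + pid];  // always in bounds now
      std::cout << "gather full block, id=" << block_id << "\n";
    }

    // 2) Partial block: only read its table entry if it is still within the row.
    if (offset + full_blocks_end < block_table_stride) {
      const int32_t block_id = block_table[offset + full_blocks_end];
      std::cout << "gather partial block, id=" << block_id << "\n";
    } else {
      std::cout << "partial block skipped: its table entry would be out of bounds\n";
    }
    return 0;
  }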

Test Plan

  1. Unit Test Verification:
    A new unit test (tests/kernels/test_cache_kernels.py::test_gather_cache_oob) was added to specifically reproduce the exact boundary condition described in the issue (where seq_starts causes the offset to point near the end of the table).

  2. Benchmark & Accuracy Verification
    Device: NVIDIA A100 PCIe
    Models:
    meta-llama/Llama-3.1-8B-Instruct (to verify no regression on dense models).
    Qwen/Qwen1.5-MoE-A2.7B-Chat (to verify the fix's impact on MoE architectures).
    Metrics: lm_eval (GSM8K) for accuracy and vllm bench serve for throughput/latency; an example lm_eval invocation is sketched right after this list.
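
For reference, the accuracy runs use lm_eval with the vLLM backend. The exact arguments for these runs are not recorded in the PR, so the command below is only an assumed, typical shape (task and model names taken from this test plan):

  lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --batch_size auto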

Test Result

  1. Sanitizer Verification
    The fix was verified by running the new unit test under compute-sanitizer --tool memcheck.

After the fix: the test passes and compute-sanitizer reports no errors, confirming the OOB access is resolved.

compute-sanitizer --tool memcheck python -m pytest tests/kernels/test_cache_kernels.py
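
To focus the sanitizer run on just the new test rather than the whole file, pytest's -k filter can be added (using the test name from the test plan above):

  compute-sanitizer --tool memcheck python -m pytest tests/kernels/test_cache_kernels.py -k test_gather_cache_oob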

  2. Benchmark & Accuracy Report
    I performed a comprehensive benchmark on an NVIDIA A100 PCIe.
    Llama-3.1-8B (Dense): shows no regression. Accuracy and performance remain stable, confirming the fix is safe for general use.
    Qwen1.5-MoE (MoE): shows improved accuracy (+4 points strict match) and higher throughput (~10%). This suggests the previous OOB access might have been causing silent data corruption or memory access inefficiencies in the MoE logic.

Benchmark Serving (NVIDIA A100 PCIe)

Model                     Metric                       Before      After       Change (%)
Qwen1.5-MoE-A2.7B-Chat    Output Throughput (tok/s)    1137.14     1260.22     +10.8%
                          Total Throughput (tok/s)     2274.29     2520.44     +10.8%
                          TTFT (ms)                    27306.45    24614.76    +11.0%
                          TPOT (ms)                    27.61       24.91       +9.8%
                          ITL (ms)                     27.61       24.91       +9.8%
Llama-3.1-8B-Instruct     Output Throughput (tok/s)    2464.21     2483.39     +0.8%
                          Total Throughput (tok/s)     4925.96     4964.25     +0.8%
                          TTFT (ms)                    5922.80     5972.19     -0.8%
                          TPOT (ms)                    20.06       19.81       +1.3%
                          ITL (ms)                     20.06       19.81       +1.3%

LM_EVAL (NVIDIA A100 PCIe, gsm8k)

Model                     Metric                  Before    After     Change (%)
Qwen1.5-MoE-A2.7B-Chat    Exact Match             0.485     0.515     +6.19%
                          Exact Match Std Err     0.0354    0.0354    +0.00%
                          Strict Match            0.360     0.400     +11.11%
                          Strict Match Std Err    0.0340    0.0347    +2.06%
Llama-3.1-8B-Instruct     Exact Match             0.795     0.795     +0.00%
                          Exact Match Std Err     0.0286    0.0286    +0.00%
                          Strict Match            0.775     0.775     +0.00%
                          Strict Match Std Err    0.0296    0.0296    +0.00%

@Flink-ddd
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a critical out-of-bounds memory access in the gather_and_maybe_dequant_cache CUDA kernel by introducing necessary bounds checks. The inclusion of a new unit test to specifically reproduce and verify the fix is a great addition that improves the robustness of the codebase. My review includes a performance-related suggestion for the CUDA kernel to avoid potential thread divergence introduced by the fix, which will ensure the kernel remains efficient.

Comment on lines 968 to 978
  for (int pid = split_start; pid < full_blocks_end; ++pid) {
-   auto block_id = batch_block_table[pid];
-   auto block_start_ptr = src_cache + block_id * cache_block_stride;
-   auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
-   for (int eid = 0; eid < block_size; ++eid) {
-     copy_entry(block_start_ptr + eid * cache_entry_stride,
+   if (offset + pid < block_table_stride){
+     auto block_id = batch_block_table[pid];
+     auto block_start_ptr = src_cache + block_id * cache_block_stride;
+     auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
+     for (int eid = 0; eid < block_size; ++eid) {
+       copy_entry(block_start_ptr + eid * cache_entry_stride,
+                  block_dst_ptr + eid * dst_entry_stride);
+     }
+   }
  }
Contributor

Severity: high

While adding the if condition inside the loop correctly prevents the out-of-bounds access, it can introduce thread divergence on the GPU, which can negatively impact performance. A more efficient approach in CUDA is to adjust the loop's upper bound to avoid branching within the loop. This ensures all threads in a warp execute the same path.

  const auto loop_end = std::min((int64_t)full_blocks_end, block_table_stride - offset);
  for (int pid = split_start; pid < loop_end; ++pid) {
    auto block_id = batch_block_table[pid];
    auto block_start_ptr = src_cache + block_id * cache_block_stride;
    auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
    for (int eid = 0; eid < block_size; ++eid) {
      copy_entry(block_start_ptr + eid * cache_entry_stride,
                 block_dst_ptr + eid * dst_entry_stride);
    }
  }

@Flink-ddd
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/issue-27909-cache-kernel-oob branch from 2e048d1 to 00432bf on November 15, 2025 09:02
@Flink-ddd
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical out-of-bounds memory access vulnerability in the gather_and_maybe_dequant_cache CUDA kernel. The fix correctly introduces boundary checks by calculating a safe upper bound for the loop and adding a conditional check for partial block processing. These changes are robust and correctly handle edge cases. The addition of a specific unit test to reproduce the OOB condition is excellent, ensuring this regression won't occur in the future. The implementation is clean and correct, and I have no further suggestions.

@Flink-ddd
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!

@Flink-ddd Flink-ddd marked this pull request as ready for review November 15, 2025 09:12
Member

@yewentao256 yewentao256 left a comment

Thanks for the work!
Could you also run lm_eval to verify the acc and vllm bench to show the E2E performance?

@Flink-ddd
Contributor Author

Thanks for the work! Could you also run lm_eval to verify the acc and vllm bench to show the E2E performance?

Absolutely, thank you for the suggestion. I will run lm_eval and the benchmarks and update the PR.

@Flink-ddd
Contributor Author

Hi @yewentao256, sorry for the late reply; I've been quite busy these past few days. I've updated the lm_eval and benchmark tables. If you have time, please take a look.

Member

@yewentao256 yewentao256 left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 18, 2025
Signed-off-by: Vensenmu <vensenmu@gmail.com>
@Flink-ddd
Contributor Author

Hi @yewentao256, thanks for your approval.

@Flink-ddd
Contributor Author

Hi @yewentao256, I have merged the latest changes from main using the Update branch button.
I also investigated the failing CI checks: the external/flaky failures (403 and UCX errors) are unrelated to this fix. When you have time, could you please re-run the CI for me?

@Flink-ddd
Contributor Author

Hi @yewentao256, the latest run failed due to an OOM error in the examples-test. This seems like an environment resource issue, not a code bug. When you have time, could you please check the Buildkite resource allocation or re-queue this test on a node with more GPU memory? Thanks!

@Flink-ddd
Contributor Author

Hi @yewentao256, the previous CI run failed again due to the OOM error in the examples-test, and the core kernel fix is complete, so this seems to be a persistent CI resource limitation. Could you please help adjust the CI configuration to run this specific test on a larger GPU node pool? Thank you!

@vllm-bot vllm-bot merged commit fb8851f into vllm-project:main Nov 20, 2025
4 of 7 checks passed
@DarkLight1337
Member

Force merging as the OOM isn't related

LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Signed-off-by: LuminolT <lumischen01@gmail.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>

Labels

gpt-oss: Related to GPT-OSS models
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants