
Conversation

@Flink-ddd
Contributor

@Flink-ddd Flink-ddd commented Nov 14, 2025

Purpose

This PR fixes a potential out-of-bounds (OOB) memory access in the gather_and_maybe_dequant_cache CUDA kernel, as originally reported in Issue #27909.

The bug was identified by static analysis. The root cause was that the offset (calculated as seq_starts[bid] / block_size) was not validated against block_table_stride (the bound of each batch's row in the block table) before being used to index batch_block_table.

  1. For the full-block loop: the fix computes a safe upper bound (loop_end) before the loop starts, using std::min((int64_t)full_blocks_end, block_table_stride - offset), so the loop never iterates past the safe boundary.
  2. For the partial block: a standard bounds check (if (offset + full_blocks_end < block_table_stride)) is added so the final, possibly partial block is only accessed when its table entry is in range. A minimal host-side sketch of both checks follows this list.
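
For illustration, here is a small, self-contained host-side C++ model of the same bounds logic. It is a sketch only, not the gather_and_maybe_dequant_cache CUDA kernel itself: the sizes, the seq_start value, and the simplified one-dimensional block_table are made-up values chosen to hit the boundary condition, and the real kernel applies the same clamp per CUDA block starting at split_start.

  // Host-side sketch of the bounds logic; all values below are illustrative.
  #include <algorithm>
  #include <cstdint>
  #include <iostream>
  #include <vector>

  int main() {
    const int64_t block_size = 16;
    const int64_t block_table_stride = 8;        // entries in one batch row (assumed)
    std::vector<int32_t> block_table(block_table_stride, 0);

    // A seq_start near the end of the row pushes the offset close to the bound.
    const int64_t seq_start = 6 * block_size;    // -> offset = 6
    const int64_t offset = seq_start / block_size;
    const int64_t full_blocks_end = 4;           // unclamped, would index up to 6 + 3 = 9

    // 1) Full blocks: clamp the loop bound so offset + pid stays below the stride.
    const int64_t loop_end =
        std::min(full_blocks_end, block_table_stride - offset);
    for (int64_t pid = 0; pid < loop_end; ++pid) {
      const int32_t block_id = block_table[offset + pid];  // always in bounds now
      std::cout << "gather full block, id=" << block_id << "\n";
    }

    // 2) Partial block: only read its table entry if it is still within the row.
    if (offset + full_blocks_end < block_table_stride) {
      const int32_t block_id = block_table[offset + full_blocks_end];
      std::cout << "gather partial block, id=" << block_id << "\n";
    } else {
      std::cout << "partial block skipped: its table entry would be out of bounds\n";
    }
    return 0;
  }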

Test Plan

  1. Unit Test Verification:
    A new unit test (tests/kernels/test_cache_kernels.py::test_gather_cache_oob) was added to specifically reproduce the exact boundary condition described in the issue (where seq_starts causes the offset to point near the end of the table).

  2. Benchmark & Accuracy Verification
    Device: NVIDIA A100 PCIe
    Models:
    meta-llama/Llama-3.1-8B-Instruct (to verify no regression on dense models).
    Qwen/Qwen1.5-MoE-A2.7B-Chat (to verify the fix's impact on MoE architectures).
    Metrics: lm_eval (GSM8K) for accuracy and vllm bench serve for throughput/latency; an example lm_eval invocation is sketched right after this list.
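
For reference, the accuracy runs use lm_eval with the vLLM backend. The exact arguments for these runs are not recorded in the PR, so the command below is only an assumed, typical shape (task and model names taken from this test plan):

  lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --batch_size auto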

Test Result

  1. Sanitizer Verification
    The fix was verified by running the new unit test under compute-sanitizer --tool memcheck.

After the fix: the test passes and compute-sanitizer reports no errors, confirming the OOB access is resolved.

compute-sanitizer --tool memcheck python -m pytest tests/kernels/test_cache_kernels.py
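
To focus the sanitizer run on just the new test rather than the whole file, pytest's -k filter can be added (using the test name from the test plan above):

  compute-sanitizer --tool memcheck python -m pytest tests/kernels/test_cache_kernels.py -k test_gather_cache_oob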

  2. Benchmark & Accuracy Report
    I performed a comprehensive benchmark on an NVIDIA A100 PCIe.
    Llama-3.1-8B (Dense): shows no regression. Accuracy and performance remain stable, confirming the fix is safe for general use.
    Qwen1.5-MoE (MoE): shows improved accuracy (+4 points strict match) and higher throughput (~10%). This suggests the previous OOB access might have been causing silent data corruption or memory access inefficiencies in the MoE logic.

Benchmark Serving (NVIDIA A100 PCIe)

Model                     Metric                       Before      After       Change (%)
Qwen1.5-MoE-A2.7B-Chat    Output Throughput (tok/s)    1137.14     1260.22     +10.8%
                          Total Throughput (tok/s)     2274.29     2520.44     +10.8%
                          TTFT (ms)                    27306.45    24614.76    +11.0%
                          TPOT (ms)                    27.61       24.91       +9.8%
                          ITL (ms)                     27.61       24.91       +9.8%
Llama-3.1-8B-Instruct     Output Throughput (tok/s)    2464.21     2483.39     +0.8%
                          Total Throughput (tok/s)     4925.96     4964.25     +0.8%
                          TTFT (ms)                    5922.80     5972.19     -0.8%
                          TPOT (ms)                    20.06       19.81       +1.3%
                          ITL (ms)                     20.06       19.81       +1.3%

LM_EVAL (NVIDIA A100 PCIe, gsm8k)

Model                     Metric                  Before    After     Change (%)
Qwen1.5-MoE-A2.7B-Chat    Exact Match             0.485     0.515     +6.19%
                          Exact Match Std Err     0.0354    0.0354    +0.00%
                          Strict Match            0.360     0.400     +11.11%
                          Strict Match Std Err    0.0340    0.0347    +2.06%
Llama-3.1-8B-Instruct     Exact Match             0.795     0.795     +0.00%
                          Exact Match Std Err     0.0286    0.0286    +0.00%
                          Strict Match            0.775     0.775     +0.00%
                          Strict Match Std Err    0.0296    0.0296    +0.00%

@Flink-ddd
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses a critical out-of-bounds memory access in the gather_and_maybe_dequant_cache CUDA kernel by introducing necessary bounds checks. The inclusion of a new unit test to specifically reproduce and verify the fix is a great addition that improves the robustness of the codebase. My review includes a performance-related suggestion for the CUDA kernel to avoid potential thread divergence introduced by the fix, which will ensure the kernel remains efficient.

Comment on lines 968 to 978
  for (int pid = split_start; pid < full_blocks_end; ++pid) {
-   auto block_id = batch_block_table[pid];
-   auto block_start_ptr = src_cache + block_id * cache_block_stride;
-   auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
-   for (int eid = 0; eid < block_size; ++eid) {
-     copy_entry(block_start_ptr + eid * cache_entry_stride,
+   if (offset + pid < block_table_stride){
+     auto block_id = batch_block_table[pid];
+     auto block_start_ptr = src_cache + block_id * cache_block_stride;
+     auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
+     for (int eid = 0; eid < block_size; ++eid) {
+       copy_entry(block_start_ptr + eid * cache_entry_stride,
+                  block_dst_ptr + eid * dst_entry_stride);
+     }
+   }
  }
Contributor

Severity: high

While adding the if condition inside the loop correctly prevents the out-of-bounds access, it can introduce thread divergence on the GPU, which can negatively impact performance. A more efficient approach in CUDA is to adjust the loop's upper bound to avoid branching within the loop. This ensures all threads in a warp execute the same path.

  const auto loop_end = std::min((int64_t)full_blocks_end, block_table_stride - offset);
  for (int pid = split_start; pid < loop_end; ++pid) {
    auto block_id = batch_block_table[pid];
    auto block_start_ptr = src_cache + block_id * cache_block_stride;
    auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
    for (int eid = 0; eid < block_size; ++eid) {
      copy_entry(block_start_ptr + eid * cache_entry_stride,
                 block_dst_ptr + eid * dst_entry_stride);
    }
  }

@Flink-ddd
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. What shall we delve into next?

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/issue-27909-cache-kernel-oob branch from 2e048d1 to 00432bf on November 15, 2025 09:02
@Flink-ddd
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical out-of-bounds memory access vulnerability in the gather_and_maybe_dequant_cache CUDA kernel. The fix correctly introduces boundary checks by calculating a safe upper bound for the loop and adding a conditional check for partial block processing. These changes are robust and correctly handle edge cases. The addition of a specific unit test to reproduce the OOB condition is excellent, ensuring this regression won't occur in the future. The implementation is clean and correct, and I have no further suggestions.

@Flink-ddd
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Hooray!

@Flink-ddd Flink-ddd marked this pull request as ready for review November 15, 2025 09:12
Member

@yewentao256 yewentao256 left a comment

Thanks for the work!
Could you also run lm_eval to verify the acc and vllm bench to show the E2E performance?

@Flink-ddd
Contributor Author

Thanks for the work! Could you also run lm_eval to verify the acc and vllm bench to show the E2E performance?

Absolutely, thank you for the suggestion. I will run lm_eval and the benchmarks and update the PR.

@Flink-ddd
Contributor Author

Hi @yewentao256, sorry for the late reply; I've been quite busy these past few days. I've updated the lm_eval and benchmark tables. If you have time, please take a look.

Member

@yewentao256 yewentao256 left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 18, 2025
Signed-off-by: Vensenmu <vensenmu@gmail.com>
@Flink-ddd
Contributor Author

Hi @yewentao256, thanks for your approval.

@Flink-ddd
Contributor Author

Hi @yewentao256, I have merged the latest changes from main using the Update branch button.
I also investigated the failing CI checks: the external/flaky failures (403 and UCX errors) are unrelated to this fix. When you have time, could you please re-run the CI for me?

@Flink-ddd
Contributor Author

Hi @yewentao256, the latest run failed due to an OOM error in the examples-test. This seems like an environment resource issue, not a code bug. When you have time, could you please check the Buildkite resource allocation or re-queue this test on a node with more GPU memory? Thanks!

@Flink-ddd
Contributor Author

Hi @yewentao256, the previous CI run failed again due to the OOM error in the examples-test, and the core kernel fix is complete, so this seems to be a persistent CI resource limitation. Could you please help adjust the CI configuration to run this specific test on a larger GPU node pool? Thank you!

@vllm-bot vllm-bot merged commit fb8851f into vllm-project:main Nov 20, 2025
4 of 7 checks passed
@DarkLight1337
Member

Force merging as the OOM isn't related

LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Signed-off-by: LuminolT <lumischen01@gmail.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Vensenmu <vensenmu@gmail.com>

Labels

gpt-oss: Related to GPT-OSS models
ready: ONLY add when PR is ready to merge/full CI is needed

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants