[Bugfix][cache_kernels]: Fix OOB in cache_kernels.cu #28760
Conversation
/gemini review
Code Review
This pull request correctly addresses a critical out-of-bounds memory access in the gather_and_maybe_dequant_cache CUDA kernel by introducing necessary bounds checks. The inclusion of a new unit test to specifically reproduce and verify the fix is a great addition that improves the robustness of the codebase. My review includes a performance-related suggestion for the CUDA kernel to avoid potential thread divergence introduced by the fix, which will ensure the kernel remains efficient.
csrc/cache_kernels.cu (Outdated)
```diff
   for (int pid = split_start; pid < full_blocks_end; ++pid) {
-    auto block_id = batch_block_table[pid];
-    auto block_start_ptr = src_cache + block_id * cache_block_stride;
-    auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
-    for (int eid = 0; eid < block_size; ++eid) {
-      copy_entry(block_start_ptr + eid * cache_entry_stride,
+    if (offset + pid < block_table_stride) {
+      auto block_id = batch_block_table[pid];
+      auto block_start_ptr = src_cache + block_id * cache_block_stride;
+      auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
+      for (int eid = 0; eid < block_size; ++eid) {
+        copy_entry(block_start_ptr + eid * cache_entry_stride,
+                   block_dst_ptr + eid * dst_entry_stride);
+      }
     }
   }
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While adding the if condition inside the loop correctly prevents the out-of-bounds access, it can introduce thread divergence on the GPU, which can negatively impact performance. A more efficient approach in CUDA is to adjust the loop's upper bound to avoid branching within the loop. This ensures all threads in a warp execute the same path.
```cpp
const auto loop_end =
    std::min((int64_t)full_blocks_end, block_table_stride - offset);
for (int pid = split_start; pid < loop_end; ++pid) {
  auto block_id = batch_block_table[pid];
  auto block_start_ptr = src_cache + block_id * cache_block_stride;
  auto block_dst_ptr = dst + pid * block_size * dst_entry_stride;
  for (int eid = 0; eid < block_size; ++eid) {
    copy_entry(block_start_ptr + eid * cache_entry_stride,
               block_dst_ptr + eid * dst_entry_stride);
  }
}
```
@codex review
Codex Review: Didn't find any major issues. What shall we delve into next?
Signed-off-by: vensen <vensenmu@gmail.com>
Force-pushed from 2e048d1 to 00432bf
/gemini review
Code Review
This pull request effectively resolves a critical out-of-bounds memory access vulnerability in the gather_and_maybe_dequant_cache CUDA kernel. The fix correctly introduces boundary checks by calculating a safe upper bound for the loop and adding a conditional check for partial block processing. These changes are robust and correctly handle edge cases. The addition of a specific unit test to reproduce the OOB condition is excellent, ensuring this regression won't occur in the future. The implementation is clean and correct, and I have no further suggestions.
@codex review
Codex Review: Didn't find any major issues. Hooray!
Thanks for the work!
Could you also run lm_eval to verify the accuracy and vllm bench to show the E2E performance?
Absolutely, thank you for the suggestion. I will run lm_eval and the benchmark and update the PR.
Hi @yewentao256, sorry for the late reply; I've been quite busy these past few days. I have updated the lm_eval and benchmark tables. Please take a look when you have time.
yewentao256 left a comment
LGTM, thanks for the work!
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Hi @yewentao256, thanks for your approval.
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Hi @yewentao256, I have merged the latest changes from main using the Update branch button.
Hi @yewentao256, the latest run failed due to an OOM error in the examples-test. This looks like an environment resource issue rather than a code bug. When you have time, could you please check the Buildkite resource allocation or re-queue this test on a node with more GPU memory? Thanks!
Hi @yewentao256, since the CI run failed again with the same OOM error in the examples-test, and the core kernel fix is complete, this looks like a persistent CI resource limitation. Could you please help adjust the CI configuration to run this specific test on a larger GPU node pool? Thank you!
Force merging as the OOM isn't related |
Purpose
This PR fixes a potential out-of-bounds (OOB) memory access in the gather_and_maybe_dequant_cache CUDA kernel, as originally reported in Issue #27909.
The bug was identified by static analysis. The root cause was that the offset (calculated as seq_starts[bid] / block_size) was not validated against block_table_stride (the per-row bound of the block table) before being used to index the batch_block_table pointer.
Test Plan
Unit Test Verification:
A new unit test (tests/kernels/test_cache_kernels.py::test_gather_cache_oob) was added to specifically reproduce the exact boundary condition described in the issue (where seq_starts causes the offset to point near the end of the table).
Benchmark & Accuracy Verification
Device: NVIDIA A100 PCIe
Models:
meta-llama/Llama-3.1-8B-Instruct (To verify no regression on Dense models).
Qwen/Qwen1.5-MoE-A2.7B-Chat (To verify fix impact on MoE architectures).
Metrics: lm_eval (GSM8K) for accuracy and vllm bench serve for throughput/latency.
Test Result
The fix was verified by running the new unit test under compute-sanitizer --tool memcheck:

```shell
compute-sanitizer --tool memcheck python -m pytest tests/kernels/test_cache_kernels.py
```

After the fix, the test passes and compute-sanitizer reports no errors, confirming the OOB access is resolved.
I also performed a comprehensive benchmark on an NVIDIA A100 PCIe.
Llama-3.1-8B (Dense): Shows no regression. Accuracy and performance remain stable, confirming the fix is safe for general use.
Qwen1.5-MoE (MoE): Shows improved accuracy (+4% strict match) and higher throughput (~10%). This suggests the previous OOB issue might have been causing silent data corruption or memory access inefficiencies in MoE logic.
Benchmark Serving (NVIDIA A100 PCIe)
LM_EVAL (NVIDIA A100 PCIe, gsm8k)