[Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 #27114
Conversation
Code Review
The pull request introduces a bug fix that uses PIECEWISE cudagraphs on Blackwell architecture if the max_model_len exceeds 131072. The code changes modify the VllmConfig class to check for this condition and override the cudagraph_mode accordingly. The changes also include adding warning messages to the logger.
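As a rough illustration of the override pattern described above (a minimal sketch only, not the actual vLLM code: the function name, argument shapes, and Blackwell check below are assumptions; only the `cudagraph_mode` override, the 131072 threshold, and the warning log are taken from this PR):

```python
import logging

logger = logging.getLogger(__name__)

# Threshold from the PR title; sequence lengths above this are assumed to be
# unsupported by the TRTLLM attention path during FULL cudagraph capture.
TRTLLM_ATTN_MAX_SEQ_LEN = 131072


def maybe_force_piecewise_cudagraphs(compilation_config, model_config,
                                     device_capability):
    """Force PIECEWISE cudagraphs on Blackwell when max_model_len exceeds
    the sequence length that FULL-graph capture can safely assume."""
    is_blackwell = device_capability[0] == 10  # SM 10.x (assumed check)
    if is_blackwell and model_config.max_model_len > TRTLLM_ATTN_MAX_SEQ_LEN:
        logger.warning(
            "max_model_len (%d) > %d on Blackwell: overriding cudagraph_mode "
            "from %s to PIECEWISE to avoid an attention-backend mismatch "
            "between FULL cudagraph capture and runtime.",
            model_config.max_model_len, TRTLLM_ATTN_MAX_SEQ_LEN,
            compilation_config.cudagraph_mode)
        compilation_config.cudagraph_mode = "PIECEWISE"
```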
ywang96 left a comment
Amazing - thank you!
yewentao256 left a comment
LGTM, thanks for the work!
Added this issue to Flashinfer to track the long-term fix for FULL CG support of the TRTLLM backend.
Purpose
FIX #27057
The original issue was found because Qwen3-VL models completely lost accuracy (1% vs 86% on GSM8K) on B200 GPUs when using the default `FULL_AND_PIECEWISE` cudagraph_mode. The issue did not occur on Hopper at all, nor with PIECEWISE-only mode, the FlashAttention backend, or when TRTLLM attention was explicitly disabled.
This happens because TRTLLM attention is selected dynamically based on runtime conditions (`num_tokens`, `max_seq_len`, `kv_cache_dtype`). During FULL CG capture, the `max_seq_len` is used, which when greater than 128K results in FlashInfer being selected; during actual inference that does not use the full context length, the same conditions trigger TRTLLM selection. This creates a graph/runtime mismatch where the captured graphs reference FlashInfer kernels but the runtime attempts to execute TRTLLM kernels, producing incorrect results. I was able to see this behavior on any model with a default `max_model_len` > 128K.

By enforcing PIECEWISE mode in this PR to disable cuda graph capture of attention, we avoid this issue of dynamism. In the future we should see if we can make TRTLLM support larger context lengths so that FULL graphs are supported.
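The capture/runtime divergence can be pictured with a toy version of the selection logic (an illustrative sketch only; the function name and threshold below are not the actual FlashInfer backend code, only the selection inputs and the 128K boundary come from the description above):

```python
# Toy illustration of the graph/runtime mismatch described above.
TRTLLM_SEQ_LEN_LIMIT = 131072  # assumed TRTLLM attention sequence-length cap


def select_attention_kernel(num_tokens: int, max_seq_len: int,
                            kv_cache_dtype: str) -> str:
    # Illustrative only: TRTLLM when the sequence length fits, else FlashInfer.
    if max_seq_len <= TRTLLM_SEQ_LEN_LIMIT:
        return "trtllm"
    return "flashinfer"


# FULL cudagraph capture sees the model's maximum sequence length ...
captured = select_attention_kernel(num_tokens=8192,
                                   max_seq_len=262144,  # > 128K -> flashinfer
                                   kv_cache_dtype="auto")

# ... but a typical decode step runs far below that limit.
runtime = select_attention_kernel(num_tokens=1,
                                  max_seq_len=4096,     # <= 128K -> trtllm
                                  kv_cache_dtype="auto")

# The captured graph references one kernel while runtime wants the other.
assert captured != runtime
```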
Test Plan
Test Result
Reproduction on B200 on main:
Running on this PR:
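For reference, a minimal offline sanity check on a long-context model could look like the sketch below; the model id, `max_model_len`, and prompt are placeholders rather than the exact reproduction setup used here:

```python
# Hedged sketch of an offline check with vLLM's Python API; not the exact
# commands or model from this PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",  # placeholder model id
    max_model_len=200_000,                   # > 131072 triggers the override
)
prompts = ["Natalia sold clips to 48 of her friends in April, and then she "
           "sold half as many clips in May. How many clips did she sell "
           "altogether in April and May?"]
outputs = llm.generate(prompts, SamplingParams(temperature=0, max_tokens=256))
print(outputs[0].outputs[0].text)
```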
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.