[Feature] Allow configuring FlashInfer workspace size (#25342) #25344
Conversation
Allow configuring FlashInfer workspace size (#25342)
The FlashInfer workspace buffer size is currently hardcoded, which can cause overflows for some use cases. This commit makes the buffer size configurable via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable.
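For context, a minimal usage sketch of the proposed variable, assuming it is read from the environment at startup; the 512 MiB value and the model name are illustrative choices, not recommendations:

```python
import os

# Hypothetical usage: set the variable before vLLM reads its environment.
# 512 MiB is an arbitrary illustration, not a recommended default.
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(512 * 1024 * 1024)

from vllm import LLM  # noqa: E402

llm = LLM(model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
          tensor_parallel_size=4)
```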
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which starts only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a way to configure the FlashInfer workspace buffer size via an environment variable, which is a useful enhancement. However, the current implementation has a significant robustness issue: it will crash at import time if the environment variable is set to a non-integer value. This should be handled gracefully, for instance by using a try-except block and falling back to the default value. Additionally, the logic for reading this environment variable is duplicated in three different files. To improve maintainability, this logic should be centralized into a single helper function. I've provided suggestions on the relevant files to address the error handling.
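A minimal sketch of the pattern the review suggests: one centralized helper that parses the variable and falls back gracefully. The function name, logger use, and default value are illustrative, not vLLM's actual code:

```python
import logging
import os

logger = logging.getLogger(__name__)

# 128 MiB, matching the currently hardcoded default.
_DEFAULT_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024


def get_flashinfer_workspace_buffer_size() -> int:
    """Parse VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE, falling back to the
    default instead of crashing at import time on a non-integer value."""
    raw = os.getenv("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE")
    if raw is None:
        return _DEFAULT_WORKSPACE_BUFFER_SIZE
    try:
        return int(raw)
    except ValueError:
        logger.warning(
            "Invalid VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=%r; "
            "falling back to %d bytes.", raw, _DEFAULT_WORKSPACE_BUFFER_SIZE)
        return _DEFAULT_WORKSPACE_BUFFER_SIZE
```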
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Krishna Misra <86200923+mishra-krishna@users.noreply.github.com>
LucasWilkinson left a comment
In what cases does it overflow? We should try to improve the defaults if possible.
I am not sure, but I saw an issue where a user had problems with the defaults. In any case, users should have the freedom to set their own workspace size if necessary; otherwise, it can gracefully use the defaults.
#25342 is the feature request.
I don't have a solid repro, but I had crashes with GLM-4.5-Air or Mistral-Large-123b-2411. I use tensor parallelism = 2 and a 128K context size. When I initially looked into the issue, I saw that SGLang has a larger default and also a configurable env variable, see:
```diff
 from vllm.platforms import current_platform

-FLASHINFER_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024
+FLASHINFER_WORKSPACE_BUFFER_SIZE = int(
```
Please move this to vllm/envs.py
+1
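For reference, a sketch of what registering the variable in vllm/envs.py might look like, assuming envs.py's convention of mapping variable names to lazily evaluated callables; this abbreviated entry is an assumption, not the merged change:

```python
import os

# Abbreviated sketch of a vllm/envs.py entry (the real file defines many
# more variables). Lazy lambdas mean a malformed value only fails when
# the variable is actually read, not at import time.
environment_variables = {
    # ... existing entries ...
    "VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE":
    lambda: int(
        os.getenv("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE",
                  str(128 * 1024 * 1024))),
}
```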
@LucasWilkinson, I'm currently getting this when using … I think this may be related to #25444, where we now capture both FULL_AND_PIECEWISE CUDA graphs. When launching vLLM, I get past …
Increasing the buffer size has resolved this issue, as expected. Would definitely be nice to have better defaults, as you suggested, and/or have this configurable!
I can reliably reproduce a …
We could also consistently crash vLLM running Qwen3-235B-A22B-Instruct-2507-FP8 with tensor parallelism 4 on B200s, but like the above, increasing the workspace size fixed the issue. It would be great to get this merged soon in order to unblock B200 deployments.
cc @Victor49152 We can consistently reproduce the same problem for Qwen3-VL-235B-A22B with DP + EP on an 8xB200 node.
FYI, I was able to consistently reproduce this issue with v0.11.0 on B200 for multiple Qwen3 models, but it no longer occurs for me in the current nightly version (Oct 29).
Not sure if it's relevant, but this issue isn't triggered with piecewise compilation (#27114). However, with a smaller …
I am still hitting this with FULL and FULL_AND_PIECEWISE. Using PIECEWISE or NONE works. I am using the nightly with Qwen3-VL-235B with FP8 on 2xB200 / 2xH200 with TP2.
I see. Yes, we are using PIECEWISE.
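For anyone blocked before a fix lands, a hedged sketch of the workaround discussed above: forcing PIECEWISE CUDA graphs through the compilation config. The cudagraph_mode key and the model name are assumptions for illustration, not verified against this PR:

```python
from vllm import LLM

# Workaround sketch based on the reports above: PIECEWISE (or NONE)
# avoids the overflow, while FULL and FULL_AND_PIECEWISE trigger it.
# The "cudagraph_mode" key is an assumption about the compilation
# config schema; adjust to your vLLM version.
llm = LLM(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    tensor_parallel_size=2,
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
```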
At this moment, what is blocking this relatively straightforward PR from merging? I see that @mishra-krishna submitted this PR on Sept 21. @pavanimajety approved it on Oct 3. Today is Nov 6. Our engineers are doing custom builds/hacking to get around this problem. It would be very nice if people didn't have to do that.
I have a problem running GLM-4.6 with tool calls: it fails due to a too-small workspace. I'm not even sure the FLASHINFER_WORKSPACE_BUFFER_SIZE env variable is a good solution. Shouldn't this buffer be sized automatically for specific workloads? Why should users have to maintain this parameter at all? In any case, we need at least a merge ASAP.
This pull request has merge conflicts that must be resolved before it can be merged.
Now that #25344 is merged, this can be closed as a duplicate.