[Feature] Allow configuring FlashInfer workspace size (#25342) #25344
Conversation
Allow configuring FlashInfer workspace size (#25342)
The FlashInfer workspace buffer size is currently hardcoded, which can cause overflows for some use cases. This commit makes the buffer size configurable via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable.
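For context, a minimal usage sketch of the proposed variable, assuming it is read from the environment at startup; the 512 MiB value and the model name are illustrative choices, not recommendations:

```python
import os

# Hypothetical usage: set the variable before vLLM reads its environment.
# 512 MiB is an arbitrary illustration, not a recommended default.
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(512 * 1024 * 1024)

from vllm import LLM  # noqa: E402

llm = LLM(model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",
          tensor_parallel_size=4)
```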
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI, which starts only a small and essential subset of CI tests to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a way to configure the FlashInfer workspace buffer size via an environment variable, which is a useful enhancement. However, the current implementation has a significant robustness issue: it will crash at import time if the environment variable is set to a non-integer value. This should be handled gracefully, for instance by using a try-except block and falling back to the default value. Additionally, the logic for reading this environment variable is duplicated in three different files. To improve maintainability, this logic should be centralized into a single helper function. I've provided suggestions on the relevant files to address the error handling.
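A minimal sketch of the pattern the review suggests: one centralized helper that parses the variable and falls back gracefully. The function name, logger use, and default value are illustrative, not vLLM's actual code:

```python
import logging
import os

logger = logging.getLogger(__name__)

# 128 MiB, matching the currently hardcoded default.
_DEFAULT_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024


def get_flashinfer_workspace_buffer_size() -> int:
    """Parse VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE, falling back to the
    default instead of crashing at import time on a non-integer value."""
    raw = os.getenv("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE")
    if raw is None:
        return _DEFAULT_WORKSPACE_BUFFER_SIZE
    try:
        return int(raw)
    except ValueError:
        logger.warning(
            "Invalid VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=%r; "
            "falling back to %d bytes.", raw, _DEFAULT_WORKSPACE_BUFFER_SIZE)
        return _DEFAULT_WORKSPACE_BUFFER_SIZE
```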
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Krishna Misra <86200923+mishra-krishna@users.noreply.github.com>
LucasWilkinson left a comment
In what cases does it overflow? We should try to improve the defaults if possible.
I am not sure, but I saw an issue where a user had problems with the defaults. In any case, users should have the freedom to set their own workspace size if necessary; otherwise, it can gracefully use the defaults.
#25342 is the feature request.
I don't have a solid repro, but I had crashes with GLM-4.5-Air or Mistral-Large-123b-2411. I use tensor parallelism = 2 and a 128K context size. When I initially looked into the issue, I saw that SGLang has a larger default and also a configurable env variable, see:
```diff
 from vllm.platforms import current_platform

-FLASHINFER_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024
+FLASHINFER_WORKSPACE_BUFFER_SIZE = int(
```
Please move this to vllm/envs.py
+1
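For reference, a sketch of what registering the variable in vllm/envs.py might look like, assuming envs.py's convention of mapping variable names to lazily evaluated callables; this abbreviated entry is an assumption, not the merged change:

```python
import os

# Abbreviated sketch of a vllm/envs.py entry (the real file defines many
# more variables). Lazy lambdas mean a malformed value only fails when
# the variable is actually read, not at import time.
environment_variables = {
    # ... existing entries ...
    "VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE":
    lambda: int(
        os.getenv("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE",
                  str(128 * 1024 * 1024))),
}
```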
@LucasWilkinson, I'm currently getting this when using … I think this may be related to #25444, where we now capture both FULL_AND_PIECEWISE CUDA graphs. When launching vLLM, I get past …
Increasing the buffer size has resolved this issue, as expected. Would definitely be nice to have better defaults, as you suggested, and/or have this configurable!
I can reliably reproduce a …
We could also consistently crash vLLM running Qwen3-235B-A22B-Instruct-2507-FP8 with tensor parallelism 4 on B200s, but like the above, increasing the workspace size fixed the issue. It would be great to get this merged soon in order to unblock B200 deployments.
cc @Victor49152 We can consistently reproduce the same problem for Qwen3-VL-235B-A22B with DP + EP on an 8xB200 node.
FYI, I was able to consistently reproduce this issue with v0.11.0 on B200 for multiple Qwen3 models, but it no longer occurs for me in the current nightly version (Oct 29).
Not sure if it's relevant, but this issue isn't triggered with piecewise compilation (#27114). However, with a smaller …
I am still hitting this with FULL and FULL_AND_PIECEWISE. Using PIECEWISE or NONE works. I am using the nightly with Qwen3-VL-235B with FP8 on 2xB200 / 2xH200 with TP2.
I see. Yes, we are using PIECEWISE.
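For anyone blocked before a fix lands, a hedged sketch of the workaround discussed above: forcing PIECEWISE CUDA graphs through the compilation config. The cudagraph_mode key and the model name are assumptions for illustration, not verified against this PR:

```python
from vllm import LLM

# Workaround sketch based on the reports above: PIECEWISE (or NONE)
# avoids the overflow, while FULL and FULL_AND_PIECEWISE trigger it.
# The "cudagraph_mode" key is an assumption about the compilation
# config schema; adjust to your vLLM version.
llm = LLM(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    tensor_parallel_size=2,
    compilation_config={"cudagraph_mode": "PIECEWISE"},
)
```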
At this moment, what is blocking this relatively straightforward PR from merging? I see that @mishra-krishna submitted this PR on Sept 21. @pavanimajety approved it on Oct 3. Today is Nov 6. Our engineers are doing custom builds/hacking to get around this problem. It would be very nice if people didn't have to do that.
I have a problem running GLM-4.6 with tool calls: it fails due to a too-small workspace. I'm not even sure the FLASHINFER_WORKSPACE_BUFFER_SIZE env variable is a good solution. Shouldn't this buffer be sized automatically for specific workloads? Why should users have to maintain this parameter at all? In any case, we need at least a merge ASAP.
This pull request has merge conflicts that must be resolved before it can be merged.
Now that #25344 is merged, this can be closed as a duplicate.