
Conversation

@mishra-krishna

The FlashInfer workspace buffer size is currently hardcoded, which can cause overflows for some use cases.
This commit makes the buffer size configurable via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Sep 21, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a way to configure the FlashInfer workspace buffer size via an environment variable, which is a useful enhancement. However, the current implementation has a significant robustness issue: it will crash at import time if the environment variable is set to a non-integer value. This should be handled gracefully, for instance by using a try-except block and falling back to the default value. Additionally, the logic for reading this environment variable is duplicated in three different files. To improve maintainability, this logic should be centralized into a single helper function. I've provided suggestions on the relevant files to address the error handling.
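For illustration, here is a minimal sketch of the kind of centralized, fault-tolerant parsing the review suggests; the helper name and its placement are hypothetical and not taken from this PR's diff:

import logging
import os

logger = logging.getLogger(__name__)

# 128 MiB, the hardcoded default that this PR makes configurable.
_DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024


def get_flashinfer_workspace_buffer_size() -> int:
    # Read the override once; fall back to the default instead of crashing
    # at import time when the value is missing or not an integer.
    raw = os.environ.get("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE")
    if raw is None:
        return _DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE
    try:
        return int(raw)
    except ValueError:
        logger.warning(
            "Invalid VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=%r; falling back "
            "to %d bytes.", raw, _DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE)
        return _DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE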

Collaborator

@LucasWilkinson LucasWilkinson left a comment


In what cases does it overflow? We should try to improve the defaults if possible.

@mishra-krishna
Author

In what cases does it overflow? We should try to improve the defaults if possible.

I am not sure, but I saw an issue where a user was having problems with the defaults. In any case, users should have the freedom to set their own workspace size if necessary, and if they don't, it can gracefully fall back to the defaults.

@mishra-krishna
Author

In what cases does it overflow? We should try to improve the defaults if possible.

This is the feature request: #25342

@mratsim

mratsim commented Sep 22, 2025

I don't have a solid repro, but I had crashes with GLM-4.5-Air or Mistral-Large-123b-2411.

I use tensor parallelism = 2 and 128K context size.

When I initially looked into the issue, I saw that SGLang has a larger default and also a configurable ENV variable, see:

(The inline review comments below refer to this diff excerpt:)

from vllm.platforms import current_platform

FLASHINFER_WORKSPACE_BUFFER_SIZE = 128 * 1024 * 1024
FLASHINFER_WORKSPACE_BUFFER_SIZE = int(
Collaborator


Please move this to vllm/envs.py

Collaborator


+1
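As a rough sketch of what these reviewers are asking for, assuming vllm/envs.py's usual pattern of lazily evaluated getters (the exact layout of that file may differ), the knob could be registered once and read everywhere else through envs:

import os

# Sketch only: register the variable alongside the other VLLM_* knobs so
# every attention backend reads the same parsed value instead of duplicating
# the os.environ handling in three files, as the earlier review pointed out.
environment_variables = {
    "VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE":
    lambda: int(
        os.getenv("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE",
                  str(128 * 1024 * 1024))),
}

# Call sites would then use envs.VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE.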

@frankwang28
Contributor

frankwang28 commented Oct 3, 2025

In what cases does it overflow? We should try to improve the defaults if possible.

@LucasWilkinson, I'm currently getting this when using vllm serve Qwen/Qwen3-32B-FP8 on a B200.

I think this may be related to #25444, where we now capture both FULL_AND_PIECEWISE CUDA graphs.

When launching vLLM, I get past Capturing CUDA graphs (mixed prefill-decode, PIECEWISE) fine, but Capturing CUDA graphs (decode, FULL) ends up causing:

RuntimeError: Error in function 'aligned_alloc' at /home/frank.wang2/vllm/.venv/lib/python3.12/site-packages/flashinfer/data/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_s with size 3141632 and alignment 16, but only 524288 bytes available in AlignedAllocator. Increase the workspace buffer size.

Increasing the buffer size has resolved this issue, as expected.

Would definitely be nice to have better defaults, as you suggested, and/or have this configurable!
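For reference, a hedged sketch of the workaround reporters here describe, assuming a build that honors this PR's override; the 512 MiB value, the model, and the explicit backend selection are illustrative only:

import os

# Set the override before vLLM is imported; per the review above, the value
# is parsed at import time, so it must be in the environment first.
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(512 * 1024 * 1024)
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM

llm = LLM(model="Qwen/Qwen3-32B-FP8")  # example model from the report above
print(llm.generate("Hello")[0].outputs[0].text)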

@mratsim

mratsim commented Oct 23, 2025

I can reliably reproduce a FLASHINFER_WORKSPACE_BUFFER_SIZE crash with Llama3.3 quantized to NVFP4 without tensor parallelism, for example https://huggingface.co/mratsim/Wayfarer-Large-70B-NVFP4.
It disappears when I either use 2 GPUs or switch the buffer size to 512 * 1024 * 1024

@remusao

remusao commented Oct 23, 2025

We could also consistently crash vLLM running Qwen3-235B-A22B-Instruct-2507-FP8 with tensor parallelism 4 on B200s, but like the above, increasing the workspace size fixed the issue. It would be great to get this merged soon in order to unblock B200 deployments.

@wangshangsam
Collaborator

cc @Victor49152

We can consistently reproduce the same problem for Qwen3-VL-235B-A22B with DP + EP on an 8xB200 node.
It would be nice if we could prioritize this PR and get it merged.

@0xjunhao
Contributor

0xjunhao commented Oct 29, 2025

FYI, I was able to consistently reproduce this issue with v0.11.0 on B200 for multiple Qwen3 models, but it no longer occurs for me in the current nightly version (Oct 29).

@soodoshll
Contributor

Not sure if it's relevant, but this issue isn't triggered with piecewise compilation (#27114). However, with a smaller max_model_len that uses full graphs, the problem appears again.

@0xbe7a

0xbe7a commented Oct 31, 2025

FYI, I was able to consistently reproduce this issue with v0.11.0 on B200 for multiple Qwen3 models, but it no longer occurs for me in the current nightly version (Oct 29).

I am still hitting this with FULL and FULL_AND_PIECEWISE. Using PIECEWISE or NONE works. I am using the nightly with Qwen3-VL-235B with FP8 on 2xB200 / 2xH200 with TP2.

@0xjunhao
Contributor

FYI, I was able to consistently reproduce this issue with v0.11.0 on B200 for multiple Qwen3 models, but it no longer occurs for me in the current nightly version (Oct 29).

I am still hitting this with FULL and FULL_AND_PIECEWISE. Using PIECEWISE or NONE works. I am using the nightly with Qwen3-VL-235B with FP8 on 2xB200 / 2xH200 with TP2.

I see. Yes, we are using PIECEWISE.

@wangshangsam
Collaborator

At this moment, what is blocking this relatively straightforward PR from merging? I see that @mishra-krishna submitted this PR on Sept 21. @pavanimajety approved it on Oct 3. Today is Nov 6.

Our engineers are doing custom builds/hacking to get around this problem. It would be very nice if people didn't have to do that.

@voipmonitor

I have a problem running GLM-4.6 with tool calls - it's failing due to a too-small workspace. I'm not even sure the FLASHINFER_WORKSPACE_BUFFER_SIZE env variable is a good solution - shouldn't this buffer automatically be sized larger for specific workloads? Why should users have to maintain this parameter at all? In any case, we need this merged ASAP at the very least.

@mergify

mergify bot commented Nov 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mishra-krishna.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 11, 2025
@mratsim

mratsim commented Nov 13, 2025

Now that #25344 is merged this can be closed as duplicate

@wangshangsam
Collaborator

Now that #25344 is merged this can be closed as duplicate

I believe you meant #28269 . I'm closing this PR now.

