[Feature] Allow configuring FlashInfer workspace size #28269
Conversation
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Code Review
This pull request introduces a configurable workspace size for FlashInfer via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable. This is a valuable change as it centralizes configuration and removes hardcoded constants from multiple files, improving maintainability. The implementation correctly replaces the previous constants with the new environment variable. I have one suggestion to further improve maintainability by removing a magic number used for the default value.
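As a rough illustration of the reviewer's suggestion, here is a minimal sketch of an env-var-backed setting with a named default instead of an inline magic number. This is an assumption about shape only: vLLM's actual `envs.py` uses its own registry, and the helper names below are illustrative.

```python
import os

# Illustrative sketch only, not the actual vllm/envs.py implementation:
# give the default a name rather than inlining a magic number.
_DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE = 394 * 1024 * 1024  # bytes

# Read the override from the environment, falling back to the named default.
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = int(
    os.getenv(
        "VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE",
        str(_DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE),
    )
)
```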
💡 Codex Review
Here are some automated review suggestions for this pull request.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
@maxyanghu Could you please post the evals for a model that needs a size higher than the default, using the new env var?
benchislett left a comment
This looks good to me overall. I do think the best approach here is to determine this limit dynamically based on max batch size and whichever other parameters actually affect the required workspace buffer size. FlashInfer should probably implement this API and then expose it to vLLM. In the meantime, I welcome this patch.
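As a hedged sketch of that idea, a dynamic computation might look like the following. The cost model here is entirely made up for illustration; FlashInfer exposes no such API today, which is the reviewer's point.

```python
# Hypothetical sketch only: derive the workspace size from the parameters
# that actually drive it, instead of using a fixed constant. The per-request
# and base costs below are placeholder numbers, not FlashInfer's real
# requirements.
def estimate_flashinfer_workspace_bytes(
    max_batch_size: int,
    per_request_bytes: int = 192 * 1024,   # assumed per-request cost
    base_bytes: int = 128 * 1024 * 1024,   # assumed fixed overhead
) -> int:
    return base_bytes + max_batch_size * per_request_bytes
```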
Signed-off-by: Max Hu <hyoung2991@gmail.com>
If you are referring to the buffer size: the only model I've been running is RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4, and I'm customizing the CUDA graph capture size to a maximum of 8192. The buffer size needed is 6 * 256 * 1024 * 1024 bytes.
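For reference, a sketch of setting that value from Python before vLLM reads its environment. The model name and size are taken from the comment above; whether the offline `LLM` entry point matches your setup is an assumption.

```python
import os

# Values from the discussion above. This must run before vLLM reads its
# environment variables.
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(6 * 256 * 1024 * 1024)

from vllm import LLM  # noqa: E402

llm = LLM(model="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4")
```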
benchislett left a comment
LGTM, Thanks!
This https://huggingface.co/mratsim/Wayfarer-Large-70B-NVFP4 needs it, and so does GLM-4.5-Air quantized with the following recipe:

```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        group_0:
          targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            block_structure: null
            dynamic: false
            actorder: null
            observer: mse
            observer_kwargs: {}
          input_activations: null
          output_activations: null
          format: null
      targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
      ignore: []
      mappings:
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
```
```python
VLLM_USE_FLASHINFER_MOE_FP8: bool = False
VLLM_USE_FLASHINFER_MOE_FP4: bool = False
VLLM_FLASHINFER_MOE_BACKEND: Literal["throughput", "latency"] = "latency"
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024
```
394 is a strange default. I assume you meant 384, which is 128+256.
Suggested change:

```diff
-VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024
+VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 384 * 1024 * 1024
```
This is directly copied from here, a known use case.
And the SGLang link mentioned, https://github.com/sgl-project/sglang/blob/766392c6bda2558b61ce6d1c1bfd8081a549e1f1/python/sglang/global_config.py#L37, uses 384, not 394.
@mratsim Hi, yeah I saw the link but I'd rather keep it unchanged as I don't know the exact amount of memory it requires.
@mratsim The point of this PR is so that you could set the value of VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE for your specific use case, which is often not the same as the default.
My point is that if we are copying the SGLang default, we should copy it properly, i.e. 384, not a strange 394 that will waste space in the allocator because it is not a power of two or a sum of a few powers of two:
- 384 = 2⁸ + 2⁷
- 394 = 2⁸ + 2⁷ + 2³ + 2¹
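A quick check of those decompositions (values in MiB):

```python
# Verify the power-of-two decompositions quoted above.
def powers_of_two(n: int) -> list[int]:
    return [1 << i for i in range(n.bit_length()) if n & (1 << i)]

assert powers_of_two(384) == [128, 256]        # 2**7 + 2**8
assert powers_of_two(394) == [2, 8, 128, 256]  # 2**1 + 2**3 + 2**7 + 2**8
```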
Signed-off-by: Max Hu <hyoung2991@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
[Feature] Allow configuring FlashInfer workspace size (#28269) Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
[Feature] Allow configuring FlashInfer workspace size (#28269) Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: George D. Torres <gdavtor@gmail.com>

Resubmission of #25344
Add `VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE` environment variable to configure FlashInfer workspace size.
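For illustration, a hedged sketch of how a backend could consume the new variable when allocating its workspace. The attribute-style access matches vLLM's `envs` module, but the allocation code is simplified, not the actual FlashInfer backend:

```python
import torch

import vllm.envs as envs

# Simplified sketch: allocate the FlashInfer workspace buffer from the
# configured size. Real backends cache and reuse this buffer across layers.
workspace_buffer = torch.zeros(
    envs.VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE,
    dtype=torch.uint8,
    device="cuda",
)
```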