
Conversation

@maxyanghu (Contributor) commented Nov 7, 2025

Resubmission of #25344

Add the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable to configure the FlashInfer workspace buffer size.
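
For illustration, a minimal sketch, not the PR's actual implementation, of how a byte-count override like this can be read from the environment; the 394 MiB fallback is the default quoted later in this thread:

```python
import os

# Minimal sketch (not vLLM's envs.py): read the workspace size from the
# environment, falling back to the 394 MiB default discussed below.
DEFAULT_WORKSPACE_BYTES = 394 * 1024 * 1024

workspace_bytes = int(
    os.environ.get("VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE", DEFAULT_WORKSPACE_BYTES)
)
print(f"FlashInfer workspace buffer: {workspace_bytes // (1024 * 1024)} MiB")
```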

Signed-off-by: Max Hu <hyoung2991@gmail.com>
mergify bot added the v1 label Nov 7, 2025
@gemini-code-assist bot left a comment:

Code Review

This pull request introduces a configurable workspace size for FlashInfer via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable. This is a valuable change as it centralizes configuration and removes hardcoded constants from multiple files, improving maintainability. The implementation correctly replaces the previous constants with the new environment variable. I have one suggestion to further improve maintainability by removing a magic number used for the default value.

@chatgpt-codex-connector bot left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.


maxyanghu and others added 2 commits November 7, 2025 00:40
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
@pavanimajety (Collaborator) commented:
@maxyanghu Could you please post the evals for a model that would need a higher-than-default size, using the new env var?

@benchislett (Collaborator) left a review comment:

This looks good to me overall. I do think the best approach here is to determine this limit dynamically based on max batch size and whichever other parameters actually affect the required workspace buffer size. FlashInfer should probably implement this API and then expose it to vLLM. In the meantime, I welcome this patch.

Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
@maxyanghu (Contributor, Author) commented Nov 7, 2025

@maxyanghu Could you please post the evals for a model that would need a higher-than-default size, using the new env var?

If you are referring to the buffer size: the only model I've been running is RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4, and I'm customizing the CUDA graph capture size up to the maximum of 8192. The buffer size needed is 6 * 256 * 1024 * 1024.
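
For clarity, 6 * 256 * 1024 * 1024 bytes is 1536 MiB. A minimal usage sketch, assuming the variable is parsed as a raw byte count (as the defaults quoted later in the thread suggest):

```python
import os

# Hypothetical usage: export the size cited above before vLLM starts in
# this process. 6 * 256 MiB = 1536 MiB.
buffer_bytes = 6 * 256 * 1024 * 1024
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(buffer_bytes)
print(buffer_bytes, "bytes =", buffer_bytes // (1024 * 1024), "MiB")
```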

@benchislett (Collaborator) left a review comment:

LGTM, Thanks!

@mratsim commented Nov 7, 2025

This model, https://huggingface.co/mratsim/Wayfarer-Large-70B-NVFP4, needs 512 * 1024 * 1024 or it reliably crashes on startup.

GLM-4.5-Air quantized with the following recipe needs 768 * 1024 * 1024 (both overrides are sketched after the recipe below):

default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        group_0:
          targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            block_structure: null
            dynamic: false
            actorder: null
            observer: mse
            observer_kwargs: {}
          input_activations: null
          output_activations: null
          format: null
      targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
      ignore: []
      mappings:
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
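
A hedged sketch of the per-model overrides reported above, assuming the variable takes a raw byte count; the model names and sizes are the ones cited in this comment:

```python
# Hypothetical per-model overrides, based only on the sizes reported above.
overrides = {
    "mratsim/Wayfarer-Large-70B-NVFP4": 512 * 1024 * 1024,
    "GLM-4.5-Air (AWQ recipe above)": 768 * 1024 * 1024,
}
for model, size in overrides.items():
    mib = size // (1024 * 1024)
    print(f"{model}: VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE={size} ({mib} MiB)")
```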

VLLM_USE_FLASHINFER_MOE_FP8: bool = False
VLLM_USE_FLASHINFER_MOE_FP4: bool = False
VLLM_FLASHINFER_MOE_BACKEND: Literal["throughput", "latency"] = "latency"
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024

@mratsim commented on the defaults quoted above:

394 is a strange default.

I assume you meant 384, which is 128 + 256.

Suggested change:
- VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024
+ VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 384 * 1024 * 1024

@maxyanghu (Contributor, Author) replied:
This is directly copied from here, a known use case.

@mratsim replied with a link.

@maxyanghu (Contributor, Author) replied:
@mratsim Hi, yeah I saw the link but I'd rather keep it unchanged as I don't know the exact amount of memory it requires.

A collaborator replied:
@mratsim The point of this PR is so that you could set the value of VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE for your specific use case, which is often not the same as the default.

@mratsim replied:
My point is that if we are copying the SGLang default, we should copy it properly, i.e. 384, not a strange 394 that will waste space in the allocator because it is not close to a power of 2 or a sum of a couple of powers of 2 (checked in the snippet below):

  • 384 = 2⁸ + 2⁷
  • 394 = 2⁸ + 2⁷ + 2³ + 2
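
A quick check of the arithmetic above:

```python
# Verify the power-of-two decompositions cited above.
assert 384 == 2**8 + 2**7
assert 394 == 2**8 + 2**7 + 2**3 + 2**1

# 384 MiB is an exact multiple of 128 MiB; 394 MiB is not.
print(384 % 128, 394 % 128)  # -> 0 10
```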

@mgoin added the ready and nvidia labels Nov 8, 2025
Signed-off-by: Max Hu <hyoung2991@gmail.com>
@mergify bot commented Nov 10, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @maxyanghu.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Nov 10, 2025
mergify bot removed the needs-rebase label Nov 10, 2025
@pavanimajety enabled auto-merge (squash) November 11, 2025 19:42
@pavanimajety merged commit 412e153 into vllm-project:main Nov 11, 2025
55 checks passed
github-project-automation bot moved this to Done in NVIDIA Nov 11, 2025
fangyuchu pushed a commit to fangyuchu/vllm that referenced this pull request Nov 12, 2025
…8269)

Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
…8269)

Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: George D. Torres <gdavtor@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…8269)

Signed-off-by: Max Hu <hyoung2991@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

nvidia, ready, v1

Projects

Status: Done

6 participants