[Feature] Allow configuring FlashInfer workspace size #28269
Conversation
Signed-off-by: Max Hu <hyoung2991@gmail.com>
Code Review
This pull request introduces a configurable workspace size for FlashInfer via the VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE environment variable. This is a valuable change as it centralizes configuration and removes hardcoded constants from multiple files, improving maintainability. The implementation correctly replaces the previous constants with the new environment variable. I have one suggestion to further improve maintainability by removing a magic number used for the default value.
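As a rough illustration of the reviewer's suggestion, here is a minimal sketch of an env-var-backed setting with a named default instead of an inline magic number. This is an assumption about shape only: vLLM's actual `envs.py` uses its own registry, and the helper names below are illustrative.

```python
import os

# Illustrative sketch only, not the actual vllm/envs.py implementation:
# give the default a name rather than inlining a magic number.
_DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE = 394 * 1024 * 1024  # bytes

# Read the override from the environment, falling back to the named default.
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = int(
    os.getenv(
        "VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE",
        str(_DEFAULT_FLASHINFER_WORKSPACE_BUFFER_SIZE),
    )
)
```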
💡 Codex Review
Here are some automated review suggestions for this pull request.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Max Hu <hyoung2991@gmail.com>
Signed-off-by: Max Hu <hyoung2991@gmail.com>
@maxyanghu Could you please post the evals for a model that needs a size higher than the default, using the new env var?
benchislett left a comment
This looks good to me overall. I do think the best approach here is to determine this limit dynamically based on max batch size and whichever other parameters actually affect the required workspace buffer size. FlashInfer should probably implement this API and then expose it to vLLM. In the meantime, I welcome this patch.
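As a hedged sketch of that idea, a dynamic computation might look like the following. The cost model here is entirely made up for illustration; FlashInfer exposes no such API today, which is the reviewer's point.

```python
# Hypothetical sketch only: derive the workspace size from the parameters
# that actually drive it, instead of using a fixed constant. The per-request
# and base costs below are placeholder numbers, not FlashInfer's real
# requirements.
def estimate_flashinfer_workspace_bytes(
    max_batch_size: int,
    per_request_bytes: int = 192 * 1024,   # assumed per-request cost
    base_bytes: int = 128 * 1024 * 1024,   # assumed fixed overhead
) -> int:
    return base_bytes + max_batch_size * per_request_bytes
```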
Signed-off-by: Max Hu <hyoung2991@gmail.com>
If you are referring to the buffer size: the only model I've been running is RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4, and I'm customizing the CUDA graph capture size to a maximum of 8192. The buffer size needed is 6 * 256 * 1024 * 1024 bytes.
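For reference, a sketch of setting that value from Python before vLLM reads its environment. The model name and size are taken from the comment above; whether the offline `LLM` entry point matches your setup is an assumption.

```python
import os

# Values from the discussion above. This must run before vLLM reads its
# environment variables.
os.environ["VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE"] = str(6 * 256 * 1024 * 1024)

from vllm import LLM  # noqa: E402

llm = LLM(model="RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4")
```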
benchislett left a comment
LGTM, Thanks!
This https://huggingface.co/mratsim/Wayfarer-Large-70B-NVFP4 needs it, and so does GLM-4.5-Air quantized with the following recipe:

```yaml
default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        group_0:
          targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            block_structure: null
            dynamic: false
            actorder: null
            observer: mse
            observer_kwargs: {}
          input_activations: null
          output_activations: null
          format: null
      targets: ['re:.*mlp\.experts\.[0-9]+\.(down|gate|up)_proj$']
      ignore: []
      mappings:
      - smooth_layer: re:.*post_attention_layernorm$
        balance_layers: ['re:.*gate_proj$', 're:.*up_proj$']
      - smooth_layer: re:.*up_proj$
        balance_layers: ['re:.*down_proj$']
      duo_scaling: true
```
```python
VLLM_USE_FLASHINFER_MOE_FP8: bool = False
VLLM_USE_FLASHINFER_MOE_FP4: bool = False
VLLM_FLASHINFER_MOE_BACKEND: Literal["throughput", "latency"] = "latency"
VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024
```
394 is a strange default. I assume you meant 384, which is 128+256.
Suggested change:

```diff
-VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 394 * 1024 * 1024
+VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE: int = 384 * 1024 * 1024
```
This is directly copied from here, a known use case.
And the SGLang link mentioned, https://github.com/sgl-project/sglang/blob/766392c6bda2558b61ce6d1c1bfd8081a549e1f1/python/sglang/global_config.py#L37, uses 384, not 394.
@mratsim Hi, yeah I saw the link but I'd rather keep it unchanged as I don't know the exact amount of memory it requires.
@mratsim The point of this PR is so that you could set the value of VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE for your specific use case, which is often not the same as the default.
My point is that if we are copying the SGLang default, we should copy it properly, i.e. 384, not a strange 394 that will waste space in the allocator because it is not a power of two or a sum of a few powers of two:
- 384 = 2⁸ + 2⁷
- 394 = 2⁸ + 2⁷ + 2³ + 2¹
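A quick check of those decompositions (values in MiB):

```python
# Verify the power-of-two decompositions quoted above.
def powers_of_two(n: int) -> list[int]:
    return [1 << i for i in range(n.bit_length()) if n & (1 << i)]

assert powers_of_two(384) == [128, 256]        # 2**7 + 2**8
assert powers_of_two(394) == [2, 8, 128, 256]  # 2**1 + 2**3 + 2**7 + 2**8
```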
Signed-off-by: Max Hu <hyoung2991@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
[Feature] Allow configuring FlashInfer workspace size (#28269) Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
[Feature] Allow configuring FlashInfer workspace size (#28269) Signed-off-by: Max Hu <hyoung2991@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: George D. Torres <gdavtor@gmail.com>

Resubmission of #25344
Add `VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE` environment variable to configure FlashInfer workspace size.
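For illustration, a hedged sketch of how a backend could consume the new variable when allocating its workspace. The attribute-style access matches vLLM's `envs` module, but the allocation code is simplified, not the actual FlashInfer backend:

```python
import torch

import vllm.envs as envs

# Simplified sketch: allocate the FlashInfer workspace buffer from the
# configured size. Real backends cache and reuse this buffer across layers.
workspace_buffer = torch.zeros(
    envs.VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE,
    dtype=torch.uint8,
    device="cuda",
)
```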