[Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes #27439
Conversation
Force-pushed from b33adab to cd262df
💡 Codex Review
Lines 1171 to 1180 in 34d8036
```python
# Flashinfer MoE backend for vLLM's fused Mixture-of-Experts support.
# Both require compute capability 10.0 or above.
# Available options:
# - "throughput": [default]
#   Uses CUTLASS kernels optimized for high-throughput batch inference.
# - "latency":
#   Uses TensorRT-LLM kernels optimized for low-latency inference.
"VLLM_FLASHINFER_MOE_BACKEND": env_with_choices(
    "VLLM_FLASHINFER_MOE_BACKEND", "throughput", ["throughput", "latency"]
),
```
The commit message claims to switch the default FlashInfer MoE backend to the latency‑optimised kernels, but the actual environment configuration still declares env_with_choices("VLLM_FLASHINFER_MOE_BACKEND", "throughput", ...). Only the type hint at the top of the file was changed, so the runtime default remains "throughput" and the code never adopts the intended latency backend unless the user sets the variable manually.
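To illustrate the point, here is a minimal, self-contained sketch (an assumption of how an env_with_choices-style helper behaves, not vLLM's actual implementation): the default passed at registration time is what the runtime sees, so changing only a type hint elsewhere in the file leaves the effective default at "throughput".

```python
import os

# Simplified stand-in for an env_with_choices-style helper (assumption: the real
# vLLM helper validates against `choices` and falls back to `default` similarly).
def env_with_choices(name: str, default: str, choices: list[str]):
    def _read() -> str:
        value = os.environ.get(name, default)
        if value not in choices:
            raise ValueError(f"{name} must be one of {choices}, got {value!r}")
        return value
    return _read

# The runtime default is whatever is registered here; a type annotation alone
# does not change it.
VLLM_FLASHINFER_MOE_BACKEND = env_with_choices(
    "VLLM_FLASHINFER_MOE_BACKEND", "throughput", ["throughput", "latency"]
)

print(VLLM_FLASHINFER_MOE_BACKEND())  # -> "throughput" unless the env var is set
```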
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Force-pushed from 1fb4b91 to 0647d1e
bnellnm left a comment
LGTM!
```c++
namespace vllm {

#define round_up(x, y) ((x + y - 1) / y * y)
```
nit: macro name convention is ALL_CAPS_WITH_UNDERSCORE. But better yet, don't use a macro and use

```c++
template <typename Int>
__host__ __device__ static Int round_up(Int x, Int y) {
  static_assert(std::is_integral_v<Int>, "round_up argument must be integral type");
  return (x + y - 1) / y * y;
}
```
Thank you, fixed in 0c22d3c
Force-pushed from 0ff1f63 to 0c22d3c
The failed lm-eval-small-models test passes locally (log for test_gsm8k_correctness attached).
`test_response_api_mcp_tools.py::test_mcp_tool_env_flag_enabled[openai/gpt-oss-20b]`
Test failures are unrelated to this PR.
...del_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a4_nvfp4.py (outdated, resolved)
| "VLLM_FLASHINFER_MOE_BACKEND": env_with_choices( | ||
| "VLLM_FLASHINFER_MOE_BACKEND", "throughput", ["throughput", "latency"] | ||
| "VLLM_FLASHINFER_MOE_BACKEND", "latency", ["throughput", "latency"] | ||
| ), |
Is it the case that both FP8 and NVFP4 throughput won't be affected by this? I see you tested NVFP4, but this will affect several quantized MoE cases.
That's right, we see good perf with the TRTLLM kernels across the board. We also have flashinfer-ai/flashinfer#2014 ("[feat] Refactor trtllmgen MOE and add Bf16 trtllmgen moe") from @jiahanc that closes more gaps. Having this as the default also enables Expert Parallel in the default path for FP8, in addition to the performance gains.
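For deployments that prefer the previous throughput-oriented behaviour after this change, a hedged sketch of pinning the backend explicitly; the env var name comes from this PR, while the model choice and the import ordering are illustrative assumptions:

```python
import os

# Pin the FlashInfer MoE backend instead of relying on the new "latency" default.
# The override must be in the environment before vLLM reads its configuration.
os.environ["VLLM_FLASHINFER_MOE_BACKEND"] = "throughput"

from vllm import LLM  # imported after the override on purpose

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2")  # illustrative model from the test plan
```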
Force-pushed from 26f3005 to 903430c
Purpose
This PR switches the default FlashInfer MoE backend to the TRTLLM MoE kernels, which are optimized for latency scenarios.
Additionally, it addresses a few more issues:
Fixes: #26070
Test Plan
Test Nemotron for accuracy and Llama 3 70B FP4 for accuracy.
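As a rough illustration only (the exact prompts, sampling settings, and the FP4 checkpoint used in the runs below are not part of this PR description), such an accuracy spot check could be driven through the offline LLM API:

```python
from vllm import LLM, SamplingParams

# Illustrative setup; the Llama 3 70B FP4 result below used TP=2, with the
# checkpoint path omitted here because it is not named in the PR.
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # model named in the test plan
    tensor_parallel_size=1,
)

params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = [
    "Q: A robe takes 2 bolts of blue fiber and half that much white fiber. "
    "How many bolts in total does it take? A:",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```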
Test Result
nvidia/NVIDIA-Nemotron-Nano-9B-v2:
Original:
Results with PR (zero-initialized in the kernel):
Llama 70B FP4 + TP2:
Original (torch.zeros):
With PR:
Performance
Server:
With zeros + TOT main:
With torch.empty + TOT main:
PR with fix: