add support for --fully-sharded-loras in fused_moe #28761
Conversation
Signed-off-by: gnovack <gnovack@amazon.com>
Code Review
This pull request adds support for fully sharded LoRA adapters in FusedMoE, which is a great feature for reducing memory usage with tensor parallelism. The changes in the test files and Punica wrappers look good. However, I've found a critical issue in the implementation within vllm/lora/layers/fused_moe.py that will lead to incorrect results when using fully sharded LoRAs with tensor parallelism. The logic for aggregating the partial LoRA results for the w2 layer is missing an all-reduce operation, and it also attempts to add to an uninitialized tensor. I've provided a detailed comment and a suggested fix for this issue.
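For reference, the kind of fix being suggested could look roughly like the following. This is a minimal sketch, not the actual code in vllm/lora/layers/fused_moe.py; the tensor names are illustrative, and it assumes vLLM's `tensor_model_parallel_all_reduce` collective.

```python
# Minimal sketch of the suggested aggregation fix (illustrative names only,
# not the actual implementation in vllm/lora/layers/fused_moe.py).
import torch
from vllm.distributed import tensor_model_parallel_all_reduce


def aggregate_w2_lora_output(partial_lora_out: torch.Tensor,
                             layer_output: torch.Tensor) -> torch.Tensor:
    # With fully sharded LoRA + TP, each rank only computes a partial sum of
    # the w2 LoRA contribution, so the partials must be summed across ranks
    # before they are added to the layer output.
    lora_out = tensor_model_parallel_all_reduce(partial_lora_out)
    # Accumulate into a tensor that is already initialized (the layer output),
    # rather than adding into an uninitialized buffer.
    return layer_output + lora_out
```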
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
        a_intermediate_cache1
    )
else:
    a_intermediate_cache1 = tensor_model_parallel_all_gather(
```
QQ: Is PDL compatible with fully_sharded?
I'll test this out today and let you know. My assumption is that PDL probably won't work here due to the collective comms between shrink and expand, but I will confirm this.
From what I can tell based on the profiling results, it seems that PDL does not provide a benefit when fully_sharded is enabled, as the communication op between shrink and expand is not overlapped with either one.
However, even when fully_sharded is disabled, it does not look like the MoE shrink/expand calls are getting overlapped (this seems to be because of the torch.zeros call before the expand kernel).
Let me know if there is some other/better test I can run to see whether PDL is working as expected before/after my changes.
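For what it's worth, one generic way to check this is to capture a trace with torch.profiler and look at whether the expand kernel launches before the preceding shrink kernel or collective finishes. The sketch below is not a vLLM-specific test; `run_moe_lora_forward` is a placeholder for whatever forward pass is being profiled.

```python
# Generic profiling sketch for checking shrink/expand overlap under PDL.
# `run_moe_lora_forward` is a placeholder callable, not a vLLM utility.
import torch
from torch.profiler import profile, ProfilerActivity


def capture_overlap_trace(run_moe_lora_forward, out_path="moe_lora_trace.json"):
    # Warm up so autotuning/compilation does not pollute the trace.
    for _ in range(3):
        run_moe_lora_forward()
    torch.cuda.synchronize()

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        run_moe_lora_forward()
        torch.cuda.synchronize()

    # Open the exported timeline (e.g. in Perfetto) and check whether the
    # expand kernel starts before the shrink kernel / collective completes.
    prof.export_chrome_trace(out_path)
```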
I've observed similar behavior as well. At least PDL didn't cause any undefined behavior. I'll take some more time to look into it later.
Thanks @jeejeelee!
Signed-off-by: gnovack <gnovack@amazon.com>
```diff
 )
 for i in range(num_slices):
-    output[:, :, i * N : (i + 1) * N] += b_intermediate_cache1[i]
+    output[:, :, i * N + offset : (i + 1) * N + offset] += b_intermediate_cache1[i]
```
BTW, we can fuse this add into the expand kernel, so we don't need to create b_intermediate_cache1 explicitly.
This should save the overhead of 2 kernels (empty and add). We can complete this in a follow-up PR.
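A rough sketch of what that follow-up could look like at the wrapper level; the function and `accumulate` argument are hypothetical placeholders, not the real Triton kernel interface.

```python
# Conceptual sketch only: have the expand kernel accumulate directly into the
# correct slice of `output`, so b_intermediate_cache1 (and the empty + add
# kernels) are no longer needed. `expand_kernel` and its `accumulate` argument
# are placeholders for the real launcher.
import torch


def expand_with_fused_add(expand_kernel, a_cache: torch.Tensor,
                          lora_b_stacked: list[torch.Tensor],
                          output: torch.Tensor, N: int, offset: int) -> None:
    for i, lora_b in enumerate(lora_b_stacked):
        # Instead of writing to an intermediate buffer and then doing
        # `output[..., slice] += buffer`, pass an output view for the kernel
        # to accumulate into in place.
        out_view = output[:, :, i * N + offset : (i + 1) * N + offset]
        expand_kernel(a_cache, lora_b, out_view, accumulate=True)
```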
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Bhagyashri <Bhagyashri.Gaikwad2@ibm.com>
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: LuminolT <lumischen01@gmail.com>
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: gnovack <gnovack@amazon.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Purpose
This PR adds support for `--fully-sharded-loras` for MoE LoRA adapters, allowing S-LoRA style sharding of adapter weights. This reduces the amount of GPU memory required per rank when using LoRA with tensor parallelism.

Example using Qwen3-30B:
Before PR:
After PR:
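For context, enabling the flag together with tensor parallelism looks roughly like this through the offline API. The model name, adapter path, and sizes below are illustrative, not taken from this PR.

```python
# Usage sketch: fully sharded LoRA with tensor parallelism via the offline
# LLM API. Model name, adapter path, and sizes are illustrative.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen3-30B-A3B",      # an MoE model, as in the example above
    enable_lora=True,
    fully_sharded_loras=True,        # the flag this PR extends to FusedMoE
    max_lora_rank=32,
    tensor_parallel_size=4,
)

outputs = llm.generate(
    ["Hello, world"],
    SamplingParams(max_tokens=16),
    lora_request=LoRARequest("moe-adapter", 1, "/path/to/adapter"),
)
print(outputs[0].outputs[0].text)
```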
Test Plan
- Added a `fully_sharded_loras` case to the existing MoE LoRA TP test cases to validate that the model output remains the same when enabling this flag.
- Updated `test_fused_moe_lora_kernel` to validate that kernel outputs match before and after this change (see the sketch below).
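A minimal sketch of the kind of equivalence check described above; the helper names are placeholders, not the actual test utilities in vLLM's test suite.

```python
# Sketch of the output-equivalence check described in the test plan.
# `run_reference` / `run_fully_sharded` are placeholder callables that return
# the kernel or model outputs to compare.
import torch


def assert_outputs_match(run_reference, run_fully_sharded,
                         rtol: float = 1e-3, atol: float = 1e-3) -> None:
    ref = run_reference()
    sharded = run_fully_sharded()
    torch.testing.assert_close(sharded, ref, rtol=rtol, atol=atol)
```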