[BugFix] Fix PP/async scheduling with pooling models #28899
Conversation
Signed-off-by: Nick Hill <nhill@redhat.com>
Code Review
This pull request addresses a bug related to pipeline parallelism and asynchronous scheduling for pooling models. The changes correctly prevent pooling models from going through the token sampling pipeline, which is intended only for generative models. The introduction of the is_pooling_model flag in vllm/v1/engine/core.py and the uses_sampler flag in vllm/v1/executor/ray_executor.py makes the code's intent clearer and fixes the incorrect behavior. The changes are logical, well-targeted, and appear to resolve the issue effectively. I have no further suggestions.
The two test failures here are unrelated.
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Broken by #28768. This wasn't exercised by the CI that ran on that PR.