[BugFix] Fix the issue where there is no parallelism in PP mode #28286
Conversation
Code Review
This pull request introduces a throttling mechanism for scheduling new requests in pipeline parallelism (PP) mode. The goal is to prevent the pipeline from stalling by not exhausting the waiting queue at once, thus maintaining a backlog of requests. The logic seems sound for this purpose. However, I've identified a potential critical issue where a `ZeroDivisionError` could occur if `pipeline_parallel_size` is misconfigured to be zero. I've provided a suggestion to make the code more robust by adding a check for this.
```python
if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
    1,
    self.max_num_running_reqs
    // self.parallel_config.pipeline_parallel_size,
):
    break
```
The division by `self.parallel_config.pipeline_parallel_size` on line 361 assumes this value is non-zero. If it were misconfigured to be 0, this would lead to a `ZeroDivisionError`, causing the scheduler to crash. While `pipeline_parallel_size` defaults to 1, adding a check to ensure it's positive would make the code more robust against configuration errors. The suggested change also improves readability by extracting the limit into a variable.
```diff
-if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
-    1,
-    self.max_num_running_reqs
-    // self.parallel_config.pipeline_parallel_size,
-):
-    break
+pp_size = self.parallel_config.pipeline_parallel_size
+if pp_size <= 0:
+    raise ValueError(
+        "pipeline_parallel_size must be positive, but is "
+        f"{pp_size}")
+limit = max(1, self.max_num_running_reqs // pp_size)
+if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= limit:
+    break
```
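For illustration, a standalone sketch of the limit arithmetic in the suggestion above (the `per_step_limit` helper is hypothetical, not part of vLLM):

```python
def per_step_limit(max_num_running_reqs: int, pp_size: int) -> int:
    """Per-step admission cap: the running-request budget split across PP stages."""
    if pp_size <= 0:
        raise ValueError(
            f"pipeline_parallel_size must be positive, but is {pp_size}")
    return max(1, max_num_running_reqs // pp_size)

assert per_step_limit(8, 8) == 1     # one request admitted per step at PP=8
assert per_step_limit(256, 4) == 64  # larger budgets split evenly across stages
assert per_step_limit(3, 8) == 1     # floor division is clamped to at least 1
```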
Force-pushed from 5cd2377 to 7a26c8e.
Thanks @weireweire. I've made a complete fix in #28768. However, this doesn't include your change to the scheduler:

If I understand correctly, this is an orthogonal optimization? Perhaps you could open another PR with just that change for consideration?
@njhill yes, it's orthogonal. I'll revert the overlap fix and keep only the "run out of requests" fix here.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 7a26c8e to c51d37d.
@WoosukKwon could you help review?
… in ray backend. Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>
Force-pushed from c51d37d to e8a900b.
Purpose
- In PP mode the scheduler admits requests up to `--max-num-seqs` all at once, which leaves the later steps with no requests to run, just waiting.
- Tuning `--max-num-batched-tokens` can avoid reaching `--max-num-seqs` in one step, but that only works for prefill.
- This PR caps each scheduling step at `--max-num-seqs / PP_size` to split requests into micro-batches (see the sketch after this list).
- It also fixes undesired blocking in the engine core, which restores the missing parallelism for PP in the mp backend.
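As a minimal sketch, here is the per-step cap for the configuration used in the test plan below (`--max-num-seqs 8`, PP=8); variable names mirror the scheduler diff above:

```python
# Values from the test plan: --max-num-seqs 8, --pipeline-parallel-size 8.
max_num_running_reqs = 8
pipeline_parallel_size = 8

# Each schedule step admits at most this many new/resumed requests, so the
# waiting queue drains gradually and every PP stage keeps a micro-batch busy.
per_step_cap = max(1, max_num_running_reqs // pipeline_parallel_size)
print(per_step_cap)  # 1 -> one request per step, up to 8 micro-batches in flight
```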
Test Plan

I tested on SM120 with PP8; the command is:
```
vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --host 0.0.0.0 --port 8000 --pipeline-parallel-size 8 --tensor-parallel-size 1 --max-num-seqs 8 --max-cudagraph-capture-size 8 --max-model-len 4042 --max-num-batched-tokens 32000 --enable-chunked-prefill --kv-cache-dtype auto --gpu-memory-utilization 0.85 --no-enable-prefix-caching
```

And use `--distributed-executor-backend` to choose the mp or ray backend.

Bench command:

```
vllm bench serve --model nvidia/DeepSeek-R1-0528-FP4-v2 --host 127.0.0.1 --port 8000 --dataset-name random --max-concurrency 8 --num-prompts 256 --random-input-len 4000 --random-output-len 32 --random-range-ratio 0 --num-warmups 20
```

Test Result
mp backend, before this PR:
mp backend, after this PR:
ray backend, before this PR:
ray backend, after this PR: