
Conversation

@weireweire (Contributor) commented Nov 7, 2025

Purpose

  • Avoid exhausting the request queue in a single scheduling step in PP mode, which prevents pipeline parallelism from actually running in parallel.
    • For pipeline parallelism, the batch should be split into micro-batches so that every rank has work to do. Currently we don't split: up to --max-num-seqs requests are scheduled all at once, so the later pipeline steps have no requests to run and just sit waiting.
    • A workaround is to use --max-num-batched-tokens to avoid hitting --max-num-seqs in a single step, but that only works for prefill.
    • This fix simply caps the number of sequences issued per step at --max-num-seqs / PP_size, effectively splitting the batch into micro-batches (see the sketch after this list).
  • Fix undesired blocking in the engine core, which broke PP parallelism in the mp backend.
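
For illustration, here is a minimal sketch of the throttling idea (a hypothetical helper, not the actual vLLM code; max_num_seqs and pp_size stand for the values of --max-num-seqs and --pipeline-parallel-size):

def reached_micro_batch_limit(num_scheduled: int,
                              max_num_seqs: int,
                              pp_size: int) -> bool:
    # Cap the number of sequences scheduled in one step at
    # max_num_seqs // pp_size (but at least 1), leaving the rest
    # of the waiting queue for later steps so that every PP rank
    # has its own micro-batch in flight.
    limit = max(1, max_num_seqs // pp_size)
    return num_scheduled >= limit

The scheduler stops admitting new and resumed requests for the current step once this limit is reached.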

Test Plan

I tested on SM120 with PP=8; the command is:

vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --host 0.0.0.0 --port 8000 --pipeline-parallel-size 8 --tensor-parallel-size 1 --max-num-seqs 8 --max-cudagraph-capture-size 8 --max-model-len 4042 --max-num-batched-tokens 32000 --enable-chunked-prefill --kv-cache-dtype auto --gpu-memory-utilization 0.85 --no-enable-prefix-caching

Use --distributed-executor-backend to choose between the mp and ray backends.
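
For example (the serve command from above, abbreviated here with "..." for brevity):

vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 ... --distributed-executor-backend mp   # or: ray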

Benchmark command:
vllm bench serve --model nvidia/DeepSeek-R1-0528-FP4-v2 --host 127.0.0.1 --port 8000 --dataset-name random --max-concurrency 8 --num-prompts 256 --random-input-len 4000 --random-output-len 32 --random-range-ratio 0 --num-warmups 20

Test Result

mp backend, before this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  402.84    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              0.64      
Output token throughput (tok/s):         20.34     
Peak output token throughput (tok/s):    96.00     
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          2561.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          8007.14   
Median TTFT (ms):                        9785.40   
P99 TTFT (ms):                           9817.62   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          147.76    
Median TPOT (ms):                        91.55     
P99 TPOT (ms):                           324.47    
---------------Inter-token Latency----------------
Mean ITL (ms):                           147.76    
Median ITL (ms):                         90.32     
P99 ITL (ms):                            100.31    
==================================================

mp backend, after this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  158.89    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              1.61      
Output token throughput (tok/s):         51.56     
Peak output token throughput (tok/s):    176.00    
Peak concurrent requests:                15.00     
Total Token throughput (tok/s):          6494.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          2015.66   
Median TTFT (ms):                        2041.88   
P99 TTFT (ms):                           3430.10   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          95.11     
Median TPOT (ms):                        94.66     
P99 TPOT (ms):                           131.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           95.11     
Median ITL (ms):                         52.02     
P99 ITL (ms):                            1222.23   
==================================================

ray backend, before this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  412.88    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              0.62      
Output token throughput (tok/s):         19.84     
Peak output token throughput (tok/s):    80.00     
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          2499.34   
---------------Time to First Token----------------
Mean TTFT (ms):                          7172.32   
Median TTFT (ms):                        6685.45   
P99 TTFT (ms):                           9573.30   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          184.80    
Median TPOT (ms):                        184.27    
P99 TPOT (ms):                           265.55    
---------------Inter-token Latency----------------
Mean ITL (ms):                           184.80    
Median ITL (ms):                         109.89    
P99 ITL (ms):                            4795.00   
==================================================

ray backend, after this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  178.30    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              1.44      
Output token throughput (tok/s):         45.95     
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          5787.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          2318.94   
Median TTFT (ms):                        2170.65   
P99 TTFT (ms):                           3270.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.89    
Median TPOT (ms):                        109.21    
P99 TPOT (ms):                           162.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.89    
Median ITL (ms):                         60.14     
P99 ITL (ms):                            1416.10   
==================================================


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a throttling mechanism for scheduling new requests in pipeline parallelism (PP) mode. The goal is to prevent the pipeline from stalling by not exhausting the waiting queue at once, thus maintaining a backlog of requests. The logic seems sound for this purpose. However, I've identified a potential critical issue where a ZeroDivisionError could occur if pipeline_parallel_size is misconfigured to be zero. I've provided a suggestion to make the code more robust by adding a check for this.

Comment on lines +358 to +405
if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
    1,
    self.max_num_running_reqs
    // self.parallel_config.pipeline_parallel_size,
):
    break

Severity: critical

The division by self.parallel_config.pipeline_parallel_size on line 361 assumes this value is non-zero. If it were misconfigured to be 0, this would lead to a ZeroDivisionError, causing the scheduler to crash. While pipeline_parallel_size defaults to 1, adding a check to ensure it's positive would make the code more robust against configuration errors. The suggested change also improves readability by extracting the limit into a variable.

Suggested change
-if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
-    1,
-    self.max_num_running_reqs
-    // self.parallel_config.pipeline_parallel_size,
-):
-    break
+pp_size = self.parallel_config.pipeline_parallel_size
+if pp_size <= 0:
+    raise ValueError(
+        "pipeline_parallel_size must be positive, but is "
+        f"{pp_size}")
+limit = max(1, self.max_num_running_reqs // pp_size)
+if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= limit:
+    break
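
A note on this suggestion: the max(1, ...) floor already guarantees that at least one request is scheduled per step even when max_num_running_reqs < pp_size, so the explicit pp_size check only guards against a misconfigured (non-positive) pipeline_parallel_size; it could equally live in config validation rather than on the scheduling hot path.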

@heheda12345 (Collaborator) commented:

CC @njhill as it may be related to async scheduling as mentioned here #26866

@njhill (Member) commented Nov 15, 2025

Thanks @weireweire. I've made a complete fix in #28768.

However, this doesn't include your change to the scheduler:

  • Avoid exhausting the request queue in a single scheduling step in PP mode, which prevents pipeline parallelism from actually running in parallel.

If I understand correctly, this is an orthogonal optimization? Perhaps you could open another PR with just that change for consideration?

@weireweire (Contributor, Author) commented:

@njhill yes, it's orthogonal. I'll revert the overlap fix and keep only the "run out of requests" fix here.

@weireweire changed the title from "[Draft] Fix the issue where there is no parallelism in PP mode" to "[BugFix] Fix the issue where there is no parallelism in PP mode" on Nov 18, 2025
@mergify bot commented Nov 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @weireweire.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@weireweire (Contributor, Author) commented:

@WoosukKwon could you help review?

… in ray backend.

Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>