
Conversation

@weireweire (Contributor) commented Nov 7, 2025

Purpose

  • Avoid exhausting the request queue in a single scheduling step in PP mode, which prevents pipeline parallelism from actually running in parallel.
    • For pipeline parallelism, the batch should be split into micro-batches so that every rank has work to do. Currently we don't split: up to --max-num-seqs requests are scheduled all at once, so the later pipeline steps have no requests to run and just sit waiting.
    • A workaround is to use --max-num-batched-tokens to avoid hitting --max-num-seqs in a single step, but that only works for prefill.
    • This fix simply caps the number of sequences issued per step at --max-num-seqs / PP_size, effectively splitting the batch into micro-batches (see the sketch after this list).
  • Fix undesired blocking in the engine core, which broke PP parallelism in the mp backend.
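
For illustration, here is a minimal sketch of the throttling idea (a hypothetical helper, not the actual vLLM code; max_num_seqs and pp_size stand for the values of --max-num-seqs and --pipeline-parallel-size):

def reached_micro_batch_limit(num_scheduled: int,
                              max_num_seqs: int,
                              pp_size: int) -> bool:
    # Cap the number of sequences scheduled in one step at
    # max_num_seqs // pp_size (but at least 1), leaving the rest
    # of the waiting queue for later steps so that every PP rank
    # has its own micro-batch in flight.
    limit = max(1, max_num_seqs // pp_size)
    return num_scheduled >= limit

The scheduler stops admitting new and resumed requests for the current step once this limit is reached.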

Test Plan

I tested on SM120 with PP=8; the command is:

vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --host 0.0.0.0 --port 8000 --pipeline-parallel-size 8 --tensor-parallel-size 1 --max-num-seqs 8 --max-cudagraph-capture-size 8 --max-model-len 4042 --max-num-batched-tokens 32000 --enable-chunked-prefill --kv-cache-dtype auto --gpu-memory-utilization 0.85 --no-enable-prefix-caching

Use --distributed-executor-backend to choose between the mp and ray backends.
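
For example (the serve command from above, abbreviated here with "..." for brevity):

vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 ... --distributed-executor-backend mp   # or: ray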

Benchmark command:
vllm bench serve --model nvidia/DeepSeek-R1-0528-FP4-v2 --host 127.0.0.1 --port 8000 --dataset-name random --max-concurrency 8 --num-prompts 256 --random-input-len 4000 --random-output-len 32 --random-range-ratio 0 --num-warmups 20

Test Result

mp backend, before this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  402.84    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              0.64      
Output token throughput (tok/s):         20.34     
Peak output token throughput (tok/s):    96.00     
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          2561.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          8007.14   
Median TTFT (ms):                        9785.40   
P99 TTFT (ms):                           9817.62   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          147.76    
Median TPOT (ms):                        91.55     
P99 TPOT (ms):                           324.47    
---------------Inter-token Latency----------------
Mean ITL (ms):                           147.76    
Median ITL (ms):                         90.32     
P99 ITL (ms):                            100.31    
==================================================

mp backend, after this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  158.89    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              1.61      
Output token throughput (tok/s):         51.56     
Peak output token throughput (tok/s):    176.00    
Peak concurrent requests:                15.00     
Total Token throughput (tok/s):          6494.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          2015.66   
Median TTFT (ms):                        2041.88   
P99 TTFT (ms):                           3430.10   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          95.11     
Median TPOT (ms):                        94.66     
P99 TPOT (ms):                           131.28    
---------------Inter-token Latency----------------
Mean ITL (ms):                           95.11     
Median ITL (ms):                         52.02     
P99 ITL (ms):                            1222.23   
==================================================

ray backend, before this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  412.88    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              0.62      
Output token throughput (tok/s):         19.84     
Peak output token throughput (tok/s):    80.00     
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          2499.34   
---------------Time to First Token----------------
Mean TTFT (ms):                          7172.32   
Median TTFT (ms):                        6685.45   
P99 TTFT (ms):                           9573.30   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          184.80    
Median TPOT (ms):                        184.27    
P99 TPOT (ms):                           265.55    
---------------Inter-token Latency----------------
Mean ITL (ms):                           184.80    
Median ITL (ms):                         109.89    
P99 ITL (ms):                            4795.00   
==================================================

ray backend, after this PR:

============ Serving Benchmark Result ============
Successful requests:                     256       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  178.30    
Total input tokens:                      1023744   
Total generated tokens:                  8192      
Request throughput (req/s):              1.44      
Output token throughput (tok/s):         45.95     
Peak output token throughput (tok/s):    136.00    
Peak concurrent requests:                16.00     
Total Token throughput (tok/s):          5787.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          2318.94   
Median TTFT (ms):                        2170.65   
P99 TTFT (ms):                           3270.88   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.89    
Median TPOT (ms):                        109.21    
P99 TPOT (ms):                           162.90    
---------------Inter-token Latency----------------
Mean ITL (ms):                           104.89    
Median ITL (ms):                         60.14     
P99 ITL (ms):                            1416.10   
==================================================


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a throttling mechanism for scheduling new requests in pipeline parallelism (PP) mode. The goal is to prevent the pipeline from stalling by not exhausting the waiting queue at once, thus maintaining a backlog of requests. The logic seems sound for this purpose. However, I've identified a potential critical issue where a ZeroDivisionError could occur if pipeline_parallel_size is misconfigured to be zero. I've provided a suggestion to make the code more robust by adding a check for this.

Comment on lines +358 to +405
if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
    1,
    self.max_num_running_reqs
    // self.parallel_config.pipeline_parallel_size,
):
    break

Severity: critical

The division by self.parallel_config.pipeline_parallel_size on line 361 assumes this value is non-zero. If it were misconfigured to be 0, this would lead to a ZeroDivisionError, causing the scheduler to crash. While pipeline_parallel_size defaults to 1, adding a check to ensure it's positive would make the code more robust against configuration errors. The suggested change also improves readability by extracting the limit into a variable.

Suggested change
-if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= max(
-    1,
-    self.max_num_running_reqs
-    // self.parallel_config.pipeline_parallel_size,
-):
-    break
+pp_size = self.parallel_config.pipeline_parallel_size
+if pp_size <= 0:
+    raise ValueError(
+        "pipeline_parallel_size must be positive, but is "
+        f"{pp_size}")
+limit = max(1, self.max_num_running_reqs // pp_size)
+if len(scheduled_resumed_reqs) + len(scheduled_new_reqs) >= limit:
+    break
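
A note on this suggestion: the max(1, ...) floor already guarantees that at least one request is scheduled per step even when max_num_running_reqs < pp_size, so the explicit pp_size check only guards against a misconfigured (non-positive) pipeline_parallel_size; it could equally live in config validation rather than on the scheduling hot path.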

@heheda12345 (Collaborator) commented:

CC @njhill as it may be related to async scheduling as mentioned here #26866

@njhill (Member) commented Nov 15, 2025

Thanks @weireweire. I've made a complete fix in #28768.

However, this doesn't include your change to the scheduler:

  • Avoid exhausting the request queue in a single scheduling step in PP mode, which prevents pipeline parallelism from actually running in parallel.

If I understand correctly, this is an orthogonal optimization? Perhaps you could open another PR with just that change for consideration?

@weireweire (Contributor, Author) commented:

@njhill yes, it's orthogonal. I'll revert the overlap fix and keep only the "run out of requests" fix here.

@weireweire changed the title from "[Draft] Fix the issue where there is no parallelism in PP mode" to "[BugFix] Fix the issue where there is no parallelism in PP mode" on Nov 18, 2025
@mergify bot commented Nov 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @weireweire.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@weireweire (Contributor, Author) commented:

@WoosukKwon could you help review?

… in ray backend.

Signed-off-by: Weiliang Liu <weiliangl@nvidia.com>