Conversation


@njhill njhill commented Nov 15, 2025

This fixes two pipeline-parallel related issues introduced by #26866:

  1. Performance regression reported in [Bug]: Pipeline parallel doesn't really do the "parallel" among GPUs. #28270
  2. KV connector output propagation/aggregation from workers

The first commit just reverts the granular tracing context managers added in #28329, which made the code very difficult to read/update. Please just look at the changes in the second commit, which are smaller/cleaner.

Fixes #28270

Thanks to @weireweire for reporting.

@mergify mergify bot added the v1 label Nov 15, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces two main changes: a performance improvement by removing granular tracing context managers, and a fix for KV connector output propagation in pipeline parallel setups. The removal of record_function_or_nullcontext should improve performance and code readability. The refactoring of the pipeline parallelism logic in step_with_batch_queue to be more asynchronous is also a good improvement. However, I have a critical concern regarding the new mechanism for propagating kv_connector_output from non-final pipeline parallel ranks. The logic has been moved from execute_model to sample_tokens, but it seems sample_tokens is only executed on the final rank. This could lead to the loss of kv_connector_output from other ranks, breaking the feature. Please see my detailed comment.
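The reviewer's concern can be reduced to a toy model: if per-rank outputs are gathered only inside `sample_tokens()`, and only the final pipeline stage runs `sample_tokens()`, then `kv_connector_output` produced by earlier stages is silently dropped. The names below mirror the review comment but the code is purely illustrative, not vLLM's actual implementation.

```python
def collect_outputs(num_ranks: int, aggregate_only_in_sample_tokens: bool) -> list[str]:
    """Simulate output collection across pipeline-parallel ranks."""
    collected = []
    for rank in range(num_ranks):
        # Every rank's execute_model produces a kv_connector_output.
        kv_connector_output = f"kv-output-rank-{rank}"
        is_final_rank = rank == num_ranks - 1
        if aggregate_only_in_sample_tokens:
            # sample_tokens runs only on the final stage, so earlier
            # ranks' outputs never reach the aggregated result.
            if is_final_rank:
                collected.append(kv_connector_output)
        else:
            # Propagating from execute_model keeps every rank's output.
            collected.append(kv_connector_output)
    return collected

print(collect_outputs(4, aggregate_only_in_sample_tokens=True))   # ['kv-output-rank-3']
print(collect_outputs(4, aggregate_only_in_sample_tokens=False))  # all four outputs
```

Whether this failure mode applies to the actual PR depends on how the second commit forwards non-final-rank outputs, which is what the follow-up discussion addresses.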

@njhill njhill added the bug Something isn't working label Nov 15, 2025
@njhill njhill added this to the v0.11.1 milestone Nov 15, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 15, 2025
njhill and others added 2 commits November 15, 2025 10:22
Signed-off-by: Nick Hill <nhill@redhat.com>
@nvpohanh
Contributor

@weireweire please check if this works. Thanks!

@weireweire
Contributor

weireweire commented Nov 17, 2025

Have you tested lm_eval? I also found PP has an accuracy issue; not sure if it's related to #26866.

@weireweire
Contributor

weireweire commented Nov 17, 2025

From my test, this fixed the overlap issue but the accuracy is still bad.

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |  0.2|±  |0.1333|
|     |       |strict-match    |     5|exact_match|↑  |  0.2|±  |0.1333|

My command:

```shell
vllm serve nvidia/DeepSeek-R1-0528-FP4-v2 --trust-remote-code --host 0.0.0.0 --port 8000 \
  --pipeline-parallel-size 8 --tensor-parallel-size 1 --max-num-seqs 32 \
  --max-cudagraph-capture-size 32 --max-model-len 4010 --max-num-batched-tokens 16000 \
  --enable-chunked-prefill --kv-cache-dtype auto --gpu-memory-utilization 0.85 \
  --no-enable-prefix-caching
```

```shell
lm_eval \
  --model local-completions \
  --tasks gsm8k \
  --model_args base_url=http://0.0.0.0:8000/v1/completions,model=$MODEL,num_concurrent=$CONCURRENCY,timeout=6000,max_retries=1 \
  --output_path "$LOG_DIR" \
  --log_samples \
  --limit 10
```

This result is the same as with my draft fix, so the accuracy issue must be caused somewhere else.

@weireweire
Contributor

Accuracy issue tracked in #28839.

@njhill
Member Author

njhill commented Nov 17, 2025

Thanks @weireweire, I think the accuracy issue is probably separate from this; we'll investigate that too.

```python
logger.info("Batch queue is enabled with size %d", self.batch_queue_size)
self.batch_queue = deque(maxlen=self.batch_queue_size)

self.ec_producer = (
```
Collaborator

nit:

Suggested change:

```diff
- self.ec_producer = (
+ self.is_ec_producer = (
```

Member Author

I'll open a follow-on PR so that this doesn't hold up the release.

Member Author

@WoosukKwon opened follow-on #28884 for this.

@njhill njhill merged commit 7765e5b into vllm-project:main Nov 17, 2025
44 checks passed
@njhill njhill deleted the fix-broken-pp branch November 17, 2025 22:08
khluu pushed a commit that referenced this pull request Nov 17, 2025
…8768)

Signed-off-by: Nick Hill <nhill@redhat.com>
(cherry picked from commit 7765e5b)
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
…8768)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

- bug: Something isn't working
- ready: ONLY add when PR is ready to merge/full CI is needed
- v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Pipeline parallel doesn't really do the "parallel" among GPUs.

5 participants