
Conversation

@xuechendi
Collaborator

@xuechendi xuechendi commented Nov 21, 2025

SW-241080

Current status:
UA + spec_decode NGRAM => Done => accuracy verified
UA + spec_decode eagle3 => Done => accuracy verified

design doc:
For non-UA, we pad the target_model input to a fixed token_ids shape to limit the number of possible HPU graphs.
For UA, we can use the actual draft tokens and avoid the redundant padding => follows a design very similar to the GPU path.

Which is to say, for UA spec decode:

  1. we skip spec decode for the target model if no draft tokens were generated in the last run
  2. with draft tokens, we do not pad input_token_ids; instead, target_token_indices and bonus_token_indices indicate which tokens the rejection sampler should judge (a sketch of this index construction follows the example below)
  3. however, for the input to the draft model, we reuse the attn_metadata from the target model (as the initial implementation) => updating the metadata to remove rejected tokens will be the next step.
# Example:
      # scheduled_spec_decode_tokens={'0': [-1], '1': [-1], '2': [17689], '3': [-1]} => only the 3rd request has a draft token
      # token_ids = [[tok_0], [tok_1], [tok_2, draft_tok], [tok_4]]
      # draft_token_indices = [0, 0, 3, 0] => flattened position of each draft token in token_ids, compared against the target model output
      # target_token_indices = [-1, -1, 2, -1] => -1 is a placeholder; only position 2 of the target model output is verified
      # bonus_token_indices = [0, 1, 3, 4] => positions of the newly generated (bonus) tokens from the target model

      # current design for the draft model forward pass
      # if the draft token gets accepted by the target model
      # => last token indices used to select the next draft token from the draft model are [0, 1, 3, 4]
      # if the draft token gets rejected
      # => last token indices used to select the next draft token from the draft model are [0, 1, 2, 4]
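
For illustration, here is a minimal sketch of how these index lists could be built from scheduled_spec_decode_tokens, assuming at most one draft token per request as in the example above. The helper name build_spec_decode_indices is hypothetical and is not the actual _prepare_spec_decode_inputs_for_ua implementation.

```python
# Hypothetical sketch (not the actual vLLM Gaudi code): build the flattened
# index lists used by the rejection sampler, assuming at most one draft token
# per request and -1 meaning "no draft token".
def build_spec_decode_indices(scheduled_spec_decode_tokens, req_ids):
    draft_token_indices = []   # flat position of each request's draft token (0 = placeholder)
    target_token_indices = []  # flat position in the target output to verify (-1 = placeholder)
    bonus_token_indices = []   # flat position of each request's bonus (newly sampled) token
    flat_pos = 0
    for req_id in req_ids:
        drafts = [t for t in scheduled_spec_decode_tokens.get(req_id, []) if t != -1]
        if drafts:
            # one regular token followed by its draft token for this request
            target_token_indices.append(flat_pos)      # verify the target output at the regular token
            draft_token_indices.append(flat_pos + 1)   # the draft token sits right after it
            bonus_token_indices.append(flat_pos + 1)   # the bonus token comes from the last position
            flat_pos += 2
        else:
            target_token_indices.append(-1)            # nothing to verify for this request
            draft_token_indices.append(0)              # placeholder
            bonus_token_indices.append(flat_pos)
            flat_pos += 1
    return draft_token_indices, target_token_indices, bonus_token_indices

# For the example above this yields [0, 0, 3, 0], [-1, -1, 2, -1], [0, 1, 3, 4].
```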

workflow:
== start step ==
input (contains prompt, no draft) => target_model => regular sampling => update states => draft model => update draft tokens for the next step

== next step ==
input (contains draft tokens) => target_model (uses shareable attn with multiple tokens per request - the regular token plus draft tokens) => rejection sampler (verifies draft tokens to get the final validated sampled tokens) => update states => draft model (to get new draft tokens, we need to skip any tokens rejected by the target model) => update draft tokens for the next step
Example:
input with draft tokens is [[in your] [is] [nice tool] [name]]
=> output from target model is [[your mind] [an] [way used] [name]]
=> after the rejection sampler [[your mind] [an] [way] [name]] // the bonus token is only accepted when the draft token is accepted
=> input to draft model [[mind] [an] [way] [name]] // note: we need to create a new attn_meta or reuse the existing one but carefully select the output indices (see the sketch after this example)
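
A minimal sketch of the last-token selection described above, assuming accepted_mask marks the requests whose single draft token was accepted; the helper name is illustrative only, not the actual API.

```python
# Hypothetical sketch: choose, per request, the flattened position the draft
# model should read from. Accepted drafts (and requests without drafts) keep the
# bonus-token position; rejected drafts fall back to the regular-token position.
def select_last_token_indices(bonus_token_indices, target_token_indices, accepted_mask):
    last_token_indices = []
    for bonus_pos, target_pos, accepted in zip(bonus_token_indices,
                                               target_token_indices,
                                               accepted_mask):
        if target_pos == -1 or accepted:
            last_token_indices.append(bonus_pos)
        else:
            last_token_indices.append(target_pos)
    return last_token_indices

# With the earlier example: all accepted -> [0, 1, 3, 4]; request 2 rejected -> [0, 1, 2, 4].
```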

Changes introduced in this PR

  1. add a new spec decode argument to create_unified_batch
  2. update the UPDATE REQUEST STATE part of unified_execute_model so draft tokens can be picked up by the scheduler
  3. reorder the parameters of propose_draft_token_ids so several arguments can take default values
  4. implement the new _prepare_spec_decode_inputs_for_ua for Unified Attention input preparation
  5. add the new propose_eagle_unified in a new proposal file (see the end-to-end sketch after this list)
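
To show how these pieces fit together, here is a high-level sketch of one UA spec decode step, wiring together the helpers sketched earlier. run_target_model, run_rejection_sampler and propose_drafts are placeholder names, not the actual functions changed by this PR.

```python
# Hypothetical end-to-end sketch of one UA spec decode step; all function names
# below are illustrative placeholders, not the real vLLM Gaudi API.
def unified_spec_decode_step(batch, scheduled_spec_decode_tokens, req_ids):
    draft_idx, target_idx, bonus_idx = build_spec_decode_indices(
        scheduled_spec_decode_tokens, req_ids)

    # The target model sees the regular tokens plus the unpadded draft tokens (UA path).
    logits = run_target_model(batch)

    # The rejection sampler compares the target output at target_idx against the
    # draft tokens at draft_idx; bonus tokens are taken from bonus_idx.
    sampled_tokens, accepted_mask = run_rejection_sampler(
        logits, draft_idx, target_idx, bonus_idx)

    # The draft model reuses the target model's attn_metadata (initial implementation);
    # only the last-token positions differ depending on acceptance.
    last_token_indices = select_last_token_indices(bonus_idx, target_idx, accepted_mask)
    new_draft_tokens = propose_drafts(batch, last_token_indices)
    return sampled_tokens, new_draft_tokens
```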

Validation:

VLLM_UNIFIED_ATTN=True VLLM_SKIP_WARMUP=True PT_HPU_LAZY_MODE=1 python "${VLLM_GAUDI_PREFIX}/tests/full_tests/spec_decode.py" --task ngram --assert_acc_rate 0.25 --osl 512
and 
VLLM_UNIFIED_ATTN=True VLLM_SKIP_WARMUP=True PT_HPU_LAZY_MODE=1 python "${VLLM_GAUDI_PREFIX}/tests/full_tests/spec_decode.py" --task eagle3 --assert_accept_rate 0.50 --osl 1024

================= spec_ngram =================
latency: 46.99283313751221
acc_counts: [1742, 0]
acc_rate: 0.27142411966344654
num_draft_tokens: 6418
num_drafts: 6418
---
Prompt: Hello, my name is
Generated text:  Xiaoyu, and I'm a student at the University of Science and Technology of China. I'm currently studying in the Department of Physics. I'm in my second year, and I'm majoring in physics. I'm interested'...'
---
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States. The president is the head of the executive branch of the U.S. government, and is the commander-in-chief of the United States Armed Forces. The p'...'
---
Prompt: The capital of France is
Generated text:  Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of Portugal is Lisbon. The capital of Greece is Athens. The capital of Belgium is Br'...'
---
Prompt: The future of AI is
Generated text:  a topic that has been the subject of much speculation and debate. As the technology continues to evolve, it's clear that AI will play an increasingly important role in our lives. However, the questio'...'
---
Prompt: San Francisco is know for its
Generated text:  fog, but the fog is not the only thing that is fog-like. The city is also known for its fog, but the fog is not the only thing that is fog-like. The city is also known for its fog, but the fog is not'...'
---
Prompt: Facebook was created in 2004 by
Generated text:  Mark Zuckerberg, and it has become one of the most popular social media platforms. It is a social networking site that allows users to connect with friends and family, share photos and videos, and po'...'
---
Prompt: Curious George is a
Generated text:  2015 American 3D computer-animated comedy film directed by Tom McCamus and written by David W. Zucker, and starring the titular character, Curious George, voiced by the actor and comedian Will Ferrel'...'
---
Prompt: Python 3.11 brings improvements to its
Generated text:  standard library, including the `typing` module. One of the notable changes is the introduction of the `TypeAlias` feature, which allows for the creation of type aliases in a more readable and concis'...'
=========================================

Copilot AI review requested due to automatic review settings November 21, 2025 23:29
Contributor

Copilot AI left a comment


Pull request overview

This PR enables speculative decoding support for Unified Attention in vLLM on Gaudi. The changes integrate spec decode metadata handling and rejection sampling into the unified execution path, along with test infrastructure updates.

Key Changes:

  • Added spec decode metadata handling and rejection sampling to the unified execution model path
  • Refactored propose_draft_token_ids method signature to support both unified and non-unified attention paths
  • Added test cases for spec decode with ngram and eagle3 using Unified Attention

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Reviewed files:

  • vllm_gaudi/v1/worker/hpu_model_runner.py: Integrated spec decode sampling logic into the unified execution path and refactored method signatures to support optional parameters
  • vllm_gaudi/extension/unified_batch.py: Added a spec_decode_metadata field to the UnifiedBatch dataclass and a debug print statement
  • tests/full_tests/spec_decode.py: Commented out the VLLM_CONTIGUOUS_PA environment variable setting
  • tests/full_tests/ci_gsm8k_tests.sh: Added test functions for spec decode with ngram and eagle3 using Unified Attention


@xuechendi xuechendi force-pushed the dev/UA_spec_decode branch 7 times, most recently from ae733c8 to 58b9c42 Compare November 26, 2025 21:32
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi
Collaborator Author

@kzawora-intel @adobrzyn, I completed the first PR for supporting spec decode with unified attention; only ngram is enabled to keep the code change minimal. Please help review.

@vllm-project vllm-project deleted a comment from github-actions bot Nov 26, 2025
@vllm-project vllm-project deleted a comment from github-actions bot Nov 26, 2025
@vllm-project vllm-project deleted a comment from github-actions bot Nov 26, 2025
@vllm-project vllm-project deleted a comment from github-actions bot Nov 26, 2025
@vllm-project vllm-project deleted a comment from github-actions bot Nov 26, 2025
@xuechendi xuechendi changed the title enable spec decode for Unified Attention enable spec decode for Unified Attention, part1 Nov 26, 2025
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@github-actions

github-actions bot commented Dec 1, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@xuechendi xuechendi changed the title enable spec decode for Unified Attention, part1 [GAUDISW-241080] enable spec decode for Unified Attention, part1 Dec 2, 2025
@github-actions

github-actions bot commented Dec 2, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

github-actions bot commented Dec 2, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
52cb349fc010c3d9e8f576f7cc675e6403aadd0a

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi changed the title [GAUDISW-241080] enable spec decode for Unified Attention, part1 [GAUDISW-241080] enable spec decode for Unified Attention Dec 3, 2025
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@github-actions

github-actions bot commented Dec 3, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
5aa9b090407d5fb9b89c05d28fab808623e3070c


os.environ["VLLM_SKIP_WARMUP"] = "true"
os.environ["VLLM_CONTIGUOUS_PA"] = "false"
#os.environ["VLLM_CONTIGUOUS_PA"] = "false"
Collaborator


Can we remove it?

Collaborator Author


Sure, I can do that during rebase

Collaborator

@adobrzyn adobrzyn left a comment


lgtm

Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
@xuechendi xuechendi force-pushed the dev/UA_spec_decode branch 2 times, most recently from f7b62a9 to 6e587b0 Compare December 4, 2025 17:02
@github-actions

github-actions bot commented Dec 4, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

github-actions bot commented Dec 4, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
1b7c7f5159484063af28cb47809d79e83d3301ec

@xuechendi xuechendi merged commit 0669088 into vllm-project:main Dec 4, 2025
47 checks passed