[GAUDISW-241080] enable spec decode for Unified Attention #619
Pull request overview
This PR enables speculative decoding support for Unified Attention in vLLM on Gaudi. The changes integrate spec decode metadata handling and rejection sampling into the unified execution path, along with test infrastructure updates.
Key Changes:
- Added spec decode metadata handling and rejection sampling to the unified execution model path
- Refactored `propose_draft_token_ids` method signature to support both unified and non-unified attention paths
- Added test cases for spec decode with ngram and eagle3 using Unified Attention
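The `propose_draft_token_ids` refactor listed above might look roughly like the sketch below; the class name, parameter names, and defaults are assumptions for illustration, not the actual code in `vllm_gaudi/v1/worker/hpu_model_runner.py`.

```python
# Hypothetical sketch only: names, parameters, and defaults are assumed,
# not taken from the real model runner.
from typing import Any, Optional


class ModelRunnerSketch:

    def propose_draft_token_ids(
        self,
        sampled_token_ids: list[list[int]],
        # Optional arguments with defaults let the unified-attention path
        # pass its own pre-built metadata, while the non-unified path can
        # call the method without them.
        spec_decode_metadata: Optional[Any] = None,
        attn_metadata: Optional[Any] = None,
    ) -> list[list[int]]:
        if spec_decode_metadata is None:
            # Non-unified path: metadata would be rebuilt from runner state.
            pass
        # Draft proposal (ngram / eagle3) would run here; returning the
        # sampled ids keeps this placeholder runnable.
        return sampled_token_ids
```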
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Integrated spec decode sampling logic into unified execution path and refactored method signatures to support optional parameters |
| vllm_gaudi/extension/unified_batch.py | Added spec_decode_metadata field to UnifiedBatch dataclass and debug print statement |
| tests/full_tests/spec_decode.py | Commented out VLLM_CONTIGUOUS_PA environment variable setting |
| tests/full_tests/ci_gsm8k_tests.sh | Added test functions for spec decode with ngram and eagle3 using Unified Attention |
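For reference, the `UnifiedBatch` change described in the table above could look roughly like this; apart from the new `spec_decode_metadata` field, the fields and types below are placeholders rather than the real dataclass.

```python
# Placeholder sketch of vllm_gaudi/extension/unified_batch.py; only the
# new spec_decode_metadata field comes from the PR description, the rest
# is invented for illustration.
from dataclasses import dataclass
from typing import Any, Optional

import torch


@dataclass
class UnifiedBatch:
    token_ids: torch.Tensor       # placeholder field
    attn_metadata: Any            # placeholder field
    # New optional field: carries the per-request draft-token layout so
    # the unified execution path can run rejection sampling.
    spec_decode_metadata: Optional[Any] = None
```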
@kzawora-intel @adobrzyn I've completed the first PR for supporting spec decode for unified attention; only ngram is enabled to keep the code change minimal. Please help review.
✅ CI Passed: All checks passed successfully against the following vllm commit:
🚧 CI Blocked: The main CI workflow was not started for the following reason:
tests/full_tests/spec_decode.py (outdated diff hunk under review):

      os.environ["VLLM_SKIP_WARMUP"] = "true"
    - os.environ["VLLM_CONTIGUOUS_PA"] = "false"
    + #os.environ["VLLM_CONTIGUOUS_PA"] = "false"
Can we remove it?
Sure, I can do that during rebase
adobrzyn left a comment:
lgtm
SW-241080
Current status:
- UA + spec_decode NGRAM => Done => accuracy verified
- UA + spec_decode eagle3 => Done => accuracy verified
Design doc:
- For non-UA, we pad the target_model input to a fixed token_ids shape, to limit the number of possible HPU graph shapes.
- For UA, we can use the actual draft tokens and avoid redundant padding => this follows a design very similar to the GPU path.
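A toy illustration of that padding difference is sketched below; the bucket size and helper names are made up for the example and are not vllm-gaudi code.

```python
PAD_TOKEN_ID = 0
SPEC_BUCKET = 8  # hypothetical fixed bucket used by the non-UA path


def non_ua_step_input(token_ids: list[int]) -> list[int]:
    # Non-UA: pad every request to the same bucketed length so the HPU
    # graph sees a fixed token_ids shape across steps.
    return token_ids + [PAD_TOKEN_ID] * (SPEC_BUCKET - len(token_ids))


def ua_step_input(token_ids: list[int]) -> list[int]:
    # UA: keep the actual [sampled token, draft tokens...] sequence,
    # similar to the GPU path, so no padding work is wasted.
    return list(token_ids)
```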
Which is to say, for UA spec decode the workflow is:

== start step ==
input (contains prompt, no draft) => target_model => regular sampling => update states => draft model => update draft tokens for next step

== next step ==
input (contains draft tokens) => target_model (sharable attn with multiple tokens per request: sampled token + draft tokens) => rejection sampler (verifies draft tokens to get the final validated sampled tokens) => update states => draft model (to get new draft tokens, skipping any tokens rejected by the target model) => update draft tokens for next step
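A schematic Python sketch of this two-phase loop, assuming that every function and attribute name below is a placeholder rather than an actual vllm-gaudi API:

```python
# Schematic sketch of one UA spec-decode step; all names are placeholders.
def ua_spec_decode_step(batch, target_model, sample_fn, rejection_sampler,
                        draft_model, state):
    logits = target_model(batch.token_ids)
    if batch.spec_decode_metadata is None:
        # == start step ==: prompt only, regular sampling.
        accepted = sample_fn(logits)
    else:
        # == next step ==: each request carries [sampled token, draft tokens];
        # the rejection sampler keeps only the validated tokens.
        accepted = rejection_sampler(logits, batch.spec_decode_metadata)

    state.update_requests(accepted)   # scheduler picks up accepted tokens here
    # The draft model must only see tokens that survived verification.
    new_drafts = draft_model(accepted)
    state.set_draft_tokens(new_drafts)  # becomes part of the next step's input
    return accepted
```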
Example:
input is
input with draft is [[in your] [is] [nice tool] [name]]
=> output from target model is [[your mind] [an] [way used] [name]]
=> after rejection sampler [[your mind] [an] [way] [name]] // the bonus token is only accepted when all draft tokens are accepted
=> input to draft model [[mind] [an] [way] [name]] // note: we need to create a new attn_meta, or reuse the existing one but carefully select the output indices.
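To make the acceptance rule in this example concrete, here is a tiny, self-contained toy verifier; token strings stand in for token ids, and `greedy_verify` is an invented helper, not the actual rejection sampler.

```python
def greedy_verify(draft_tokens: list[str], target_outputs: list[str]) -> list[str]:
    """draft_tokens: draft tokens appended after the last sampled token.
    target_outputs: target-model prediction at each input position
    (one for the last sampled token, then one per draft token)."""
    accepted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_outputs[i]:
            # Draft mismatch: keep the target's token and drop the rest.
            accepted.append(target_outputs[i])
            return accepted
        accepted.append(draft)
    # All drafts accepted: the bonus token is accepted as well.
    accepted.append(target_outputs[len(draft_tokens)])
    return accepted


# Request 1 from the example: input [in, your] -> target output [your, mind]
print(greedy_verify(["your"], ["your", "mind"]))  # ['your', 'mind']
# Request 3: input [nice, tool] -> target output [way, used]
print(greedy_verify(["tool"], ["way", "used"]))   # ['way']
```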
Changes introduced in this PR:
- `create_unified_batch` / `unified_execute_model`: handle spec decode in the UPDATE REQUEST STATE part, so draft tokens can be picked up by the scheduler
- `propose_draft_token_ids`: refactored so several arguments have default values
- `_prepare_spec_decode_inputs_for_ua`: added for Unified Attention input preparation

Validation: