[GAUDISW-241080] enable spec decode for Unified Attention #619
Pull request overview
This PR enables speculative decoding support for Unified Attention in vLLM on Gaudi. The changes integrate spec decode metadata handling and rejection sampling into the unified execution path, along with test infrastructure updates.
Key Changes:
- Added spec decode metadata handling and rejection sampling to the unified execution model path
- Refactored `propose_draft_token_ids` method signature to support both unified and non-unified attention paths
- Added test cases for spec decode with ngram and eagle3 using Unified Attention
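The `propose_draft_token_ids` refactor listed above might look roughly like the sketch below; the class name, parameter names, and defaults are assumptions for illustration, not the actual code in `vllm_gaudi/v1/worker/hpu_model_runner.py`.

```python
# Hypothetical sketch only: names, parameters, and defaults are assumed,
# not taken from the real model runner.
from typing import Any, Optional


class ModelRunnerSketch:

    def propose_draft_token_ids(
        self,
        sampled_token_ids: list[list[int]],
        # Optional arguments with defaults let the unified-attention path
        # pass its own pre-built metadata, while the non-unified path can
        # call the method without them.
        spec_decode_metadata: Optional[Any] = None,
        attn_metadata: Optional[Any] = None,
    ) -> list[list[int]]:
        if spec_decode_metadata is None:
            # Non-unified path: metadata would be rebuilt from runner state.
            pass
        # Draft proposal (ngram / eagle3) would run here; returning the
        # sampled ids keeps this placeholder runnable.
        return sampled_token_ids
```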
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_model_runner.py | Integrated spec decode sampling logic into unified execution path and refactored method signatures to support optional parameters |
| vllm_gaudi/extension/unified_batch.py | Added spec_decode_metadata field to UnifiedBatch dataclass and debug print statement |
| tests/full_tests/spec_decode.py | Commented out VLLM_CONTIGUOUS_PA environment variable setting |
| tests/full_tests/ci_gsm8k_tests.sh | Added test functions for spec decode with ngram and eagle3 using Unified Attention |
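For reference, the `UnifiedBatch` change described in the table above could look roughly like this; apart from the new `spec_decode_metadata` field, the fields and types below are placeholders rather than the real dataclass.

```python
# Placeholder sketch of vllm_gaudi/extension/unified_batch.py; only the
# new spec_decode_metadata field comes from the PR description, the rest
# is invented for illustration.
from dataclasses import dataclass
from typing import Any, Optional

import torch


@dataclass
class UnifiedBatch:
    token_ids: torch.Tensor       # placeholder field
    attn_metadata: Any            # placeholder field
    # New optional field: carries the per-request draft-token layout so
    # the unified execution path can run rejection sampling.
    spec_decode_metadata: Optional[Any] = None
```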
@kzawora-intel @adobrzyn I've completed the first PR for supporting spec decode for unified attention; only ngram is enabled to keep the code change minimal. Please help review.
✅ CI Passed: All checks passed successfully against the following vllm commit:
🚧 CI Blocked: The main CI workflow was not started for the following reason:
tests/full_tests/spec_decode.py (outdated diff hunk under review):

      os.environ["VLLM_SKIP_WARMUP"] = "true"
    - os.environ["VLLM_CONTIGUOUS_PA"] = "false"
    + #os.environ["VLLM_CONTIGUOUS_PA"] = "false"
Can we remove it?
Sure, I can do that during rebase
adobrzyn left a comment:
lgtm
SW-241080
Current status:
- UA + spec_decode NGRAM => Done => accuracy verified
- UA + spec_decode eagle3 => Done => accuracy verified
Design doc:
- For non-UA, we pad the target_model input to a fixed token_ids shape, to limit the number of possible HPU graph shapes.
- For UA, we can use the actual draft tokens and avoid redundant padding => this follows a design very similar to the GPU path.
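A toy illustration of that padding difference is sketched below; the bucket size and helper names are made up for the example and are not vllm-gaudi code.

```python
PAD_TOKEN_ID = 0
SPEC_BUCKET = 8  # hypothetical fixed bucket used by the non-UA path


def non_ua_step_input(token_ids: list[int]) -> list[int]:
    # Non-UA: pad every request to the same bucketed length so the HPU
    # graph sees a fixed token_ids shape across steps.
    return token_ids + [PAD_TOKEN_ID] * (SPEC_BUCKET - len(token_ids))


def ua_step_input(token_ids: list[int]) -> list[int]:
    # UA: keep the actual [sampled token, draft tokens...] sequence,
    # similar to the GPU path, so no padding work is wasted.
    return list(token_ids)
```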
Which is to say, for UA spec decode the workflow is:

== start step ==
input (contains prompt, no draft) => target_model => regular sampling => update states => draft model => update draft tokens for next step

== next step ==
input (contains draft tokens) => target_model (sharable attn with multiple tokens per request: sampled token + draft tokens) => rejection sampler (verifies draft tokens to get the final validated sampled tokens) => update states => draft model (to get new draft tokens, skipping any tokens rejected by the target model) => update draft tokens for next step
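A schematic Python sketch of this two-phase loop, assuming that every function and attribute name below is a placeholder rather than an actual vllm-gaudi API:

```python
# Schematic sketch of one UA spec-decode step; all names are placeholders.
def ua_spec_decode_step(batch, target_model, sample_fn, rejection_sampler,
                        draft_model, state):
    logits = target_model(batch.token_ids)
    if batch.spec_decode_metadata is None:
        # == start step ==: prompt only, regular sampling.
        accepted = sample_fn(logits)
    else:
        # == next step ==: each request carries [sampled token, draft tokens];
        # the rejection sampler keeps only the validated tokens.
        accepted = rejection_sampler(logits, batch.spec_decode_metadata)

    state.update_requests(accepted)   # scheduler picks up accepted tokens here
    # The draft model must only see tokens that survived verification.
    new_drafts = draft_model(accepted)
    state.set_draft_tokens(new_drafts)  # becomes part of the next step's input
    return accepted
```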
Example:
input is
input with draft is [[in your] [is] [nice tool] [name]]
=> output from target model is [[your mind] [an] [way used] [name]]
=> after rejection sampler [[your mind] [an] [way] [name]] // the bonus token is only accepted when all draft tokens are accepted
=> input to draft model [[mind] [an] [way] [name]] // note: we need to create a new attn_meta, or reuse the existing one but carefully select the output indices.
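To make the acceptance rule in this example concrete, here is a tiny, self-contained toy verifier; token strings stand in for token ids, and `greedy_verify` is an invented helper, not the actual rejection sampler.

```python
def greedy_verify(draft_tokens: list[str], target_outputs: list[str]) -> list[str]:
    """draft_tokens: draft tokens appended after the last sampled token.
    target_outputs: target-model prediction at each input position
    (one for the last sampled token, then one per draft token)."""
    accepted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_outputs[i]:
            # Draft mismatch: keep the target's token and drop the rest.
            accepted.append(target_outputs[i])
            return accepted
        accepted.append(draft)
    # All drafts accepted: the bonus token is accepted as well.
    accepted.append(target_outputs[len(draft_tokens)])
    return accepted


# Request 1 from the example: input [in, your] -> target output [your, mind]
print(greedy_verify(["your"], ["your", "mind"]))  # ['your', 'mind']
# Request 3: input [nice, tool] -> target output [way, used]
print(greedy_verify(["tool"], ["way", "used"]))   # ['way']
```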
Changes introduced in this PR:
- `create_unified_batch` / `unified_execute_model`: handle spec decode in the UPDATE REQUEST STATE part, so draft tokens can be picked up by the scheduler
- `propose_draft_token_ids`: refactored so several arguments have default values
- `_prepare_spec_decode_inputs_for_ua`: added for Unified Attention input preparation

Validation: