Spec decode warmup support #624
Conversation
@xuechendi Would you please review this?
# Leave space for the output token and draft tokens to propose
num_lookahead_tokens = 1
if self.speculative_config and self.speculative_config.use_eagle():
same question here
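For readers following along, here is a hedged sketch of the reservation logic the quoted diff implies. The function name is invented for illustration; the attribute names (`use_eagle`, `num_speculative_tokens`) are taken from the quoted snippet and from vLLM's speculative config, and this is not the actual implementation:

```python
# Sketch (hypothetical helper, names assumed from the quoted diff):
# reserve KV-cache slots for the sampled output token plus, under Eagle,
# the draft tokens the proposer will generate.
def compute_num_lookahead_tokens(speculative_config) -> int:
    # Always leave one slot for the sampled output token.
    num_lookahead_tokens = 1
    if speculative_config is not None and speculative_config.use_eagle():
        # Eagle additionally proposes draft tokens, so reserve space for them.
        num_lookahead_tokens += speculative_config.num_speculative_tokens
    return num_lookahead_tokens
```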
seq_len = (req_state.num_computed_tokens + scheduler_output.num_scheduled_tokens[req_id])
# Cannot use num_computed_tokens + num_scheduled_tokens here
# as it may include rejected spec decode tokens
seq_len = self.input_batch.num_tokens_no_spec[i]
Good catch!
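A toy illustration of why the two formulas in the diff above diverge when draft tokens are rejected (all numbers are invented for illustration):

```python
# With spec decode, num_scheduled_tokens includes draft tokens that may
# later be rejected, so num_computed + num_scheduled can overshoot the
# true sequence length. A counter of accepted-only tokens (what
# num_tokens_no_spec tracks) does not.
num_computed_tokens = 100
num_scheduled_tokens = 3      # 1 real token + 2 draft tokens
num_rejected = 2              # both drafts rejected

wrong_seq_len = num_computed_tokens + num_scheduled_tokens              # overshoots
true_seq_len = num_computed_tokens + num_scheduled_tokens - num_rejected
```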
Originally, I took the simplified approach that it is up to the user to configure the bucketing parameters properly to cover the shapes (batch_size, 1, context_blocks) for spec decode. Please note: the current spec decode has some tricky aspects regarding the bucketed batch size and context blocks:
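To make that configuration requirement concrete, here is a hypothetical coverage check. Everything here is illustrative (not the actual bucketing API); it only encodes the point that flattened spec decode multiplies the batch size the buckets must cover:

```python
# Hypothetical helper: verify that user-configured buckets cover the
# (batch_size, query_len=1, context_blocks) shapes spec decode will hit.
def bucket_covers(bucket_batch_sizes, bucket_blocks,
                  real_batch_size, num_spec_tokens, context_blocks):
    # With flattened spec decode, the effective batch size grows by
    # (1 + num_spec_tokens), so buckets must cover the larger value.
    effective_bs = real_batch_size * (1 + num_spec_tokens)
    bs_ok = any(b >= effective_bs for b in bucket_batch_sizes)
    blocks_ok = any(b >= context_blocks for b in bucket_blocks)
    return bs_ok and blocks_ok
```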
@jerrychenhf, please add profiling output to the ticket to show that the HPU graphs have been captured for spec decode.
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
Updated the test results in GAUDISW-242931. With warmup enabled by this PR, running "PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task eagle3 --batch_size 8 --osl 2048 --num_spec_tokens 2" shows no configurations reported as "not warmed up".
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
GAUDISW-242931

Because spec decode currently flattens the spec decode tokens into [batch_size * num_tokens, 1], we can warm up the decode shapes as before. What changes is the maximum batch_size we should warm up in the configuration, because the real batch size is batch_size * num_tokens, which is num_tokens (1 + num_speculative_tokens) times the original batch size.

One thing to take care of in warmup is the draft token (and block) space for the proposing process in Eagle: we need to leave num_speculative_tokens of space for Eagle's propose step to use.

Another thing to take care of (already done in the PR supporting num_speculative_tokens > 1) is that warmup runs in compile-only mode, without real computation happening, so the operations in the drafter's prepare_attn_metadata that depend on real position values must be done on the CPU.

A separate issue, handling no spec decode tokens in the decode phase, has already been addressed in #593.
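The batch size arithmetic described above can be sketched as a worked example (plain arithmetic with the PR's values, not actual vLLM code):

```python
# With flattened spec decode, the decode step runs on shape
# [batch_size * num_tokens, 1], where num_tokens = 1 + num_speculative_tokens.
# The maximum batch size to warm up must therefore scale accordingly.
max_batch_size = 8                    # scheduler-level batch size
num_speculative_tokens = 2            # as in the test command above
num_tokens = 1 + num_speculative_tokens

# Effective batch size the HPU graph actually sees during decode:
effective_batch_size = max_batch_size * num_tokens
```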