
Conversation

@jerrychenhf (Contributor) commented Nov 25, 2025

GAUDISW-242931

Because spec decode currently flattens the speculative tokens into [batch_size * num_tokens, 1], we can warm up the decode shapes as before. What changes is the maximum batch_size we should warm up in the configuration, because the real batch size is batch_size * num_tokens, which is num_tokens (1 + num_speculative_tokens) times the original batch size.

One thing to take care of in the warmup is the draft-token (and block) space for the proposing process in EAGLE: we need to reserve num_speculative_tokens worth of space for EAGLE's propose step.

Another thing to take care of (already handled in the PR that adds support for num_speculative_tokens > 1) is that warmup runs in compile-only mode without real computation happening, so the operations in the drafter's prepare_attn_metadata that depend on real position values must be done on the CPU.

The separate issue of handling decode requests with no spec decode tokens has already been addressed in #593.
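For illustration, a minimal sketch of that batch-size relationship (the helper name is hypothetical, not code from this PR):

# Sketch: the effective decode batch size that warmup must cover once spec
# decode flattens the tokens into [batch_size * num_tokens, 1].
def effective_decode_batch_size(batch_size: int, num_speculative_tokens: int) -> int:
    num_tokens = 1 + num_speculative_tokens  # sampled token plus draft tokens
    return batch_size * num_tokens

# Example: with batch_size=8 and num_speculative_tokens=2,
# warmup must cover decode batches up to 8 * 3 = 24.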

@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@jerrychenhf jerrychenhf force-pushed the spec-decode-warmup-support branch from dcd6a04 to 40beea3 Compare December 1, 2025 02:25
@github-actions bot commented Dec 1, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@jerrychenhf (Contributor, Author)

@xuechendi Could you please review this?


# Leave space for the output token and draft tokens to propose
num_lookahead_tokens = 1
if self.speculative_config and self.speculative_config.use_eagle():
Collaborator: same question here
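For context, a minimal sketch of what such a lookahead reservation could look like; the branch body here is my assumption, not necessarily the code in this PR:

# Sketch (assumption): reserve space for the sampled output token plus the
# draft tokens that EAGLE's propose step will produce.
def compute_num_lookahead_tokens(speculative_config) -> int:
    num_lookahead_tokens = 1  # the sampled output token
    if speculative_config and speculative_config.use_eagle():
        # Assumed: one extra slot per speculative token the drafter proposes.
        num_lookahead_tokens += speculative_config.num_speculative_tokens
    return num_lookahead_tokens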

seq_len = (req_state.num_computed_tokens + scheduler_output.num_scheduled_tokens[req_id])
# Cannot use num_computed_tokens + num_scheduled_tokens here
# as it may include rejected spec decode tokens
seq_len = self.input_batch.num_tokens_no_spec[i]
Collaborator: Good catch!

@jerrychenhf jerrychenhf force-pushed the spec-decode-warmup-support branch from 5afdf31 to 01cf4ad Compare December 2, 2025 07:42
@jerrychenhf (Contributor, Author)

Originally, I took the simplified approach of leaving it to the user to configure the bucketing parameters so that they cover the shapes (batch_size, 1, context_blocks) needed for spec decode.
The new commit provides a solution in the bucket manager that automatically generates possible new buckets for spec decode based on the buckets configured by the user (seed buckets). This somewhat simplifies the user configuration.
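As an illustration only, a minimal sketch of that idea with hypothetical names (expand_seed_buckets is not necessarily the function in this PR):

# Sketch (assumption): derive additional decode buckets for spec decode from the
# user-configured seed buckets by scaling the batch dimension by num_tokens.
def expand_seed_buckets(seed_buckets, num_speculative_tokens):
    num_tokens = 1 + num_speculative_tokens
    expanded = set(seed_buckets)
    for bs, query_len, blocks in seed_buckets:
        # The flattened spec-decode batch runs as bs * num_tokens rows of one token each.
        expanded.add((bs * num_tokens, query_len, blocks))
    return sorted(expanded)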

Please note: the current spec decode has some tricky aspects regarding the bucketed batch size and context blocks (see the sketch after this list):

  1. We first get a padded_batch_size based on num_decodes:

padded_batch_size = self.bucketing_manager.find_decode_bucket(num_decodes, sum(num_blocks))[0]

  2. A virtual batch size is then derived as padded_batch_size * num_tokens. This is the real running batch size, and it is not guaranteed to be covered by any bucket (without special handling).
  3. The bucketed context_blocks is then found using the virtual batch size, but only the bucketed blocks are used; the bucketed batch size is ignored and the virtual batch size (padded_batch_size * num_tokens) is still used:
block_bucket_size = \
                self.bucketing_manager.find_decode_bucket(batch_size,
                                                          len(block_list))[2]
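Putting those three steps together, a rough standalone sketch (the wrapper function and argument names here are hypothetical; the quoted calls above are the real ones):

# Sketch (assumption): the spec-decode bucketing flow described in the list above.
def pick_decode_buckets(bucketing_manager, num_decodes, num_blocks, block_list,
                        num_speculative_tokens):
    num_tokens = 1 + num_speculative_tokens
    # Step 1: pad the number of decode requests to a configured decode bucket.
    padded_batch_size = bucketing_manager.find_decode_bucket(
        num_decodes, sum(num_blocks))[0]
    # Step 2: the virtual (flattened) batch size actually executed; it is not
    # itself guaranteed to match any configured bucket.
    batch_size = padded_batch_size * num_tokens
    # Step 3: look up the block bucket using the virtual batch size, but keep
    # the virtual batch size itself rather than the bucketed one.
    block_bucket_size = bucketing_manager.find_decode_bucket(
        batch_size, len(block_list))[2]
    return batch_size, block_bucket_size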

@github-actions bot commented Dec 2, 2025

🚧 CI Blocked

The main CI workflow was not started for the following reason:

Your branch is behind the base branch. Please merge or rebase to get the latest changes.

@xuechendi (Collaborator)

@jerrychenhf, please add profiling results to the ticket to show that the HPU graphs have been captured for spec decode.

Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
@jerrychenhf jerrychenhf force-pushed the spec-decode-warmup-support branch from c5a6068 to 5dc4522 Compare December 3, 2025 02:56
@jerrychenhf (Contributor, Author)

Updated the test results in GAUDISW-242931. With warmup enabled by this PR, running "PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task eagle3 --batch_size 8 --osl 2048 --num_spec_tokens 2" reported no configurations as "not warmed up".

@jerrychenhf jerrychenhf force-pushed the spec-decode-warmup-support branch from 5dc4522 to 5a66017 Compare December 3, 2025 03:52
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
@jerrychenhf jerrychenhf force-pushed the spec-decode-warmup-support branch from 5a66017 to 7061d1b Compare December 3, 2025 06:23
@github-actions bot commented Dec 3, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
d7284a2604ef3fe96f0779309caafb59860704bb

@xuechendi xuechendi merged commit b8515d5 into vllm-project:main Dec 3, 2025
44 checks passed
mhelf-intel pushed a commit to mhelf-intel/vllm-gaudi that referenced this pull request Dec 5, 2025