Spec decode warmup support #624
Conversation
@xuechendi Would you please review this?
# Leave space for the output token and draft tokens to propose
num_lookahead_tokens = 1
if self.speculative_config and self.speculative_config.use_eagle():
same question here
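For readers following along, here is a hedged sketch of the reservation logic the quoted diff implies. The function name is invented for illustration; the attribute names (`use_eagle`, `num_speculative_tokens`) are taken from the quoted snippet and from vLLM's speculative config, and this is not the actual implementation:

```python
# Sketch (hypothetical helper, names assumed from the quoted diff):
# reserve KV-cache slots for the sampled output token plus, under Eagle,
# the draft tokens the proposer will generate.
def compute_num_lookahead_tokens(speculative_config) -> int:
    # Always leave one slot for the sampled output token.
    num_lookahead_tokens = 1
    if speculative_config is not None and speculative_config.use_eagle():
        # Eagle additionally proposes draft tokens, so reserve space for them.
        num_lookahead_tokens += speculative_config.num_speculative_tokens
    return num_lookahead_tokens
```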
seq_len = (req_state.num_computed_tokens + scheduler_output.num_scheduled_tokens[req_id])
# Cannot use num_computed_tokens + num_scheduled_tokens here
# as it may include rejected spec decode tokens
seq_len = self.input_batch.num_tokens_no_spec[i]
Good catch!
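A toy illustration of why the two formulas in the diff above diverge when draft tokens are rejected (all numbers are invented for illustration):

```python
# With spec decode, num_scheduled_tokens includes draft tokens that may
# later be rejected, so num_computed + num_scheduled can overshoot the
# true sequence length. A counter of accepted-only tokens (what
# num_tokens_no_spec tracks) does not.
num_computed_tokens = 100
num_scheduled_tokens = 3      # 1 real token + 2 draft tokens
num_rejected = 2              # both drafts rejected

wrong_seq_len = num_computed_tokens + num_scheduled_tokens              # overshoots
true_seq_len = num_computed_tokens + num_scheduled_tokens - num_rejected
```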
Originally, I took the simplified approach that it is up to the user to configure the bucketing parameters properly to cover the shapes (batch_size, 1, context_blocks) for spec decode. Please note: the current spec decode has some tricky aspects regarding the bucketed batch size and context blocks:
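To make that configuration requirement concrete, here is a hypothetical coverage check. Everything here is illustrative (not the actual bucketing API); it only encodes the point that flattened spec decode multiplies the batch size the buckets must cover:

```python
# Hypothetical helper: verify that user-configured buckets cover the
# (batch_size, query_len=1, context_blocks) shapes spec decode will hit.
def bucket_covers(bucket_batch_sizes, bucket_blocks,
                  real_batch_size, num_spec_tokens, context_blocks):
    # With flattened spec decode, the effective batch size grows by
    # (1 + num_spec_tokens), so buckets must cover the larger value.
    effective_bs = real_batch_size * (1 + num_spec_tokens)
    bs_ok = any(b >= effective_bs for b in bucket_batch_sizes)
    blocks_ok = any(b >= context_blocks for b in bucket_blocks)
    return bs_ok and blocks_ok
```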
@jerrychenhf, please add profiling output to the ticket to show that the HPU graphs have been captured for spec decode.
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
Updated the test results in GAUDISW-242931. With warmup enabled by this PR, running "PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task eagle3 --batch_size 8 --osl 2048 --num_spec_tokens 2" shows no configurations reported as "not warmed up".
Signed-off-by: Chen Haifeng <haifeng.chen@intel.com>
GAUDISW-242931

Because spec decode currently flattens the spec decode tokens into [batch_size * num_tokens, 1], we can warm up the decode shapes as before. What changes is the maximum batch_size we should warm up in the configuration, because the real batch size is batch_size * num_tokens, which is num_tokens (1 + num_speculative_tokens) times the original batch size.

One thing to take care of in warmup is the draft token (and block) space for the proposing process in Eagle: we need to leave num_speculative_tokens of space for Eagle's propose step to use.

Another thing to take care of (already done in the PR supporting num_speculative_tokens > 1) is that warmup runs in compile-only mode, without real computation happening, so the operations in the drafter's prepare_attn_metadata that depend on real position values must be done on the CPU.

A separate issue, handling no spec decode tokens in the decode phase, has already been addressed in #593.
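The batch size arithmetic described above can be sketched as a worked example (plain arithmetic with the PR's values, not actual vLLM code):

```python
# With flattened spec decode, the decode step runs on shape
# [batch_size * num_tokens, 1], where num_tokens = 1 + num_speculative_tokens.
# The maximum batch size to warm up must therefore scale accordingly.
max_batch_size = 8                    # scheduler-level batch size
num_speculative_tokens = 2            # as in the test command above
num_tokens = 1 + num_speculative_tokens

# Effective batch size the HPU graph actually sees during decode:
effective_batch_size = max_batch_size * num_tokens
```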