[draft]qwen3-omni online reference server bugfix #4584
base: main
Conversation
Code Review
This pull request is marked as a draft bugfix, but it introduces two critical issues. First, a typo in NPUModelRunner's __init__ method will cause a runtime error. Second, the logic for determining the attention state in _build_attn_state has been commented out and hardcoded, which breaks the other attention modes such as decode and speculative decoding. Both changes look like debugging artifacts and should be reverted before this PR can be considered for merging.
  def __init__(self, vllm_config: VllmConfig, device: torch.device):
      self.vllm_config = vllm_config
-     self.model_config = vllm_config.model_config
+     self.model_config = vllm_config.model_configi
There appears to be a typo in this line. The attribute model_configi does not exist on vllm_config. This will cause an AttributeError during the initialization of NPUModelRunner. It should be model_config.
Suggested change:
-     self.model_config = vllm_config.model_configi
+     self.model_config = vllm_config.model_config
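A minimal sketch of why this fails at runtime. The two dataclasses below are hypothetical stand-ins for vllm's real config objects (which carry many more fields); the sketch only illustrates the AttributeError the typo triggers:

    from dataclasses import dataclass

    @dataclass
    class ModelConfig:           # hypothetical stand-in, not vllm's real ModelConfig
        model: str = "qwen3-omni"

    @dataclass
    class VllmConfig:            # hypothetical stand-in with only the field we need
        model_config: ModelConfig = None

    cfg = VllmConfig(model_config=ModelConfig())
    print(cfg.model_config)      # OK: ModelConfig(model='qwen3-omni')
    print(cfg.model_configi)     # AttributeError: 'VllmConfig' object has no attribute 'model_configi'

Because the typo sits in __init__, the error fires as soon as NPUModelRunner is constructed, before any request is served.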
attn_state = AscendAttentionState.ChunkedPrefill
# if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
#     attn_state = AscendAttentionState.PrefillNoCache
# # We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
# elif np.all(num_scheduled_tokens == 1):
#     attn_state = AscendAttentionState.DecodeOnly
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         # SpecDecoding now supports seq_len=1 and seq_len=2
#         # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
#         attn_state = AscendAttentionState.SpecDecoding
# # Speculative decoding.
# elif np.all(num_valid_tokens == 1):
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         attn_state = AscendAttentionState.SpecDecoding
#     else:
#         attn_state = AscendAttentionState.ChunkedPrefill
# # splitfuse
# elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
#     attn_state = AscendAttentionState.ChunkedPrefill
# else:
#     attn_state = AscendAttentionState.PrefillCacheHit
The logic for determining the attention state (attn_state) has been commented out and hardcoded to AscendAttentionState.ChunkedPrefill. This forces every attention computation down the chunked-prefill path, which is incorrect for decode, speculative decoding, and the other states, and will likely cause wrong behavior and performance regressions. It looks like a temporary debugging change that should be reverted before merging.
Suggested change (replace the hardcoded assignment and the commented-out block with the original logic):

if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
    attn_state = AscendAttentionState.PrefillNoCache
# We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
elif np.all(num_scheduled_tokens == 1):
    attn_state = AscendAttentionState.DecodeOnly
    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
        # SpecDecoding now supports seq_len=1 and seq_len=2
        # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
        attn_state = AscendAttentionState.SpecDecoding
# Speculative decoding.
elif np.all(num_valid_tokens == 1):
    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
        attn_state = AscendAttentionState.SpecDecoding
    else:
        attn_state = AscendAttentionState.ChunkedPrefill
# splitfuse
elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
    attn_state = AscendAttentionState.ChunkedPrefill
else:
    attn_state = AscendAttentionState.PrefillCacheHit
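For reviewers unfamiliar with this path, here is a rough standalone sketch of what the restored selection logic computes. The function name, parameters, and the enum below are simplified stand-ins, not the real vllm-ascend API, and the splitfuse branch is collapsed to a single flag (the real code also consults ascend_config.ascend_scheduler_config):

    import numpy as np
    from enum import Enum, auto

    class AscendAttentionState(Enum):  # simplified stand-in for the real enum
        PrefillNoCache = auto()
        PrefillCacheHit = auto()
        DecodeOnly = auto()
        ChunkedPrefill = auto()
        SpecDecoding = auto()

    def build_attn_state(seq_lens, num_scheduled_tokens, num_valid_tokens,
                         use_mtp_spec_decode, chunked_prefill_enabled):
        # Scheduled tokens cover each whole sequence -> fresh prefill, no cache hit.
        if np.array_equal(seq_lens, num_scheduled_tokens):
            return AscendAttentionState.PrefillNoCache
        # Exactly one token per request -> plain decode (or MTP speculative decode).
        if np.all(num_scheduled_tokens == 1):
            return (AscendAttentionState.SpecDecoding if use_mtp_spec_decode
                    else AscendAttentionState.DecodeOnly)
        # One *valid* token per request -> speculative-decoding path.
        if np.all(num_valid_tokens == 1):
            return (AscendAttentionState.SpecDecoding if use_mtp_spec_decode
                    else AscendAttentionState.ChunkedPrefill)
        # Mixed prefill/decode batches fall back to chunked prefill (splitfuse).
        if chunked_prefill_enabled:
            return AscendAttentionState.ChunkedPrefill
        return AscendAttentionState.PrefillCacheHit

    # Example: two requests each scheduling their full sequence -> PrefillNoCache.
    print(build_attn_state(np.array([5, 7]), np.array([5, 7]), np.array([5, 7]),
                           use_mtp_spec_decode=False, chunked_prefill_enabled=True))

Hardcoding ChunkedPrefill short-circuits every one of these branches, which is why the review flags it as unmergeable.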
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Force-pushed from 7e0b1d1 to 0a07c61
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Signed-off-by: leo-pony <nengjunma@outlook.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?