
Conversation

Meihan-chen (Contributor) commented Dec 1, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

gemini-code-assist bot left a comment


Code Review

This pull request appears to be a draft for a bugfix, but it introduces two critical issues. First, there's a typo in the NPUModelRunner's __init__ method that will cause a runtime error. Second, the logic for determining the attention state in _build_attn_state has been commented out and hardcoded, which will break different attention modes like decode and speculative decoding. These changes seem like debugging artifacts and should be reverted before this PR can be considered for merging.

def __init__(self, vllm_config: VllmConfig, device: torch.device):
    self.vllm_config = vllm_config
-   self.model_config = vllm_config.model_config
+   self.model_config = vllm_config.model_configi

critical

There appears to be a typo in this line. The attribute model_configi does not exist on vllm_config. This will cause an AttributeError during the initialization of NPUModelRunner. It should be model_config.

Suggested change
-   self.model_config = vllm_config.model_configi
+   self.model_config = vllm_config.model_config
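
Beyond the one-line fix, a small regression test would catch this class of typo automatically. The following is a minimal sketch only: the import paths, and the assumption that NPUModelRunner can be constructed against a spec'd mock config, are hypothetical and would need adjusting to the actual vllm-ascend layout.

# Hypothetical regression test. The import paths below are assumptions,
# not the real vllm-ascend module layout.
from unittest.mock import MagicMock

import torch
from vllm.config import VllmConfig  # assumed import path
from vllm_ascend.worker.model_runner import NPUModelRunner  # assumed import path

def test_init_reads_model_config():
    # spec=VllmConfig means any misspelled attribute access (such as
    # vllm_config.model_configi) raises AttributeError immediately.
    vllm_config = MagicMock(spec=VllmConfig)
    runner = NPUModelRunner(vllm_config, torch.device("cpu"))
    assert runner.model_config is vllm_config.model_config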

Comment on lines +1946 to +1966
attn_state = AscendAttentionState.ChunkedPrefill
# if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
#     attn_state = AscendAttentionState.PrefillNoCache
# # We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
# elif np.all(num_scheduled_tokens == 1):
#     attn_state = AscendAttentionState.DecodeOnly
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         # SpecDecoding now supports seq_len=1 and seq_len=2
#         # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
#         attn_state = AscendAttentionState.SpecDecoding
# # Speculative decoding.
# elif np.all(num_valid_tokens == 1):
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         attn_state = AscendAttentionState.SpecDecoding
#     else:
#         attn_state = AscendAttentionState.ChunkedPrefill
# # splitfuse
# elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
#     attn_state = AscendAttentionState.ChunkedPrefill
# else:
#     attn_state = AscendAttentionState.PrefillCacheHit

critical

The logic for determining the attention state (attn_state) has been commented out and hardcoded to AscendAttentionState.ChunkedPrefill. This will force all attention computations to use the chunked prefill path, which is incorrect for decode, speculative decoding, and other states. This change will likely lead to incorrect behavior and performance issues. This seems like a temporary debugging change that should be reverted before merging.

Suggested change
-attn_state = AscendAttentionState.ChunkedPrefill
-# if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
-#     attn_state = AscendAttentionState.PrefillNoCache
-# # We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
-# elif np.all(num_scheduled_tokens == 1):
-#     attn_state = AscendAttentionState.DecodeOnly
-#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
-#         # SpecDecoding now supports seq_len=1 and seq_len=2
-#         # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
-#         attn_state = AscendAttentionState.SpecDecoding
-# # Speculative decoding.
-# elif np.all(num_valid_tokens == 1):
-#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
-#         attn_state = AscendAttentionState.SpecDecoding
-#     else:
-#         attn_state = AscendAttentionState.ChunkedPrefill
-# # splitfuse
-# elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
-#     attn_state = AscendAttentionState.ChunkedPrefill
-# else:
-#     attn_state = AscendAttentionState.PrefillCacheHit
+if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
+    attn_state = AscendAttentionState.PrefillNoCache
+# We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
+elif np.all(num_scheduled_tokens == 1):
+    attn_state = AscendAttentionState.DecodeOnly
+    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
+        # SpecDecoding now supports seq_len=1 and seq_len=2
+        # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
+        attn_state = AscendAttentionState.SpecDecoding
+# Speculative decoding.
+elif np.all(num_valid_tokens == 1):
+    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
+        attn_state = AscendAttentionState.SpecDecoding
+    else:
+        attn_state = AscendAttentionState.ChunkedPrefill
+# splitfuse
+elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
+    attn_state = AscendAttentionState.ChunkedPrefill
+else:
+    attn_state = AscendAttentionState.PrefillCacheHit
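
For readers skimming outside the diff, the decision tree the suggested change restores can be restated as a pure function. This is an illustrative sketch only: the function name, the flat parameters, and the string return values are stand-ins for the real AscendAttentionState enum and the self./config attributes used above.

# Illustrative restatement of the suggested state-selection logic.
# All names here are stand-ins, not vllm-ascend API.
import numpy as np

def select_attn_state(seq_lens, num_scheduled_tokens, num_valid_tokens,
                      spec_method=None, ascend_scheduler_enabled=False,
                      chunked_prefill_enabled=True):
    # No cached prefix: every request computes its full sequence this step.
    if np.array_equal(seq_lens, num_scheduled_tokens):
        return "PrefillNoCache"
    # Pure decode: each request schedules exactly one token; deepseek_mtp
    # speculative decoding overrides plain decode when configured.
    if np.all(num_scheduled_tokens == 1):
        return "SpecDecoding" if spec_method == "deepseek_mtp" else "DecodeOnly"
    # Speculative decoding with a single valid token per request.
    if np.all(num_valid_tokens == 1):
        return "SpecDecoding" if spec_method == "deepseek_mtp" else "ChunkedPrefill"
    # splitfuse: chunked prefill covers mixed prefill/decode batches.
    if not ascend_scheduler_enabled or chunked_prefill_enabled:
        return "ChunkedPrefill"
    return "PrefillCacheHit"

# Example: a mid-generation decode step, one new token per request.
# select_attn_state(np.array([10, 7]), np.array([1, 1]), np.array([1, 1]))
# -> "DecodeOnly"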

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>

github-actions bot commented Dec 1, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, so that reviewers and future developers can understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Signed-off-by: leo-pony <nengjunma@outlook.com>

github-actions bot commented Dec 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
