[draft]qwen3-omni online reference server bugfix #4584
base: main
Conversation
Code Review
This pull request is marked as a draft bugfix, but it introduces two critical issues. First, a typo in NPUModelRunner's __init__ method will cause a runtime error. Second, the logic for determining the attention state in _build_attn_state has been commented out and hardcoded, which breaks the other attention modes such as decode and speculative decoding. Both changes look like debugging artifacts and should be reverted before this PR can be considered for merging.
  def __init__(self, vllm_config: VllmConfig, device: torch.device):
      self.vllm_config = vllm_config
-     self.model_config = vllm_config.model_config
+     self.model_config = vllm_config.model_configi
There appears to be a typo in this line. The attribute model_configi does not exist on vllm_config. This will cause an AttributeError during the initialization of NPUModelRunner. It should be model_config.
Suggested change:
-     self.model_config = vllm_config.model_configi
+     self.model_config = vllm_config.model_config
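A minimal sketch of why this fails at runtime. The two dataclasses below are hypothetical stand-ins for vllm's real config objects (which carry many more fields); the sketch only illustrates the AttributeError the typo triggers:

    from dataclasses import dataclass

    @dataclass
    class ModelConfig:           # hypothetical stand-in, not vllm's real ModelConfig
        model: str = "qwen3-omni"

    @dataclass
    class VllmConfig:            # hypothetical stand-in with only the field we need
        model_config: ModelConfig = None

    cfg = VllmConfig(model_config=ModelConfig())
    print(cfg.model_config)      # OK: ModelConfig(model='qwen3-omni')
    print(cfg.model_configi)     # AttributeError: 'VllmConfig' object has no attribute 'model_configi'

Because the typo sits in __init__, the error fires as soon as NPUModelRunner is constructed, before any request is served.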
attn_state = AscendAttentionState.ChunkedPrefill
# if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
#     attn_state = AscendAttentionState.PrefillNoCache
# # We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
# elif np.all(num_scheduled_tokens == 1):
#     attn_state = AscendAttentionState.DecodeOnly
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         # SpecDecoding now supports seq_len=1 and seq_len=2
#         # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
#         attn_state = AscendAttentionState.SpecDecoding
# # Speculative decoding.
# elif np.all(num_valid_tokens == 1):
#     if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
#         attn_state = AscendAttentionState.SpecDecoding
#     else:
#         attn_state = AscendAttentionState.ChunkedPrefill
# # splitfuse
# elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
#     attn_state = AscendAttentionState.ChunkedPrefill
# else:
#     attn_state = AscendAttentionState.PrefillCacheHit
The logic for determining the attention state (attn_state) has been commented out and hardcoded to AscendAttentionState.ChunkedPrefill. This forces every attention computation down the chunked-prefill path, which is incorrect for decode, speculative decoding, and the other states, and will likely cause wrong behavior and performance regressions. It looks like a temporary debugging change that should be reverted before merging.
Suggested change (replace the hardcoded assignment and the commented-out block with the original logic):

if np.array_equal(self.seq_lens_np[:num_reqs], num_scheduled_tokens):
    attn_state = AscendAttentionState.PrefillNoCache
# We assume it is the decode stage, where prefill occurs but only one token is not hit in cache.
elif np.all(num_scheduled_tokens == 1):
    attn_state = AscendAttentionState.DecodeOnly
    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
        # SpecDecoding now supports seq_len=1 and seq_len=2
        # In Prefilling Decoding Disaggregation scenario, SpecDecoding need to supports seq_len=1
        attn_state = AscendAttentionState.SpecDecoding
# Speculative decoding.
elif np.all(num_valid_tokens == 1):
    if self.speculative_config and self.speculative_config.method == 'deepseek_mtp':
        attn_state = AscendAttentionState.SpecDecoding
    else:
        attn_state = AscendAttentionState.ChunkedPrefill
# splitfuse
elif not ascend_config.ascend_scheduler_config.enabled or self.chunked_prefill_enabled:
    attn_state = AscendAttentionState.ChunkedPrefill
else:
    attn_state = AscendAttentionState.PrefillCacheHit
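For reviewers unfamiliar with this path, here is a rough standalone sketch of what the restored selection logic computes. The function name, parameters, and the enum below are simplified stand-ins, not the real vllm-ascend API, and the splitfuse branch is collapsed to a single flag (the real code also consults ascend_config.ascend_scheduler_config):

    import numpy as np
    from enum import Enum, auto

    class AscendAttentionState(Enum):  # simplified stand-in for the real enum
        PrefillNoCache = auto()
        PrefillCacheHit = auto()
        DecodeOnly = auto()
        ChunkedPrefill = auto()
        SpecDecoding = auto()

    def build_attn_state(seq_lens, num_scheduled_tokens, num_valid_tokens,
                         use_mtp_spec_decode, chunked_prefill_enabled):
        # Scheduled tokens cover each whole sequence -> fresh prefill, no cache hit.
        if np.array_equal(seq_lens, num_scheduled_tokens):
            return AscendAttentionState.PrefillNoCache
        # Exactly one token per request -> plain decode (or MTP speculative decode).
        if np.all(num_scheduled_tokens == 1):
            return (AscendAttentionState.SpecDecoding if use_mtp_spec_decode
                    else AscendAttentionState.DecodeOnly)
        # One *valid* token per request -> speculative-decoding path.
        if np.all(num_valid_tokens == 1):
            return (AscendAttentionState.SpecDecoding if use_mtp_spec_decode
                    else AscendAttentionState.ChunkedPrefill)
        # Mixed prefill/decode batches fall back to chunked prefill (splitfuse).
        if chunked_prefill_enabled:
            return AscendAttentionState.ChunkedPrefill
        return AscendAttentionState.PrefillCacheHit

    # Example: two requests each scheduling their full sequence -> PrefillNoCache.
    print(build_attn_state(np.array([5, 7]), np.array([5, 7]), np.array([5, 7]),
                           use_mtp_spec_decode=False, chunked_prefill_enabled=True))

Hardcoding ChunkedPrefill short-circuits every one of these branches, which is why the review flags it as unmergeable.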
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Force-pushed from 7e0b1d1 to 0a07c61
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Signed-off-by: leo-pony <nengjunma@outlook.com>
This pull request has conflicts; please resolve them before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?