
Conversation


@zhenwenqi2024 zhenwenqi2024 commented Dec 1, 2025

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>

github-actions bot commented Dec 1, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request refactors NPUModelRunner to inherit from GPUModelRunner, which is a good step towards reducing code duplication and improving maintainability. However, the implementation has a couple of critical issues. The __init__ method of NPUModelRunner does not call the parent's __init__, leading to an improperly initialized object. Additionally, the load_model method has been removed, which causes the loss of crucial NPU-specific model loading logic. These issues need to be addressed to ensure the correctness and stability of the NPU backend.

-class NPUModelRunner(LoRAModelRunnerMixin):
+class NPUModelRunner(GPUModelRunner, LoRAModelRunnerMixin):

     def __init__(self, vllm_config: VllmConfig, device: torch.device):

critical

The NPUModelRunner now inherits from GPUModelRunner, but its __init__ method does not call super().__init__(). This is a critical issue as it breaks the inheritance chain, and the GPUModelRunner part of the object is not properly initialized. This can lead to AttributeErrors and other subtle bugs.

The __init__ method of NPUModelRunner duplicates a lot of code from GPUModelRunner.__init__. The correct approach is to call super().__init__(vllm_config, device) at the beginning of NPUModelRunner.__init__ and then only perform NPU-specific initializations or overrides. This will make the code more maintainable and less prone to errors when the base GPUModelRunner changes.
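A minimal sketch of the suggested structure (the comments are illustrative; which attributes actually need NPU-specific treatment depends on the real runner):

class NPUModelRunner(GPUModelRunner, LoRAModelRunnerMixin):

    def __init__(self, vllm_config: VllmConfig, device: torch.device):
        # Initialize all shared GPUModelRunner state first so the
        # inheritance chain is set up correctly.
        super().__init__(vllm_config, device)
        # ... then perform only NPU-specific initialization here,
        # overriding base attributes where the NPU backend differs.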

Comment on lines 3178 to -3206
        self.eplb_updator.warm_up_eplb()

    def load_model(self) -> None:
        logger.info("Starting to load model %s...", self.model_config.model)

        with DeviceMemoryProfiler() as m:  # noqa: SIM117
            self.model = get_model(vllm_config=self.vllm_config)
            if self.dynamic_eplb:
                model_register(self.model, self.model_config)
            if get_ascend_device_type() == AscendDeviceType._310P:
                from vllm.model_executor.layers.linear import (
                    MergedColumnParallelLinear, QKVParallelLinear,
                    RowParallelLinear)
                for module in self.model.modules():
                    if isinstance(module,
                                  (MergedColumnParallelLinear,
                                   QKVParallelLinear, RowParallelLinear)):
                        module.weight.data = self._convert_torch_format(
                            module.weight.data)
            if self.drafter:
                logger.info("Loading drafter model...")
                self.drafter.load_model(self.model)
                if self.drafter.name == SpecDcodeType.EAGLE3:
                    self.model.set_aux_hidden_state_layers(
                        self.model.get_eagle3_aux_hidden_state_layers())

            if self.lora_config:
                self.model = self.load_lora_model(self.model,
                                                  self.vllm_config,
                                                  self.device)
        logger.info("Loading model weights took %.4f GB",
                    m.consumed_memory / float(2**30))

critical

The load_model method has been completely removed from NPUModelRunner. Since NPUModelRunner now inherits from GPUModelRunner, this means GPUModelRunner.load_model will be called. However, the original NPUModelRunner.load_model contained crucial NPU-specific logic that is now lost. This includes:

  • model_register for dynamic_eplb.
  • Weight format conversion for _310P devices.
  • Setting auxiliary hidden state layers for EAGLE3 speculative decoding.

This will likely break these features. Instead of removing the method, you should override it, call super().load_model(), and then execute the NPU-specific logic.
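A sketch of that override, reusing the NPU-specific pieces from the removed method (this assumes GPUModelRunner.load_model already handles the base weight loading, drafter, and LoRA setup; verify against the actual base implementation before relying on it):

    def load_model(self) -> None:
        # Let GPUModelRunner perform the common model loading.
        super().load_model()

        # Re-apply the NPU-specific logic from the removed method.
        if self.dynamic_eplb:
            model_register(self.model, self.model_config)
        if get_ascend_device_type() == AscendDeviceType._310P:
            from vllm.model_executor.layers.linear import (
                MergedColumnParallelLinear, QKVParallelLinear,
                RowParallelLinear)
            for module in self.model.modules():
                if isinstance(module,
                              (MergedColumnParallelLinear,
                               QKVParallelLinear, RowParallelLinear)):
                    module.weight.data = self._convert_torch_format(
                        module.weight.data)
        if self.drafter and self.drafter.name == SpecDcodeType.EAGLE3:
            self.model.set_aux_hidden_state_layers(
                self.model.get_eagle3_aux_hidden_state_layers())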


github-actions bot commented Dec 1, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.
