eagle reconstruct #4583
base: main
Conversation
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors NPUModelRunner to inherit from GPUModelRunner, which is a good step towards reducing code duplication and improving maintainability. However, the implementation has a couple of critical issues. The __init__ method of NPUModelRunner does not call the parent's __init__, leading to an improperly initialized object. Additionally, the load_model method has been removed, which causes the loss of crucial NPU-specific model loading logic. These issues need to be addressed to ensure the correctness and stability of the NPU backend.
```diff
-class NPUModelRunner(LoRAModelRunnerMixin):
+class NPUModelRunner(GPUModelRunner, LoRAModelRunnerMixin):

     def __init__(self, vllm_config: VllmConfig, device: torch.device):
```
The NPUModelRunner now inherits from GPUModelRunner, but its __init__ method never calls super().__init__(). This is a critical issue: it breaks the inheritance chain, so the GPUModelRunner state of the object is never initialized, which can lead to AttributeErrors and other subtle bugs.
The __init__ method of NPUModelRunner also duplicates a large amount of code from GPUModelRunner.__init__. The correct approach is to call super().__init__(vllm_config, device) at the start of NPUModelRunner.__init__ and then perform only the NPU-specific initializations or overrides. This keeps the code maintainable and less prone to breakage when the base GPUModelRunner changes.
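A minimal, self-contained sketch of the cooperative-__init__ pattern described above. The classes here are illustrative stand-ins, not the real vLLM classes, and SimpleNamespace plays the role of VllmConfig:

```python
from types import SimpleNamespace


class GPUModelRunner:
    def __init__(self, vllm_config, device):
        # Base-class state that NPUModelRunner must not skip.
        self.vllm_config = vllm_config
        self.device = device
        self.lora_config = getattr(vllm_config, "lora_config", None)


class NPUModelRunner(GPUModelRunner):
    def __init__(self, vllm_config, device):
        # Initialize the full GPUModelRunner state first...
        super().__init__(vllm_config, device)
        # ...then apply only NPU-specific additions or overrides.
        self.dynamic_eplb = getattr(vllm_config, "dynamic_eplb", False)


runner = NPUModelRunner(SimpleNamespace(dynamic_eplb=True), device="npu:0")
```

With this shape, any field the base class adds in a future release is picked up automatically instead of silently missing from the NPU runner.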
```diff
     self.eplb_updator.warm_up_eplb()

-    def load_model(self) -> None:
-        logger.info("Starting to load model %s...", self.model_config.model)
-
-        with DeviceMemoryProfiler() as m:  # noqa: SIM117
-            self.model = get_model(vllm_config=self.vllm_config)
-            if self.dynamic_eplb:
-                model_register(self.model, self.model_config)
-            if get_ascend_device_type() == AscendDeviceType._310P:
-                from vllm.model_executor.layers.linear import (
-                    MergedColumnParallelLinear, QKVParallelLinear,
-                    RowParallelLinear)
-                for module in self.model.modules():
-                    if isinstance(module,
-                                  (MergedColumnParallelLinear,
-                                   QKVParallelLinear, RowParallelLinear)):
-                        module.weight.data = self._convert_torch_format(
-                            module.weight.data)
-            if self.drafter:
-                logger.info("Loading drafter model...")
-                self.drafter.load_model(self.model)
-                if self.drafter.name == SpecDcodeType.EAGLE3:
-                    self.model.set_aux_hidden_state_layers(
-                        self.model.get_eagle3_aux_hidden_state_layers())
-
-            if self.lora_config:
-                self.model = self.load_lora_model(self.model, self.vllm_config,
-                                                  self.device)
-        logger.info("Loading model weights took %.4f GB",
-                    m.consumed_memory / float(2**30))
```
The load_model method has been completely removed from NPUModelRunner. Since NPUModelRunner now inherits from GPUModelRunner, GPUModelRunner.load_model will be called instead. However, the original NPUModelRunner.load_model contained crucial NPU-specific logic that is now lost. This includes:
- model_register for dynamic_eplb.
- Weight-format conversion for _310P devices.
- Setting auxiliary hidden state layers for EAGLE3 speculative decoding.

This will likely break these features. Instead of removing the method, you should override it, call super().load_model(), and then execute the NPU-specific logic.
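A self-contained sketch of the suggested fix: override load_model, run the shared GPUModelRunner path via super(), then append the NPU-specific steps. These classes are illustrative stand-ins for the real vLLM classes, with strings and a step list standing in for actual model loading:

```python
class GPUModelRunner:
    def load_model(self):
        # Stand-in for the common model-loading path in the base class.
        self.model = "base-model"
        self.steps = ["base-load"]


class NPUModelRunner(GPUModelRunner):
    def load_model(self):
        super().load_model()  # keep the shared loading logic
        # NPU-specific post-processing would go here in the real code:
        # model_register for dynamic_eplb, 310P weight-format conversion,
        # and EAGLE3 auxiliary hidden-state layers.
        self.steps.append("npu-post-load")


runner = NPUModelRunner()
runner.load_model()
```

The ordering matters: the NPU steps operate on self.model, so the base-class path must have run first.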
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?