Motivation.
To avoid maintaining a separate set of modeling files, we propose to remove all modeling files from vllm-ascend. To achieve this, several refactors need to be done for multi-modal models in both vllm and vllm-ascend.
Proposed Change.
vllm:
- Extract Qwen MMEncoder layer as custom op. @shen-shanshan
- Extract `apply_rotary_emb` as CustomOp. @shen-shanshan [CustomOp] Extract `apply_rotary_emb` as CustomOp and unify the dispatch logic vllm#29873
- Extract conv layer as CustomOp (a CustomOp sketch follows this list). @shen-shanshan [Model][MM] Extract conv layer as CustomOp vllm#28455
- Use caching to remove repeated sin/cos computations (see the caching sketch after this list). @gcanlin [Model][Perf] Use cos and sin cache in QwenVL vllm#28798
- Remove redundant TP logic in split_qkv. @gcanlin [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL vllm#28271
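For context on the CustomOp items above: wrapping an encoder layer as a CustomOp lets an out-of-tree platform supply its own kernel through the dispatch mechanism instead of patching the modeling file. A minimal sketch, assuming vLLM's `CustomOp` base class with its `register`/`forward_native` interface; the op name and the layer shown are illustrative, not the actual PR code:

```python
import torch
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("qwen_vision_patch_conv")  # op name is illustrative
class QwenVisionPatchConv(CustomOp):
    """Patch-embedding conv of the Qwen ViT, extracted as a CustomOp.

    CustomOp.forward() dispatches to the platform-specific implementation
    (forward_cuda / forward_cpu / forward_oot), so an out-of-tree backend
    such as vllm-ascend can plug in its own kernel without touching the
    modeling file.
    """

    def __init__(self, in_channels: int, hidden_size: int, patch_size: int) -> None:
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.empty(hidden_size, in_channels, patch_size, patch_size))
        self.patch_size = patch_size

    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Reference implementation: a plain strided conv over image patches.
        return F.conv2d(x, self.weight, stride=self.patch_size)
```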
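The sin/cos caching item boils down to computing the rotary tables once for the largest sequence seen so far and slicing them afterwards, rather than rebuilding them on every ViT forward. A rough sketch of that idea (not the code from vllm#28798):

```python
import torch


class VisionRotaryCache:
    """Caches rotary cos/sin tables so repeated ViT forwards reuse them."""

    def __init__(self, dim: int, theta: float = 10000.0) -> None:
        self.inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self._cached_len = 0
        self._cos = None
        self._sin = None

    def get(self, seqlen: int, device: torch.device) -> tuple[torch.Tensor, torch.Tensor]:
        # Only recompute when a longer sequence than any seen so far arrives.
        if seqlen > self._cached_len:
            t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
            freqs = torch.outer(t, self.inv_freq.to(device))
            self._cos, self._sin = freqs.cos(), freqs.sin()
            self._cached_len = seqlen
        return self._cos[:seqlen], self._sin[:seqlen]
```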
vllm-ascend:
- Patch VisionAttention layer and remove Qwen2.5-VL modeling files. @shen-shanshan [MM][Model][Perf] Remove Qwen2.5-VL modeling files and add patch for VisionAttention #4349
- Remove Qwen2-VL modeling files. @shen-shanshan [MM][Model] Remove Qwen2-VL modeling files #4534
- Remove Qwen3-VL and Qwen3-VL-MoE modeling files. @shen-shanshan [MM][Model] Remove Qwen3-VL modeling files #4577
- Implement Ascend ViT custom op and register it (see the registration sketch after this list). @shen-shanshan
- Implement `multimodal_cpu_fields` in the model runner to guarantee that `grid_thw` is moved to CPU before converting to numpy (see the sketch after this list). @zhangxinyuehfad
- Refactor `set_ascend_forward_context()` to remove the patch for ViT embedding. @gcanlin
- Remove the patch for cos/sin cache. @shen-shanshan
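Once the encoder pieces are CustomOps in vLLM, the Ascend plugin only has to register its own implementation rather than carry a full copy of the modeling file. A sketch of the intended shape, assuming an out-of-tree registration hook along the lines of `CustomOp.register_oot`; the hook, class, and op names here are assumptions, not confirmed API:

```python
import torch
import torch.nn.functional as F

from vllm.model_executor.custom_op import CustomOp


class AscendQwenVisionPatchConv(CustomOp):
    """Ascend-side override for the vision patch-conv op sketched earlier."""

    def __init__(self, weight: torch.Tensor, patch_size: int) -> None:
        super().__init__()
        self.weight = weight
        self.patch_size = patch_size

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # forward_oot is the CustomOp dispatch target for out-of-tree
        # platforms; an NPU-optimised kernel would replace this fallback.
        return F.conv2d(x, self.weight, stride=self.patch_size)


# Hypothetical registration hook: tell vLLM to use the Ascend class whenever
# the op named "qwen_vision_patch_conv" is built on this platform.
CustomOp.register_oot(_decorated_op_cls=AscendQwenVisionPatchConv,
                      name="qwen_vision_patch_conv")
```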
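The `multimodal_cpu_fields` item addresses a device-placement pitfall: `grid_thw` arrives as a device tensor, and NPU tensors (like CUDA tensors) cannot be converted to numpy directly, so the model runner has to move the listed fields to CPU first. A minimal sketch of that behaviour; the runner hook and any field names beyond `grid_thw` are illustrative:

```python
import torch

# Fields the model runner keeps on CPU because they are later consumed as
# numpy arrays / Python ints (e.g. grid_thw for Qwen-VL).
multimodal_cpu_fields = frozenset({"image_grid_thw", "video_grid_thw"})


def move_cpu_fields(mm_kwargs: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Move the whitelisted multimodal fields to CPU before numpy conversion.

    Calling .numpy() on an NPU/CUDA tensor raises; calling .cpu() first makes
    the later grid_thw.numpy() calls in the vision tower safe.
    """
    return {
        key: value.cpu() if key in multimodal_cpu_fields else value
        for key, value in mm_kwargs.items()
    }
```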
Other related:
- Make mamba backend pluggable (see the selector sketch below). @shen-shanshan [Model][Mamba] Add selector for mamba attention backend and make it pluggable for other device vllm#26487
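Making the mamba backend pluggable follows the same pattern as the attention backend selector: the backend class is resolved through the current platform instead of being hard-coded, so vllm-ascend can return its own. A rough sketch of such a selector, assuming a platform hook like `get_mamba_attn_backend_cls`; the hook name is hypothetical, not the actual vllm#26487 interface:

```python
from importlib import import_module

from vllm.platforms import current_platform


def get_mamba_attn_backend(use_v2: bool = True):
    """Resolve the mamba attention backend class for the current platform."""
    # Hypothetical platform hook: routing the choice through the platform
    # interface lets out-of-tree plugins (e.g. vllm-ascend) return their own
    # backend class path without patching this selector.
    backend_path = current_platform.get_mamba_attn_backend_cls(use_v2)
    module_name, _, class_name = backend_path.rpartition(".")
    return getattr(import_module(module_name), class_name)
```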
Feedback Period.
No response
CC List.
Any Other Things.
No response