Remove upstream fa checks #29471
Conversation
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
else:
    return AttentionBackendEnum.TORCH_SDPA, None
```
Restore CUDA path in vit flash-attn selection
maybe_get_vit_flash_attn_backend now returns TORCH_SDPA for any platform that is neither ROCm nor XPU. CUDA is caught by this else branch, so even when get_vit_attn_backend selects FLASH_ATTN the function forces Torch SDPA and never loads the flash attention kernel. This effectively disables flash attention for all vision models on CUDA, degrading the intended fast path everywhere and ignoring user overrides.
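To make the regression concrete, here is a minimal sketch of the selection logic the review describes. The function name `maybe_get_vit_flash_attn_backend` comes from the PR; the enum values and the ROCm/XPU handling are simplified stand-ins, not vLLM's actual implementation. The key point is that the final branch must return the already-selected backend rather than forcing `TORCH_SDPA`, or CUDA silently loses flash attention:

```python
from enum import Enum

# Stand-in for vLLM's AttentionBackendEnum; values here are illustrative.
class AttentionBackendEnum(Enum):
    FLASH_ATTN = "flash_attn"
    TORCH_SDPA = "torch_sdpa"

def maybe_get_vit_flash_attn_backend(selected, is_rocm=False, is_xpu=False):
    # ROCm and XPU keep their own special-cased handling (simplified here).
    if is_rocm or is_xpu:
        return AttentionBackendEnum.TORCH_SDPA, None
    # CUDA and other platforms: honor the backend chosen upstream
    # (e.g. FLASH_ATTN) instead of falling into a TORCH_SDPA catch-all.
    return selected, None
```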
Code Review
This pull request aims to remove upstream flash-attention checks. While most of the changes correctly remove the use_upstream_fa flag and related logic, I've identified a few critical issues. There's a regression in vllm/attention/layer.py that would disable FlashAttention for ViT models on CUDA. Additionally, there are remaining usages and imports of a removed function (check_upstream_fa_availability) in vllm/model_executor/models/paddleocr_vl.py and vllm/model_executor/models/qwen3_vl.py, which will cause runtime errors. These issues need to be addressed to ensure the correctness and performance of the codebase.
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Need to check with AMD folks if there is a need from their side @gshtras

@ywang96 Please also take a look at the logic in maybe_get_vit_flash_attn_backend, thanks!
vllm/attention/layer.py
Outdated
```python
):
    attn_backend = AttentionBackendEnum.FLASH_ATTN
    use_upstream_fa = True
elif attn_backend_override is None \
```
We need to add back the on_gfx9() condition here to differentiate between Radeon and Instinct GPUs. On Radeon, only TORCH_SDPA is supported.
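A sketch of what the reviewer is asking for: on ROCm, only Instinct (gfx9) GPUs should take the flash-attn path, while Radeon falls back to torch SDPA. `on_gfx9` is referenced in the thread; the helper name and return values here are hypothetical:

```python
def select_rocm_vit_backend(on_gfx9: bool) -> str:
    """Hypothetical helper: gate the ROCm ViT flash-attn path on gfx9."""
    if on_gfx9:
        return "FLASH_ATTN"  # Instinct (gfx9): flash attention supported
    return "TORCH_SDPA"      # Radeon: only SDPA is supported
```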
Just pushed these changes, thanks! Please comment if there is anything else you notice.
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
```diff
-    from flash_attn import flash_attn_varlen_func
-else:
-    from vllm.attention.utils.fa_utils import flash_attn_varlen_func
+from vllm.attention.utils.fa_utils import flash_attn_varlen_func
```
vllm/attention/utils/fa_utils.py does not have the logic for ROCm, so flash_attn_varlen_func will be None if imported this way.
We can keep the `from flash_attn import flash_attn_varlen_func` import statement for now. Otherwise, we have to add that `from flash_attn import flash_attn_varlen_func` import into vllm/attention/utils/fa_utils.py for the case when the platform is ROCm.
I added this import to fa_utils as it looks like the simplest way to do it, and the except message tells the user to install upstream flash-attn when an ImportError is raised. Please check if that works, thanks!
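The guarded-import pattern described in this exchange can be sketched as follows. This is a generic illustration, not vLLM's actual fa_utils code: the helper name `load_attr` is hypothetical, and the real module (`flash_attn`) and function (`flash_attn_varlen_func`) from the thread would be passed as its arguments:

```python
import importlib

def load_attr(module_name: str, attr: str):
    """Import module_name and return its attr, raising an ImportError
    that tells the user to install upstream flash-attn on failure."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"{module_name} is not available; on ROCm, install upstream "
            f"flash-attn so that {attr} can be imported"
        ) from e
    return getattr(mod, attr)
```

Usage would look like `load_attr("flash_attn", "flash_attn_varlen_func")`, so a missing package surfaces an actionable message instead of a silent `None`.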
```diff
-    from flash_attn import flash_attn_varlen_func
-else:
-    from vllm.attention.utils.fa_utils import flash_attn_varlen_func
+from vllm.attention.utils.fa_utils import flash_attn_varlen_func
```
Likewise, vllm/attention/utils/fa_utils.py does not have the logic for ROCm, so flash_attn_varlen_func will be None if imported this way.
We can keep the `from flash_attn import flash_attn_varlen_func` import statement for now. Otherwise, we have to add that `from flash_attn import flash_attn_varlen_func` import into vllm/attention/utils/fa_utils.py for the case when the platform is ROCm.
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
ywang96 left a comment
I fixed the precommit error but otherwise LGTM
cc @tjtanaa for final check on the changes for resolving FA import on ROCM platform.
@ywang96 Thanks. LGTM. It is using flash attention and AITER flash attention.
Purpose
According to #28763, vLLM's flash attention now supports all head sizes for the ViT module, so the upstream flash-attn checks are no longer necessary and are removed.
Test Plan
Use one of the impacted models, Qwen3-VL-235B, as an example to start the server:

```shell
vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 -tp 4 -dp 1 --mm-encoder-tp-mode data --enable-expert-parallel --async-scheduling --max-num-seqs 1024
```

Test Result
Server started successfully
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.