
Conversation

@Victor49152 (Contributor) commented Nov 26, 2025

Purpose

Per #28763, vLLM's flash attention now supports all head sizes for the ViT module, so this PR removes the upstream flash-attn checks, which are no longer necessary.

Test Plan

Use one of the impacted models, Qwen3-VL-235B, as an example to start the server:
vllm serve RedHatAI/Qwen3-VL-235B-A22B-Instruct-NVFP4 -tp 4 -dp 1 --mm-encoder-tp-mode data --enable-expert-parallel --async-scheduling --max-num-seqs 1024

Test Result

Server started successfully


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
@mergify mergify bot added the qwen Related to Qwen models label Nov 26, 2025
@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 70 to 71
else:
return AttentionBackendEnum.TORCH_SDPA, None


P1: Restore CUDA path in ViT flash-attn selection

maybe_get_vit_flash_attn_backend now returns TORCH_SDPA for any platform that is neither ROCm nor XPU. CUDA is caught by this else branch, so even when get_vit_attn_backend selects FLASH_ATTN the function forces Torch SDPA and never loads the flash attention kernel. This effectively disables flash attention for all vision models on CUDA, degrading the intended fast path everywhere and ignoring user overrides.
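The fix the review asks for can be sketched as follows. This is a minimal illustration with hypothetical names (`select_vit_attn_backend` and the platform flags are stand-ins), not the actual vLLM code: only ROCm and XPU get platform-specific fallbacks, while every other platform keeps the backend chosen upstream.

```python
from enum import Enum, auto

class AttentionBackendEnum(Enum):
    FLASH_ATTN = auto()
    ROCM_AITER_FA = auto()
    TORCH_SDPA = auto()

def select_vit_attn_backend(requested, is_rocm=False, is_xpu=False,
                            aiter_fa_available=False):
    # Only ROCm and XPU are rerouted to fallbacks; any other platform
    # (CUDA included) keeps the upstream-selected backend instead of
    # hitting a catch-all else that forces TORCH_SDPA.
    if is_rocm:
        if aiter_fa_available:
            return AttentionBackendEnum.ROCM_AITER_FA
        return AttentionBackendEnum.TORCH_SDPA
    if is_xpu:
        return AttentionBackendEnum.TORCH_SDPA
    return requested

# On CUDA the FLASH_ATTN choice survives:
print(select_vit_attn_backend(AttentionBackendEnum.FLASH_ATTN).name)
```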


@gemini-code-assist (bot) left a comment


Code Review

This pull request aims to remove upstream flash-attention checks. While most of the changes correctly remove the use_upstream_fa flag and related logic, I've identified a few critical issues. There's a regression in vllm/attention/layer.py that would disable FlashAttention for ViT models on CUDA. Additionally, there are remaining usages and imports of a removed function (check_upstream_fa_availability) in vllm/model_executor/models/paddleocr_vl.py and vllm/model_executor/models/qwen3_vl.py, which will cause runtime errors. These issues need to be addressed to ensure the correctness and performance of the codebase.

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
@mgoin (Member) commented Nov 26, 2025

Need to check with AMD folks if there is a need from their side @gshtras

@Victor49152 (Contributor, Author)

@ywang96 Please also take a look at the logic in maybe_get_vit_flash_attn_backend, thanks!

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
):
attn_backend = AttentionBackendEnum.FLASH_ATTN
use_upstream_fa = True
elif attn_backend_override is None \
Collaborator


We need to add back the on_gfx9() condition here to differentiate between Radeon and Instinct GPUs.

On Radeon, only TORCH_SDPA is supported.
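The requested gating can be sketched like this. The `on_gfx9` helper here is a hypothetical stand-in for the real check, relying on the fact that Instinct (CDNA) parts report gfx9xx architecture strings (e.g. gfx90a, gfx942) while Radeon (RDNA) parts report gfx10xx and newer:

```python
def on_gfx9(gcn_arch: str) -> bool:
    """Hypothetical stand-in for the on_gfx9() check: Instinct (CDNA)
    GPUs report gfx9xx arch strings, Radeon (RDNA) report gfx10xx+."""
    return gcn_arch.startswith("gfx9")

def pick_rocm_vit_backend(gcn_arch: str) -> str:
    # Per the review: flash attention only on Instinct (gfx9) GPUs;
    # Radeon falls back to TORCH_SDPA.
    return "FLASH_ATTN" if on_gfx9(gcn_arch) else "TORCH_SDPA"

print(pick_rocm_vit_backend("gfx942"))   # Instinct MI300-class
print(pick_rocm_vit_backend("gfx1100"))  # Radeon RX 7900-class
```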

Contributor Author


Just pushed the changes, thanks! Please comment if there is anything else you notice.

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
from flash_attn import flash_attn_varlen_func
else:
from vllm.attention.utils.fa_utils import flash_attn_varlen_func
from vllm.attention.utils.fa_utils import flash_attn_varlen_func
@tjtanaa (Collaborator) commented Nov 26, 2025


vllm/attention/utils/fa_utils.py does not have the logic for ROCm; flash_attn_varlen_func will be a None object if imported this way.

We can keep the import statement from flash_attn import flash_attn_varlen_func for now. Otherwise, we have to add that from flash_attn import flash_attn_varlen_func import statement to vllm/attention/utils/fa_utils.py when the platform is ROCm.

Contributor Author


I added this import to fa_utils, as that looks like the simplest way to do it. The except message tells the user to install upstream flash-attn when an ImportError is raised. Please check whether that works, thanks!
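A minimal sketch of that approach (illustrative names and control flow, not the exact fa_utils wiring): on ROCm, pull the op from the upstream flash-attn package, and turn a failed import into an actionable error instead of a silent None.

```python
def load_flash_attn_varlen_func(is_rocm: bool):
    """Sketch of the guarded-import strategy described above: on ROCm,
    import from the upstream flash-attn package; on failure, tell the
    user what to install rather than leaving the symbol as None."""
    try:
        if is_rocm:
            from flash_attn import flash_attn_varlen_func  # upstream package
        else:
            from vllm.vllm_flash_attn import flash_attn_varlen_func
        return flash_attn_varlen_func
    except ImportError as e:
        raise ImportError(
            "flash_attn_varlen_func is unavailable; on ROCm, install the "
            "upstream flash-attn package."
        ) from e
```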

from flash_attn import flash_attn_varlen_func
else:
from vllm.attention.utils.fa_utils import flash_attn_varlen_func
from vllm.attention.utils.fa_utils import flash_attn_varlen_func
@tjtanaa (Collaborator) commented Nov 26, 2025


Likewise, vllm/attention/utils/fa_utils.py does not have the logic for ROCm; flash_attn_varlen_func will be a None object if imported this way.

We can keep the import statement from flash_attn import flash_attn_varlen_func for now. Otherwise, we have to add that from flash_attn import flash_attn_varlen_func import statement to vllm/attention/utils/fa_utils.py when the platform is ROCm.

Victor49152 and others added 4 commits November 25, 2025 21:13
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
@ywang96 (Member) left a comment


I fixed the precommit error but otherwise LGTM

cc @tjtanaa for final check on the changes for resolving FA import on ROCM platform.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 28, 2025
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Roger Wang <hey@rogerw.io>
@tjtanaa (Collaborator) commented Nov 28, 2025

@ywang96 Thanks. LGTM.

It is using both flash attention and AITER flash attention, and the code path on ROCm is working. The ChartQA scores of Qwen/Qwen3-VL-8B-Instruct with both backends are:

================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.7948,
    "anywhere_in_answer_relaxed_correctness": 0.7988
}
================================================================================
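For context, ChartQA's relaxed correctness metric counts a numeric answer as correct if it is within a 5% relative tolerance of the target, and otherwise requires an exact (case-insensitive) match. A rough sketch, not the exact evaluation harness used above:

```python
def relaxed_correctness(pred: str, target: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed accuracy: numeric answers are correct within
    a 5% relative tolerance; non-numeric answers need an exact match."""
    try:
        p, t = float(pred), float(target)
    except ValueError:
        # Non-numeric answer: fall back to case-insensitive string match.
        return pred.strip().lower() == target.strip().lower()
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= tol

print(relaxed_correctness("102", "100"))  # within the 5% tolerance
print(relaxed_correctness("110", "100"))  # 10% off, counted incorrect
```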

@vllm-bot vllm-bot merged commit 460d8bb into vllm-project:main Nov 28, 2025
51 of 53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 28, 2025

Labels

nvidia, qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed)

Projects

Status: Done
