Add support of FP32 softmax to unified attention #577
base: main
Conversation
Pull Request Overview
This PR adds support for FP32 precision softmax calculations in unified attention operations to maintain accuracy for certain Qwen2/Qwen2.5 family models. The change introduces conditional logic to perform QK matmul operations in FP32 when both use_output_tensor_in_matmulqk and fp32_softmax configuration flags are enabled.
- Adds FP32 precision support for attention score calculation
- Implements output tensor optimization for matmul operations when FP32 softmax is used
- Updates three attention functions: partial_attn_causal, partial_attn_shared, and partial_attn_unique (see the sketch below)
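For context only, here is a minimal PyTorch-style sketch of an FP32 QK/softmax path. The function name `qk_softmax` and the explicit `.float()` upcast are illustrative assumptions, not the PR's code; the PR itself gates the behavior on the `fp32_softmax` and `use_output_tensor_in_matmulqk` flags and, per the overview above, routes the QK matmul result through an output tensor rather than upcasting the inputs.

```python
import torch

def qk_softmax(query: torch.Tensor, key: torch.Tensor, scale: float,
               fp32_softmax: bool = False) -> torch.Tensor:
    """Compute softmax(Q @ K^T * scale); illustrative sketch only."""
    if fp32_softmax:
        # Run the QK matmul and the softmax in full FP32 so small score
        # differences are not lost to bf16/fp16 rounding, then cast back
        # to the compute dtype for the following PV matmul.
        scores = torch.matmul(query.float(), key.transpose(-2, -1).float())
        return torch.softmax(scores * scale, dim=-1).to(query.dtype)
    # Default path: scores stay in the model's compute dtype.
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    return torch.softmax(scores, dim=-1)
```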
Some models from the Qwen2 and Qwen2.5 families require computing attention with full FP32 precision to keep results accurate.
This PR adds support for FP32 QK matmuls in the unified attention operations.
The change depends on #571 and can be merged after it.
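As a rough illustration of the accuracy concern (toy numbers, not taken from the PR): closely spaced attention scores can become indistinguishable once rounded to bf16, which flattens the softmax output, while FP32 preserves the separation.

```python
import torch

# Toy example: two nearly equal attention scores. bf16 has 8 bits of
# significand precision, so its spacing near 10.0 is 0.0625 and 10.008
# rounds to 10.0, making the two largest softmax weights identical;
# fp32 keeps them distinct.
scores = torch.tensor([10.000, 10.008, 9.0])

p_fp32 = torch.softmax(scores, dim=-1)
p_bf16 = torch.softmax(scores.to(torch.bfloat16), dim=-1).float()

print("fp32:", p_fp32.tolist())
print("bf16:", p_bf16.tolist())
```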