Use inline VISA to optimize horizontal batched subgroup reduce #4171
base: main
Conversation
Pull Request Overview
This PR introduces an experimental inline VISA mechanism to optimize horizontal batched subgroup reduce in the Intel backend while providing stub implementations for NVIDIA and AMD backends. Key changes include:
- Adding a new warpBatchReduce function implementation with inline VISA in Intel’s TargetInfo.cpp.
- Updating header files across the Intel, NVIDIA, and AMD backends and the base interface to declare and expose the new function.
- Integrating the new warpBatchReduce call into ReduceOpToLLVM.cpp for early returns when applicable.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.h | Declare the new warpBatchReduce function. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.cpp | Implement experimental inline VISA-based warpBatchReduce logic. |
| third_party/intel/lib/TritonIntelGPUToLLVM/ReduceOpToLLVM.cpp | Integrate warpBatchReduce into reduce op conversion. |
| third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h | Add pure virtual declaration for warpBatchReduce. |
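For orientation, a minimal sketch of what the new hook could look like, pieced together from the file summary above. The parameter list, namespace, and the `StubTargetInfo` name are assumptions for illustration, not the PR's actual declarations:

```cpp
// Hypothetical sketch of the TargetInfoBase hook and the NVIDIA/AMD stubs.
// Names other than warpBatchReduce are assumed; the real signature may differ.
#include "mlir/IR/Location.h"
#include "mlir/IR/PatternMatch.h"
#include "llvm/ADT/SmallVector.h"
#include <map>

namespace mlir::triton {
using llvm::SmallVector;

class TargetInfoBase {
public:
  // Returns true if the target lowered the whole batch of per-element
  // accumulators in `acc` with a subgroup reduce; callers fall back to the
  // generic shuffle-based lowering when it returns false.
  virtual bool
  warpBatchReduce(RewriterBase &rewriter, Location loc,
                  std::map<SmallVector<unsigned>, SmallVector<Value>> &acc,
                  unsigned numLaneToReduce, unsigned warpSize) const = 0;

  virtual ~TargetInfoBase() = default;
};

// NVIDIA/AMD stub (sketch): opt out so the existing lowering is used.
class StubTargetInfo : public TargetInfoBase {
public:
  bool warpBatchReduce(RewriterBase &, Location,
                       std::map<SmallVector<unsigned>, SmallVector<Value>> &,
                       unsigned, unsigned) const override {
    return false;
  }
};
} // namespace mlir::triton
```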
for (auto it : acc) {
  const SmallVector<unsigned> &key = it.first;
  SmallVector<Value> &val = acc[key];
Copilot AI commented on May 12, 2025
[nitpick] Iterating over 'acc' using 'auto it' and then accessing 'acc[key]' results in redundant lookups; consider using structured bindings (e.g., 'for (auto &[key, val] : acc)') to improve clarity and efficiency.
Suggested change:
-  for (auto it : acc) {
-    const SmallVector<unsigned> &key = it.first;
-    SmallVector<Value> &val = acc[key];
+  for (auto &[key, val] : acc) {
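As a standalone illustration of why the suggestion helps (using plain STL containers in place of the SmallVector/Value types from the diff, and assuming `acc` behaves like a `std::map`), the structured-binding form traverses the map once instead of copying each pair and re-looking up the key:

```cpp
// Minimal, self-contained illustration; STL types stand in for the LLVM ones.
#include <iostream>
#include <map>
#include <vector>

int main() {
  std::map<std::vector<unsigned>, std::vector<int>> acc = {
      {{0u}, {1, 2}}, {{1u}, {3, 4}}};

  // Before: copies each pair, then acc[key] performs a second O(log n) lookup.
  for (auto it : acc) {
    const std::vector<unsigned> &key = it.first;
    std::vector<int> &val = acc[key]; // redundant lookup
    val.push_back(0);
  }

  // After: one traversal, no pair copies, no extra lookups.
  for (auto &[key, val] : acc) {
    (void)key;
    val.push_back(0);
  }

  std::cout << acc.begin()->second.size() << "\n"; // prints 4
  return 0;
}
```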
if (!isSupportedWarpReduceOp(reduceOp, numLaneToReduce, warpSize))
  return false;

// It is only experimental code supports threads_per_warp=16
Copilot AI commented on May 12, 2025
[nitpick] The hard-coded check for warpSize == 16 limits the function to experimental scenarios; consider adding a comment or an assert to clarify the dependency on this constraint.
Suggested change:
-  // It is only experimental code supports threads_per_warp=16
+  // This code is experimental and currently supports only threads_per_warp=16.
+  assert(warpSize == 16 && "This experimental code supports only warpSize of 16.");
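Combining this constraint with the dtype limits stated in the PR description (float32/float16 only for now), the gating could be collected into one predicate. A hedged sketch with assumed helper and parameter names, not the PR's actual code:

```cpp
// Sketch only: names are assumptions. It mirrors the preconditions visible in
// the thread: a supported reduce op, an f32/f16 element type, and the
// experimental threads_per_warp == 16 constraint.
#include "mlir/IR/BuiltinTypes.h"

static bool canUseSimdBatchReduce(bool opSupported, mlir::Type elemTy,
                                  unsigned warpSize) {
  if (warpSize != 16)
    return false; // experimental path only handles 16-lane subgroups
  if (!elemTy.isF32() && !elemTy.isF16())
    return false; // float32 and float16 only for now
  return opSupported; // e.g. the result of isSupportedWarpReduceOp(...)
}
```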
Force-pushed from 0ef1308 to f925709
alexbaden left a comment
Where is the GitHub issue for this work?
unsigned threadOffsetOnReductionAxis =
    helper.getThreadOffsetOnReductionAxis();

auto ret =
Do we need to add the method to the global target info if we are the only ones using it, inside files we control?
We are still investigating how to implement this upstream.
Maybe we can use the in-tree MLIR ops: https://mlir.llvm.org/docs/Dialects/GPU/#gpusubgroup_reduce-gpusubgroupreduceop
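To make the upstream alternative concrete, here is a rough sketch of emitting the in-tree op from C++. The builder overload (value, reduction kind, uniform flag) is an assumption that depends on the MLIR version, and this is not code from the PR:

```cpp
// Sketch: reduce one scalar across the subgroup with the upstream GPU dialect
// op instead of target-specific inline VISA. Verify the builder signature
// against the MLIR version in use; it has changed over time.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

static mlir::Value emitSubgroupSum(mlir::OpBuilder &builder,
                                   mlir::Location loc, mlir::Value v) {
  // Emits: %r = gpu.subgroup_reduce add %v : (f32) -> f32
  return builder
      .create<mlir::gpu::SubgroupReduceOp>(
          loc, v, mlir::gpu::AllReduceOperation::ADD, /*uniform=*/false)
      .getResult();
}
```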
Force-pushed from f925709 to 8d4e6e0
This is a large PR which is going to be split into several small ones.
Force-pushed from 8d4e6e0 to 98ff036
While developing a kernel, I was given the error message "AssertionError()" without much helpful context on how to proceed with debugging. I could only solve it after spending half a day understanding that part of the Triton source code. That's why I'm (1) adding an error message to this part of the code, and (2) making the error message above it clearer (like it is in visit_While). This should allow the end user to debug this error without needing to dive into the Triton source code.
Force-pushed from 98ff036 to c167d43
The functionality of the SIMD reduce is ready and tested.
Force-pushed from c167d43 to 14b732e
Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
Due to #4171 (comment), moving this PR to draft.
Use inline VISA to optimize horizontal batched subgroup reduce.
Supports float32 and float16 for now. Run the unit test CI.