Use inline VISA to optimize horizontal batched subgroup reduce #4171
base: main
Conversation
Pull Request Overview
This PR introduces an experimental inline VISA mechanism to optimize horizontal batched subgroup reduce in the Intel backend while providing stub implementations for NVIDIA and AMD backends. Key changes include:
- Adding a new warpBatchReduce function implementation with inline VISA in Intel’s TargetInfo.cpp.
- Updating header files across the Intel, NVIDIA, and AMD backends and the base interface to declare and expose the new function.
- Integrating the new warpBatchReduce call into ReduceOpToLLVM.cpp for early returns when applicable.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.h | Declare the new warpBatchReduce function. |
| third_party/intel/lib/TritonIntelGPUToLLVM/TargetInfo.cpp | Implement experimental inline VISA-based warpBatchReduce logic. |
| third_party/intel/lib/TritonIntelGPUToLLVM/ReduceOpToLLVM.cpp | Integrate warpBatchReduce into reduce op conversion. |
| third_party/amd/lib/TritonAMDGPUToLLVM/TargetInfo.h | Add stub implementation for warpBatchReduce returning false. |
| include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h | Add pure virtual declaration for warpBatchReduce. |
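For orientation, a minimal sketch of what the new hook could look like, pieced together from the file summary above. The parameter list, namespace, and the `StubTargetInfo` name are assumptions for illustration, not the PR's actual declarations:

```cpp
// Hypothetical sketch of the TargetInfoBase hook and the NVIDIA/AMD stubs.
// Names other than warpBatchReduce are assumed; the real signature may differ.
#include "mlir/IR/Location.h"
#include "mlir/IR/PatternMatch.h"
#include "llvm/ADT/SmallVector.h"
#include <map>

namespace mlir::triton {
using llvm::SmallVector;

class TargetInfoBase {
public:
  // Returns true if the target lowered the whole batch of per-element
  // accumulators in `acc` with a subgroup reduce; callers fall back to the
  // generic shuffle-based lowering when it returns false.
  virtual bool
  warpBatchReduce(RewriterBase &rewriter, Location loc,
                  std::map<SmallVector<unsigned>, SmallVector<Value>> &acc,
                  unsigned numLaneToReduce, unsigned warpSize) const = 0;

  virtual ~TargetInfoBase() = default;
};

// NVIDIA/AMD stub (sketch): opt out so the existing lowering is used.
class StubTargetInfo : public TargetInfoBase {
public:
  bool warpBatchReduce(RewriterBase &, Location,
                       std::map<SmallVector<unsigned>, SmallVector<Value>> &,
                       unsigned, unsigned) const override {
    return false;
  }
};
} // namespace mlir::triton
```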
for (auto it : acc) {
  const SmallVector<unsigned> &key = it.first;
  SmallVector<Value> &val = acc[key];
Copilot AI commented on May 12, 2025
[nitpick] Iterating over 'acc' using 'auto it' and then accessing 'acc[key]' results in redundant lookups; consider using structured bindings (e.g., 'for (auto &[key, val] : acc)') to improve clarity and efficiency.
Suggested change:
-  for (auto it : acc) {
-    const SmallVector<unsigned> &key = it.first;
-    SmallVector<Value> &val = acc[key];
+  for (auto &[key, val] : acc) {
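As a standalone illustration of why the suggestion helps (using plain STL containers in place of the SmallVector/Value types from the diff, and assuming `acc` behaves like a `std::map`), the structured-binding form traverses the map once instead of copying each pair and re-looking up the key:

```cpp
// Minimal, self-contained illustration; STL types stand in for the LLVM ones.
#include <iostream>
#include <map>
#include <vector>

int main() {
  std::map<std::vector<unsigned>, std::vector<int>> acc = {
      {{0u}, {1, 2}}, {{1u}, {3, 4}}};

  // Before: copies each pair, then acc[key] performs a second O(log n) lookup.
  for (auto it : acc) {
    const std::vector<unsigned> &key = it.first;
    std::vector<int> &val = acc[key]; // redundant lookup
    val.push_back(0);
  }

  // After: one traversal, no pair copies, no extra lookups.
  for (auto &[key, val] : acc) {
    (void)key;
    val.push_back(0);
  }

  std::cout << acc.begin()->second.size() << "\n"; // prints 4
  return 0;
}
```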
if (!isSupportedWarpReduceOp(reduceOp, numLaneToReduce, warpSize))
  return false;

// It is only experimental code supports threads_per_warp=16
Copilot AI commented on May 12, 2025
[nitpick] The hard-coded check for warpSize == 16 limits the function to experimental scenarios; consider adding a comment or an assert to clarify the dependency on this constraint.
Suggested change:
-  // It is only experimental code supports threads_per_warp=16
+  // This code is experimental and currently supports only threads_per_warp=16.
+  assert(warpSize == 16 && "This experimental code supports only warpSize of 16.");
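Combining this constraint with the dtype limits stated in the PR description (float32/float16 only for now), the gating could be collected into one predicate. A hedged sketch with assumed helper and parameter names, not the PR's actual code:

```cpp
// Sketch only: names are assumptions. It mirrors the preconditions visible in
// the thread: a supported reduce op, an f32/f16 element type, and the
// experimental threads_per_warp == 16 constraint.
#include "mlir/IR/BuiltinTypes.h"

static bool canUseSimdBatchReduce(bool opSupported, mlir::Type elemTy,
                                  unsigned warpSize) {
  if (warpSize != 16)
    return false; // experimental path only handles 16-lane subgroups
  if (!elemTy.isF32() && !elemTy.isF16())
    return false; // float32 and float16 only for now
  return opSupported; // e.g. the result of isSupportedWarpReduceOp(...)
}
```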
Force-pushed from 0ef1308 to f925709
alexbaden left a comment
Where is the GitHub issue for this work?
unsigned threadOffsetOnReductionAxis =
    helper.getThreadOffsetOnReductionAxis();

auto ret =
Do we need to add the method to the global target info if we are the only ones using it, inside files we control?
We are still investigating how to implement this upstream.
Maybe we can use the in-tree MLIR ops: https://mlir.llvm.org/docs/Dialects/GPU/#gpusubgroup_reduce-gpusubgroupreduceop
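To make the upstream alternative concrete, here is a rough sketch of emitting the in-tree op from C++. The builder overload (value, reduction kind, uniform flag) is an assumption that depends on the MLIR version, and this is not code from the PR:

```cpp
// Sketch: reduce one scalar across the subgroup with the upstream GPU dialect
// op instead of target-specific inline VISA. Verify the builder signature
// against the MLIR version in use; it has changed over time.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/IR/Builders.h"

static mlir::Value emitSubgroupSum(mlir::OpBuilder &builder,
                                   mlir::Location loc, mlir::Value v) {
  // Emits: %r = gpu.subgroup_reduce add %v : (f32) -> f32
  return builder
      .create<mlir::gpu::SubgroupReduceOp>(
          loc, v, mlir::gpu::AllReduceOperation::ADD, /*uniform=*/false)
      .getResult();
}
```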
Force-pushed from f925709 to 8d4e6e0
This is a large PR which is going to be split into several small ones.
Force-pushed from 8d4e6e0 to 98ff036
While developing a kernel, I was given the error message "AssertionError()" without much helpful context on how to proceed with debugging. I could only solve it after spending half a day understanding that part of the Triton source code. That's why I'm (1) adding an error message to this part of the code, and (2) making the error message above it clearer (like it is in visit_While). This should allow the end user to debug this error without needing to dive into the Triton source code.
Force-pushed from 98ff036 to c167d43
The functionality of the SIMD reduce is ready and tested.
Force-pushed from c167d43 to 14b732e
Signed-off-by: Lu,Chengjun <chengjun.lu@intel.com>
Due to #4171 (comment), moving this PR to draft.
Use inline VISA to optimize horizontal batched subgroup reduce.
Supports float32 and float16 for now. Run the unit test CI.