layernorm: enlarge the range for 2-pass reduction (#2282)

weishi-deng · EikanWang · web-flow · commit 7249dd73c228 · 2025-11-19T12:35:37.000Z
In the OOB models, some shapes with large M but small N still fall
below the performance expectation.
Sample shapes:
[128, 197, 384]
[64, 784, 256]
[64, 28, 28, 256]
[256, 197, 256]
[128, 196, 384]
After enlarging the range for 2-pass reduction, these models save an
average of 10-20 ms of model execution time, and the geomean
performance of eager training in timm models improves from 0.835 to
0.842.
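
As a rough illustration of how the enlarged range affects the sample
shapes, the old and new dispatch predicates can be compared side by
side. This is a minimal sketch, not the kernel code: the helper names
old_m_parallel and new_m_parallel are hypothetical, xe_core_count
stands in for syclGpuEuCount() / syclGpuEUCountPerSubslice(), the
use_two_stage_col_reduction precondition is assumed to hold, and M is
assumed to be the product of the leading dimensions with N the last
(normalized) dimension.

```cpp
#include <cstdint>

// Hypothetical helper: the predicate as it was before this change.
// xe_core_count is an assumed stand-in for
// syclGpuEuCount() / syclGpuEUCountPerSubslice().
inline bool old_m_parallel(int64_t M, int64_t N, int xe_core_count) {
  return M > 64 * 1024 && N / 32 < xe_core_count / 2;
}

// Hypothetical helper: the predicate after this change. The M threshold
// now scales with the Xe-core count, and the bound on N / 32 is raised
// from xe_core_count / 2 to xe_core_count * 2.
inline bool new_m_parallel(int64_t M, int64_t N, int xe_core_count) {
  const int64_t tile_n = N / 32;
  return M > static_cast<int64_t>(xe_core_count) * 1024 &&
      tile_n < static_cast<int64_t>(xe_core_count) * 2;
}
```

With an assumed xe_core_count of 20 (an illustrative value, not taken
from this change), [128, 197, 384] gives M = 25216 and N = 384: the old
predicate rejects it because M <= 65536, while the new one accepts it
because 25216 > 20480 and 12 < 40. The result depends on the device: at
64 Xe cores the M threshold equals the previous 64 * 1024, and only the
bound on N / 32 differs.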

---------

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
diff --git a/src/ATen/native/xpu/sycl/LayerNormKernels.cpp b/src/ATen/native/xpu/sycl/LayerNormKernels.cpp
@@ -1063,8 +1063,10 @@ void _layer_norm_backward_kernel(
       norm_config_global_size / syclMaxSubGroupSize() * 2 <= thread_slots;
   // cuda uses condition M > 64 * 1024 && N / 32 < sm_count / 2 to parallelize
   // in the M dimension
-  if (use_two_stage_col_reduction && M > 64 * 1024 &&
-      N / 32 < syclGpuEuCount() / syclGpuEUCountPerSubslice() / 2) {
+  int xe_core_count = syclGpuEuCount() / syclGpuEUCountPerSubslice();
+  int tile_n = N / 32;
+  if (use_two_stage_col_reduction && M > xe_core_count * 1024 &&
+      tile_n < xe_core_count * 2) {
     const size_t local_size_x = 8;
     const size_t SIMD = 32;
     // workgroup size is 256