Commit 7249dd7

layernorm: enlarge the range for 2-pass reduction (#2282)
From the OOB models, some shapes with large M but small N still fall below the performance expectation. Sample shapes:
[128, 197, 384]
[64, 784, 256]
[64, 28, 28, 256]
[256, 197, 256]
[128, 196, 384]

After enlarging the range for 2-pass reduction, these models run an average of 10-20 ms faster, and the geomean performance of eager training in timm models improves from 0.835 to 0.842.

---------

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
1 parent a5f45df commit 7249dd7

1 file changed, +4 -2 lines changed
src/ATen/native/xpu/sycl/LayerNormKernels.cpp

Lines changed: 4 additions & 2 deletions
@@ -1063,8 +1063,10 @@ void _layer_norm_backward_kernel(
       norm_config_global_size / syclMaxSubGroupSize() * 2 <= thread_slots;
   // cuda uses condition M > 64 * 1024 && N / 32 < sm_count / 2 to parallelize
   // in the M dimension
-  if (use_two_stage_col_reduction && M > 64 * 1024 &&
-      N / 32 < syclGpuEuCount() / syclGpuEUCountPerSubslice() / 2) {
+  int xe_core_count = syclGpuEuCount() / syclGpuEUCountPerSubslice();
+  int tile_n = N / 32;
+  if (use_two_stage_col_reduction && M > xe_core_count * 1024 &&
+      tile_n < xe_core_count * 2) {
     const size_t local_size_x = 8;
     const size_t SIMD = 32;
     // workgroup size is 256
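
To see how the changed condition widens the 2-pass range, here is a minimal standalone sketch that compares the old and new predicates on the CPU. It makes several assumptions that are not in the commit: `flatten_shape`, `old_condition` and `new_condition` are hypothetical helpers, the shape is assumed to be normalized over its last dimension only (so M is the product of the leading dimensions and N is the last one), and `xe_core_count` is passed in as a plain integer instead of being queried through `syclGpuEuCount() / syclGpuEUCountPerSubslice()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: a shape flattened to (M, N).
struct Flattened {
  int64_t M;  // number of rows: product of the leading dimensions
  int64_t N;  // row length: the last dimension
};

// Assumption: layer norm normalizes over the last dimension only.
inline Flattened flatten_shape(const std::vector<int64_t>& dims) {
  assert(!dims.empty());
  int64_t m = 1;
  for (std::size_t i = 0; i + 1 < dims.size(); ++i) {
    m *= dims[i];
  }
  return {m, dims.back()};
}

// Predicate as written before the commit: fixed M threshold of 64 * 1024,
// and N / 32 must be below half the Xe core count.
inline bool old_condition(bool use_two_stage_col_reduction, int64_t M,
                          int64_t N, int xe_core_count) {
  return use_two_stage_col_reduction && M > 64 * 1024 &&
      N / 32 < xe_core_count / 2;
}

// Predicate as written after the commit: the M threshold scales with the
// Xe core count, and tile_n may reach up to twice the Xe core count.
inline bool new_condition(bool use_two_stage_col_reduction, int64_t M,
                          int64_t N, int xe_core_count) {
  const int64_t tile_n = N / 32;
  return use_two_stage_col_reduction &&
      M > static_cast<int64_t>(xe_core_count) * 1024 &&
      tile_n < static_cast<int64_t>(xe_core_count) * 2;
}
```

For example, with an illustrative (not measured) `xe_core_count` of 20, the sample shape [128, 197, 384] flattens to M = 25216 and N = 384. The old predicate rejects it because M does not exceed 64 * 1024, while the new predicate accepts it because M exceeds 20 * 1024 and N / 32 = 12 is below 40. Whether a real device takes the 2-pass path still depends on its actual EU counts and on `use_two_stage_col_reduction`.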