Commit 7b2e367

[webgpu] Optimize DP4AMatMulNBitsSmallMProgram for intel (microsoft#25192)
### Description

This PR optimizes the Intel GPU path for the `DP4AMatMulNBitsSmallMProgram` by tuning `tile_size` and `tile_size_k_vec`.

### Motivation and Context

With this change, we achieved a >8% performance boost on Intel iGPUs (Xe-LP and Xe2-LPG) for the phi-4-mini-accuracy4 model.

1 parent dafa7f9 commit 7b2e367

1 file changed: +5 -0 lines changed

onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_nbits.cc

Lines changed: 5 additions & 0 deletions
@@ -596,6 +596,11 @@ Status ApplyDP4AMatrixMatMulNBits(const Tensor* a, const Tensor* b, const Tensor
 uint32_t tile_size_k_vec = 16;
 uint32_t tile_size = 32;
 
+if (context.AdapterInfo().vendor == std::string_view{"intel"}) {
+  tile_size_k_vec = 32;
+  tile_size = 4;
+}
+
 DP4AMatMulNBitsSmallMProgram mul_program{tile_size_k_vec, tile_size, nbits, has_zero_points};
 uint32_t num_N_tile = (N + tile_size - 1) / tile_size;
 mul_program.SetWorkgroupSize(128);
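
To make the effect of the smaller `tile_size` concrete, here is a minimal standalone sketch of the selection logic and of the rounding-up division that produces `num_N_tile`. The names `SmallMTileConfig`, `SelectSmallMTileConfig`, and `NumNTiles` are hypothetical and do not exist in onnxruntime; only the tile values, the `"intel"` vendor string, and the `(N + tile_size - 1) / tile_size` expression are taken from the diff.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical container for the two tuned parameters (not an onnxruntime type).
struct SmallMTileConfig {
  uint32_t tile_size_k_vec;
  uint32_t tile_size;
};

// Mirrors the vendor check in the diff: defaults of 16/32, Intel override of 32/4.
constexpr SmallMTileConfig SelectSmallMTileConfig(std::string_view vendor) {
  if (vendor == "intel") {
    return {32, 4};
  }
  return {16, 32};
}

// Same rounding-up division as `(N + tile_size - 1) / tile_size` in the diff.
constexpr uint32_t NumNTiles(uint32_t N, uint32_t tile_size) {
  return (N + tile_size - 1) / tile_size;
}
```

For an illustrative `N` of 3072 (an assumed value, not one from the PR), the default `tile_size` of 32 gives 96 tiles along N, while the Intel setting of 4 gives 768, so each tile covers fewer output columns on Intel adapters.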
