[webgpu] Optimize DP4AMatMulNBitsSmallMProgram for intel (microsoft#25192)

jing-bao · web-flow · commit 7b2e367454b4 · 2025-07-07T12:19:36.000-07:00
### Description
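This PR optimizes the Intel GPU path of
`DP4AMatMulNBitsSmallMProgram` by tuning two tiling parameters when the
adapter vendor is `intel`: `tile_size` changes from 32 to 4 and
`tile_size_k_vec` changes from 16 to 32. Other vendors keep the existing
defaults.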
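
To illustrate how the tuned `tile_size` affects the tile count along N, here is a minimal standalone sketch. `TileConfig`, `SelectTileConfig`, and `NumNTiles` are hypothetical helpers written for this example and are not part of ONNX Runtime; only the parameter values and the ceiling-division expression come from the diff.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical container for the two tuned parameters (illustration only).
struct TileConfig {
  uint32_t tile_size_k_vec;
  uint32_t tile_size;
};

// Mirrors the vendor check in this PR: defaults are 16/32, Intel uses 32/4.
constexpr TileConfig SelectTileConfig(std::string_view vendor) {
  if (vendor == "intel") {
    return {32, 4};
  }
  return {16, 32};
}

// Ceiling division, the same expression used for num_N_tile in the diff.
constexpr uint32_t NumNTiles(uint32_t n, uint32_t tile_size) {
  return (n + tile_size - 1) / tile_size;
}
```

For an assumed N of 3072, `NumNTiles` gives 96 tiles with the default `tile_size` of 32 and 768 tiles with the Intel `tile_size` of 4, so each tile covers fewer columns of the output on Intel.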



### Motivation and Context
With this change, we achieved a >8% performance boost on Intel iGPUs
(Xe-LP and Xe2-LPG) for the phi-4-mini-accuracy4 model.
diff --git a/onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_nbits.cc b/onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_nbits.cc
@@ -596,6 +596,11 @@ Status ApplyDP4AMatrixMatMulNBits(const Tensor* a, const Tensor* b, const Tensor
     uint32_t tile_size_k_vec = 16;
     uint32_t tile_size = 32;
 
+    if (context.AdapterInfo().vendor == std::string_view{"intel"}) {
+      tile_size_k_vec = 32;
+      tile_size = 4;
+    }
+
     DP4AMatMulNBitsSmallMProgram mul_program{tile_size_k_vec, tile_size, nbits, has_zero_points};
     uint32_t num_N_tile = (N + tile_size - 1) / tile_size;
     mul_program.SetWorkgroupSize(128);