Optimize NVFP4 Triton kernel (NVIDIA#533)

mxinO · jQizhang · commit fd8db7b1a4cc · 2025-11-26T05:54:41.000-08:00
## What does this PR do?

**Type of change:** Bug fix

**Overview:**

1. Use `make_block_ptr` to load blocks. This is safer and fixes an illegal memory access that occurred in rare cases.
2. Tile rows and columns can now be specified separately.
3. Move the data type cast into the kernel to save memory for bf16/fp16 inputs.
4. Benchmarked against the old kernel on H100 and B200: the new kernel shows a significant speed-up for medium and large inputs (B200: 1.4x - 2x, H100: 1.7x - 2.8x for bf16/fp16; fp32 gains are smaller).

H100:

```shell
Shape: 512x512
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 35.32 µs   new kernel: 38.49 µs   speedup: 0.92x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 43.48 µs   new kernel: 44.78 µs   speedup: 0.97x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 43.25 µs   new kernel: 43.69 µs   speedup: 0.99x
Shape: 1024x1024
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 36.03 µs   new kernel: 38.17 µs   speedup: 0.94x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 44.24 µs   new kernel: 43.78 µs   speedup: 1.01x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 43.77 µs   new kernel: 43.61 µs   speedup: 1.00x
Shape: 4096x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 87.02 µs   new kernel: 80.88 µs   speedup: 1.08x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 116.12 µs  new kernel: 65.80 µs   speedup: 1.76x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 114.39 µs  new kernel: 65.30 µs   speedup: 1.75x
Shape: 8192x8192
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 237.29 µs  new kernel: 219.42 µs  speedup: 1.08x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 349.76 µs  new kernel: 138.66 µs  speedup: 2.52x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 341.89 µs  new kernel: 136.91 µs  speedup: 2.50x
Shape: 8192x12288
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 338.65 µs  new kernel: 312.70 µs  speedup: 1.08x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 505.63 µs  new kernel: 188.24 µs  speedup: 2.69x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 492.97 µs  new kernel: 186.88 µs  speedup: 2.64x
Shape: 12288x12288
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 490.25 µs  new kernel: 451.16 µs  speedup: 1.09x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 736.04 µs  new kernel: 261.94 µs  speedup: 2.81x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 717.64 µs  new kernel: 257.82 µs  speedup: 2.78x
Shape: 32x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 35.61 µs   new kernel: 38.23 µs   speedup: 0.93x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 43.00 µs   new kernel: 43.85 µs   speedup: 0.98x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 42.83 µs   new kernel: 44.13 µs   speedup: 0.97x
Shape: 1024x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 38.12 µs   new kernel: 41.28 µs   speedup: 0.92x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 52.80 µs   new kernel: 45.96 µs   speedup: 1.15x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 51.56 µs   new kernel: 45.30 µs   speedup: 1.14x
Shape: 32x5000
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 41.70 µs   new kernel: 38.03 µs   speedup: 1.10x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 52.95 µs   new kernel: 44.14 µs   speedup: 1.20x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 52.57 µs   new kernel: 44.38 µs   speedup: 1.18x
Shape: 128x8200
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 48.03 µs   new kernel: 38.38 µs   speedup: 1.25x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 60.54 µs   new kernel: 44.51 µs   speedup: 1.36x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 60.08 µs   new kernel: 43.59 µs   speedup: 1.38x
```

B200:

```shell
Shape: 512x512
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 34.63 µs   new kernel: 32.80 µs   speedup: 1.06x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 42.26 µs   new kernel: 40.92 µs   speedup: 1.03x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 41.38 µs   new kernel: 39.30 µs   speedup: 1.05x
Shape: 1024x1024
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 35.07 µs   new kernel: 33.93 µs   speedup: 1.03x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 43.57 µs   new kernel: 39.55 µs   speedup: 1.10x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 43.72 µs   new kernel: 38.96 µs   speedup: 1.12x
Shape: 4096x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 71.64 µs   new kernel: 58.66 µs   speedup: 1.22x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 81.67 µs   new kernel: 57.98 µs   speedup: 1.41x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 82.19 µs   new kernel: 57.56 µs   speedup: 1.43x
Shape: 8192x8192
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 176.85 µs  new kernel: 135.78 µs  speedup: 1.30x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 217.99 µs  new kernel: 121.84 µs  speedup: 1.79x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 215.47 µs  new kernel: 117.41 µs  speedup: 1.84x
Shape: 8192x12288
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 248.18 µs  new kernel: 186.64 µs  speedup: 1.33x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 306.25 µs  new kernel: 163.28 µs  speedup: 1.88x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 303.06 µs  new kernel: 157.59 µs  speedup: 1.92x
Shape: 12288x12288
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 354.23 µs  new kernel: 262.99 µs  speedup: 1.35x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 439.44 µs  new kernel: 224.71 µs  speedup: 1.96x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 434.23 µs  new kernel: 217.62 µs  speedup: 2.00x
Shape: 32x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 35.90 µs   new kernel: 34.88 µs   speedup: 1.03x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 43.77 µs   new kernel: 41.49 µs   speedup: 1.05x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 43.22 µs   new kernel: 41.79 µs   speedup: 1.03x
Shape: 1024x4096
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 37.37 µs   new kernel: 37.84 µs   speedup: 0.99x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 49.69 µs   new kernel: 43.85 µs   speedup: 1.13x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 48.93 µs   new kernel: 44.31 µs   speedup: 1.10x
Shape: 32x5000
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 41.83 µs   new kernel: 35.44 µs   speedup: 1.18x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 53.23 µs   new kernel: 40.64 µs   speedup: 1.31x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 54.39 µs   new kernel: 40.77 µs   speedup: 1.33x
Shape: 128x8200
  dtype: torch.float32   max abs diff: 0.000e+00   old kernel: 49.35 µs   new kernel: 35.33 µs   speedup: 1.40x
  dtype: torch.bfloat16  max abs diff: 0.000e+00   old kernel: 60.89 µs   new kernel: 41.46 µs   speedup: 1.47x
  dtype: torch.float16   max abs diff: 0.000e+00   old kernel: 61.75 µs   new kernel: 41.75 µs   speedup: 1.48x
```

## Testing

1. Compared with old kernel, diff=0
2. Benchmark speed

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Additional Information

Bug [5612406]

---------

Signed-off-by: mxin <mxin@nvidia.com>
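
The host-side tile arithmetic behind overview item 2 can be sketched in plain Python. This is only an illustration of the rounding and grid computation that appears in the diff (`tile_cols_aligned`, `num_fp4_blocks`, `triton.cdiv`); the helper name `launch_geometry` and the local `cdiv` are hypothetical and not part of the PR, and nothing is launched.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def launch_geometry(M, N, block_size=16, tile_rows=16, tile_cols=64):
    """Return (grid, tile_cols_aligned, num_fp4_blocks) for an (M, N) input.

    Hypothetical helper: it only reproduces the host-side arithmetic of
    fp4_fake_quant_block as shown in the diff.
    """
    # A tile must hold at least one full FP4 block along the column axis.
    tile_cols = max(tile_cols, block_size)
    # Round the tile width up to a whole number of FP4 blocks.
    tile_cols_aligned = cdiv(tile_cols, block_size) * block_size
    num_fp4_blocks = tile_cols_aligned // block_size
    # One program per tile; edge tiles may extend past the tensor bounds.
    grid = (cdiv(M, tile_rows), cdiv(N, tile_cols_aligned))
    return grid, tile_cols_aligned, num_fp4_blocks


if __name__ == "__main__":
    print(launch_geometry(32, 5000))                   # default 16x64 tiles
    print(launch_geometry(128, 8200, tile_cols=100))   # width rounded up to 112
```

For the 32x5000 benchmark shape the last column tile covers columns 4992 to 5055, so it hangs past the edge of the tensor; that is the case the kernel's `boundary_check=(0, 1)` handles. Note also that Triton generally expects block dimensions to be powers of two, so a rounded width such as 112 may not compile even though the arithmetic accepts it; that is an observation about Triton in general, not something the PR states.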
diff --git a/modelopt/torch/quantization/triton/fp4_kernel.py b/modelopt/torch/quantization/triton/fp4_kernel.py
@@ -27,67 +27,83 @@
 __all__ = ["fp4_fake_quant_block"]
 
 
+_TORCH_TO_TL_DTYPE = {
+    torch.float32: tl.float32,
+    torch.float: tl.float32,
+    torch.float16: tl.float16,
+    torch.half: tl.float16,
+    torch.bfloat16: tl.bfloat16,
+}
+
+
+def _torch_dtype_to_tl(dtype: torch.dtype):
+    if dtype not in _TORCH_TO_TL_DTYPE:
+        raise ValueError(f"Unsupported dtype for fp4 fake quantization: {dtype}")
+    return _TORCH_TO_TL_DTYPE[dtype]
+
+
 @triton.jit
 def fp4_fake_quant_kernel(
     x_ptr,
     y_ptr,
     M,
     N,
     global_scale_ptr,
+    stride_xm,
+    stride_xn,
+    stride_ym,
+    stride_yn,
     BLOCK_SIZE: tl.constexpr,
-    TILE_SIZE: tl.constexpr,
+    TILE_M: tl.constexpr,
+    TILE_N: tl.constexpr,
     NUM_FP4_BLOCKS: tl.constexpr,
+    OUT_DTYPE: tl.constexpr,
 ):
-    """Applies FP4 fake quantization on input data using per-block scaling factors.
-
-    Args:
-        x_ptr (tl.pointer): Pointer to the input tensor (BF16/FP32)
-        y_ptr (tl.pointer): Pointer to the output buffer
-        M (int): Number of rows in the matrix
-        N (int): Number of columns in the matrix
-        global_scale_ptr (tl.pointer): Pointer to the global scaling factor tensor
-        BLOCK_SIZE (tl.constexpr): Size of each FP4 quantization block
-        TILE_SIZE (tl.constexpr): Size of the processing block
-        NUM_FP4_BLOCKS (tl.constexpr): Number of FP4 blocks within TILE_SIZE
-    """
+    """Applies FP4 fake quantization using block pointers for memory addressing."""
     pid_m = tl.program_id(axis=0)
     pid_n = tl.program_id(axis=1)
 
-    # Load global scale from tensor
-    global_scale = tl.load(global_scale_ptr).to(tl.float32)
+    row_start = pid_m * TILE_M
+    col_start = pid_n * TILE_N
 
-    # Calculate offsets
-    offs_m = pid_m * TILE_SIZE + tl.arange(0, TILE_SIZE)
-    offs_n = pid_n * TILE_SIZE + tl.arange(0, TILE_SIZE)
-    offs = offs_m[:, None] * N + offs_n[None, :]
-    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
+    x_block_ptr = tl.make_block_ptr(
+        base=x_ptr,
+        shape=(M, N),
+        strides=(stride_xm, stride_xn),
+        offsets=(row_start, col_start),
+        block_shape=(TILE_M, TILE_N),
+        order=(1, 0),
+    )
+    y_block_ptr = tl.make_block_ptr(
+        base=y_ptr,
+        shape=(M, N),
+        strides=(stride_ym, stride_yn),
+        offsets=(row_start, col_start),
+        block_shape=(TILE_M, TILE_N),
+        order=(1, 0),
+    )
+
+    global_scale = tl.load(global_scale_ptr).to(tl.float32)
+    global_scale_safe = tl.where(global_scale > 0.0, global_scale, 1e-12)
 
-    # Load input data
-    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
+    tile = tl.load(x_block_ptr, boundary_check=(0, 1), padding_option="zero").to(tl.float32)
 
-    # Reshape for block processing
-    x_reshaped = tl.reshape(x, (TILE_SIZE, NUM_FP4_BLOCKS, BLOCK_SIZE))
-    x_abs = tl.abs(x_reshaped)
+    tile_reshaped = tl.reshape(tile, (TILE_M, NUM_FP4_BLOCKS, BLOCK_SIZE))
+    x_abs = tl.abs(tile_reshaped)
 
-    # Calculate max values for each FP4 block
     block_max = tl.max(x_abs, axis=2, keep_dims=True)
-    # global_scale = global_amax / (448 * 6)
-    block_max_quant = (
-        tl.minimum((block_max / (6.0 * global_scale)), 448.0).to(tl.float8e4nv).to(tl.float32)
-        * global_scale
-    )
 
-    # Broadcast max values
+    block_max_scaled = block_max / (6.0 * global_scale_safe)
+    block_max_scaled = tl.minimum(block_max_scaled, 448.0)
+    block_max_quant = block_max_scaled.to(tl.float8e4nv).to(tl.float32) * global_scale
+    block_max_quant = tl.where(block_max_quant >= 1e-5, block_max_quant, 1.0)
+
     block_max_quant_broadcast = tl.broadcast_to(
-        block_max_quant, (TILE_SIZE, NUM_FP4_BLOCKS, BLOCK_SIZE)
-    )
-    # Set scale to 1 if block amax is 0
-    block_max_quant_broadcast = tl.where(
-        block_max_quant_broadcast < 1e-5, 1.0, block_max_quant_broadcast
+        block_max_quant, (TILE_M, NUM_FP4_BLOCKS, BLOCK_SIZE)
     )
+
     abs_scaled = x_abs / block_max_quant_broadcast
 
-    # Quantize to FP4 values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}, following round to even
     q_val = tl.where(
         abs_scaled <= 0.25,
         0.0,
@@ -103,64 +119,92 @@ def fp4_fake_quant_kernel(
                     tl.where(
                         abs_scaled <= 2.5,
                         2.0,
-                        tl.where(abs_scaled < 3.5, 3.0, tl.where(abs_scaled <= 5.0, 4.0, 6.0)),
+                        tl.where(
+                            abs_scaled < 3.5,
+                            3.0,
+                            tl.where(abs_scaled <= 5.0, 4.0, 6.0),
+                        ),
                     ),
                 ),
             ),
         ),
     )
 
-    # Apply signs and rescale
     x_rescaled = q_val * block_max_quant_broadcast
-    x_rescaled = tl.where(x_reshaped >= 0, x_rescaled, -x_rescaled)
+    x_rescaled = tl.where(tile_reshaped >= 0, x_rescaled, -x_rescaled)
 
-    # Reshape back and store
-    x_rescaled = tl.reshape(x_rescaled, (TILE_SIZE, TILE_SIZE))
-    tl.store(y_ptr + offs, x_rescaled, mask=mask)
+    tile_quant = tl.reshape(x_rescaled, (TILE_M, TILE_N))
+
+    tl.store(y_block_ptr, tile_quant.to(OUT_DTYPE), boundary_check=(0, 1))
 
 
 def fp4_fake_quant_block(
     x: torch.Tensor,
     global_amax: torch.Tensor,
     block_size: int = 16,
-    tile_size: int = 128,
+    tile_rows: int = 16,
+    tile_cols: int = 64,
+    num_warps: int | None = None,
+    num_stages: int | None = None,
 ) -> torch.Tensor:
-    """Applies FP4 fake quantization on the input tensor.
+    """FP4 fake quantization implementation using block-pointer tiling.
 
     Args:
-        x (torch.Tensor): Input tensor of shape (M, N)
-        global_amax (torch.Tensor): Global max value of the input tensor
-            This needs to be a tensor to be cuda-graph compatible
-        block_size (int): Size of FP4 quantization blocks
-        tile_size (int): Size of processing blocks
+        x (torch.Tensor): Input tensor of shape ``(M, N)``; higher-rank inputs are flattened to 2D.
+        global_amax (torch.Tensor): Global maximum value tensor for scaling.
+        block_size (int): Number of elements per FP4 block.
+        tile_rows (int, optional): Row tile size. Defaults to 16.
+        tile_cols (int, optional): Column tile size. Defaults to 64. Rounded up to
+            the nearest multiple of ``block_size`` internally.
+        num_warps (int | None, optional): Triton warp count. Triton's default when ``None``.
+        num_stages (int | None, optional): Pipeline stage count. Triton's default when ``None``.
 
     Returns:
-        torch.Tensor: Quantized tensor of the same shape as input
+        torch.Tensor: Fake-quantized tensor matching the input shape and dtype.
     """
     x_shape = x.shape
     x_dtype = x.dtype
     x = x.reshape(-1, x_shape[-1]).contiguous()
 
-    M, N = x.size()
-    y = torch.empty_like(x, dtype=torch.get_default_dtype())
+    M, N = x.shape
+    y = torch.empty_like(x)
+
+    stride_xm, stride_xn = x.stride()
+    stride_ym, stride_yn = y.stride()
+
+    tile_cols = max(tile_cols, block_size)
+    tile_cols_aligned = ((tile_cols + block_size - 1) // block_size) * block_size
+    num_fp4_blocks = tile_cols_aligned // block_size
 
-    grid = lambda meta: (
-        triton.cdiv(M, meta["TILE_SIZE"]),
-        triton.cdiv(N, meta["TILE_SIZE"]),
-    )
     global_scale = global_amax.float() / (6.0 * 448.0)
-    num_fp4_blocks = tile_size // block_size
+
+    grid = lambda *_: (triton.cdiv(M, tile_rows), triton.cdiv(N, tile_cols_aligned))
+
+    launch_kwargs = {
+        "BLOCK_SIZE": block_size,
+        "TILE_M": tile_rows,
+        "TILE_N": tile_cols_aligned,
+        "NUM_FP4_BLOCKS": num_fp4_blocks,
+        "OUT_DTYPE": _torch_dtype_to_tl(x_dtype),
+    }
+    if num_warps is not None:
+        launch_kwargs["num_warps"] = num_warps
+    if num_stages is not None:
+        launch_kwargs["num_stages"] = num_stages
     fp4_fake_quant_kernel[grid](
         x,
         y,
         M,
         N,
         global_scale,
-        TILE_SIZE=tile_size,
-        BLOCK_SIZE=block_size,
-        NUM_FP4_BLOCKS=num_fp4_blocks,
+        stride_xm,
+        stride_xn,
+        stride_ym,
+        stride_yn,
+        **launch_kwargs,
     )
-    y = y.reshape(x_shape).contiguous().to(dtype=x_dtype)
+
+    y = y.view(*x_shape)
     return y
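
To check the nested `tl.where` chain by hand, a scalar reference for the FP4 (E2M1) rounding step can be useful. The sketch below assumes round-to-nearest with ties going to the value whose mantissa bit is even, as the removed kernel comment ("following round to even") describes. It reproduces the thresholds visible in the diff (0.25, 2.5, 3.5, 5.0); the thresholds hidden by the hunk boundary are inferred from that rule, not read from the source. The per-block scale is taken as given, so the FP8 rounding of the scale is not modelled. `FP4_GRID`, `round_to_fp4`, and `fake_quant_value` are illustrative names.

```python
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # E2M1 magnitudes


def round_to_fp4(a: float) -> float:
    """Round a non-negative magnitude to the nearest E2M1 value.

    Ties go to the even grid index (even mantissa bit). This tie rule is an
    assumption taken from the removed "round to even" comment.
    """
    if a < 0:
        raise ValueError("expects a magnitude; apply the sign separately")
    for i in range(len(FP4_GRID) - 1):
        lo, hi = FP4_GRID[i], FP4_GRID[i + 1]
        mid = (lo + hi) / 2
        if a < mid:
            return lo
        if a == mid:
            # Even index below the midpoint wins the tie, otherwise round up.
            return lo if i % 2 == 0 else hi
    return FP4_GRID[-1]  # anything above 5.0 saturates at 6.0


def fake_quant_value(x: float, scale: float) -> float:
    """Quantize x with a given per-block scale, rescale, and restore the sign."""
    q = round_to_fp4(abs(x) / scale) * scale
    return q if x >= 0 else -q


if __name__ == "__main__":
    for v in (0.25, 0.3, 2.5, 3.5, 5.0, 7.0):
        print(v, "->", round_to_fp4(v))
```

A reference like this is only meant for spot checks on a handful of values; comparing whole tensors against the previous kernel, as the PR's testing section does, remains the stronger test.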