HIP: RDNA4 tensor core support for MMF #17077
base: master
Conversation
On RDNA, the WMMA instructions do not, to my knowledge, increase peak FLOPS; they only reduce I/O and register usage.
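For context, a minimal sketch of what one such WMMA issue looks like at the intrinsic level, assuming the gfx11/RDNA3 wave32 builtin (RDNA4/gfx12 has a separate `_gfx12` variant with a different, non-duplicated A/B layout); this is not code from this PR:

```cpp
#include <hip/hip_runtime.h>

// Per-lane operand types for the wave32 WMMA builtin on gfx11:
// 16 halves of A, 16 halves of B (duplicated across the two wave halves)
// and 8 floats of the accumulator per lane.
typedef _Float16 half16 __attribute__((ext_vector_type(16)));
typedef float    float8 __attribute__((ext_vector_type(8)));

// One instruction computes a full 16x16x16 tile: D = A*B + C.
__device__ float8 wmma_16x16x16_f16(half16 a, half16 b, float8 c) {
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
    return __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, c);
#else
    return c; // keep the sketch compilable on other targets
#endif
}
```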
Yes, sorry, that was a bug that I introduced.
Thank you for the tip. AFAIK the tensor core on RDNA3 uses the same silicon as the vector instructions, while RDNA4 redesigns the tensor core and makes it more like CDNA. But at the very least it shouldn't be slower than hipBLAS, so I'll spend some time finding the root cause; one thing I do know is that the HIP compiler doesn't allocate registers very well.
Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts, or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated …
Thank you for the tips. There is little info on the AMD bank layout; based on the limited documentation I have, RDNA3 has 32 banks in CU mode and 64 banks in WGP mode (WGP mode is the default), and the bank width is a DWORD. I don't have any doc for RDNA4, so I assume it's similar, and I didn't change any code logic in mmf, just adapted the WMMA instruction. Based on the WMMA layout of RDNA4, I kept the old ldmatrix logic and use vectorized loads in load_generic. Honestly, I'm not sure that shared memory bank conflicts are the root cause.
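To illustrate the padding point with a minimal sketch (the tile sizes and padding below are illustrative assumptions, not the values used in mmf.cuh):

```cpp
#include <hip/hip_fp16.h>
#include <hip/hip_runtime.h>

constexpr int TILE_ROWS = 16;
constexpr int TILE_COLS = 64;  // 64 halves = 128 bytes per row
constexpr int PAD       = 8;   // 8 halves = 16 bytes of padding per row

__global__ void fill_tile(const __half * src) {
    // Without PAD, consecutive rows are 128 bytes apart, so with 4-byte banks
    // many rows of a column map to the same bank and a column-wise read
    // serializes. The padding shifts each row's start into a different bank.
    __shared__ __half tile[TILE_ROWS][TILE_COLS + PAD];

    const int r = threadIdx.x % TILE_ROWS;
    const int c = threadIdx.x / TILE_ROWS;
    tile[r][c]  = src[r*TILE_COLS + c];
    __syncthreads();
}
```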
Are you aware of the AMD ISA documentation?
Honestly, not very much, as it isn't very friendly for software developers. Based on the GEMM benchmarks of my modified NVIDIA CuTe port for RDNA3, the bank layout is the same as NVIDIA Ampere. I just checked the RDNA4 ISA: it's 32 banks in CU mode and 64 banks in WGP mode, and the bank width is 4 bytes.
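To make those numbers concrete, a tiny helper (illustrative only) for where a byte address lands, given 4-byte banks and the 32/64 bank counts from the ISA doc:

```cpp
// Which LDS bank a byte address maps to, for 4-byte banks.
constexpr int lds_bank(long long byte_addr, int n_banks) {
    return int((byte_addr / 4) % n_banks);
}

// Two elements 256 bytes apart share a bank in both CU mode (32 banks)
// and WGP mode (64 banks), so a row stride that is a multiple of 256 bytes
// is the worst case for column-wise access.
static_assert(lds_bank(0, 32) == lds_bank(256, 32), "CU mode: same bank");
static_assert(lds_bank(0, 64) == lds_bank(256, 64), "WGP mode: same bank");
```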
I think I've found the root cause: the MMF path for MUL_MAT is slower than hipBLAS, while the MMF path for MUL_MAT_ID is faster than hipBLAS.
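(For reference, and assuming the standard llama.cpp tooling, per-op numbers like these are usually collected with `test-backend-ops perf -o MUL_MAT` and `test-backend-ops perf -o MUL_MAT_ID`.)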
How about this: for now we move towards merging this PR but only enable it for …
Thank you for the support, this is also what I'm thinking: just disable mul_mat_f on RDNA4 first and try to write an RDNA4-optimized version in the future. I also presume that the HIP compiler generates better code on RDNA3 than on RDNA4; I will test on my 7900 XTX next week. Anyway, could you please review it first? One thing to note is that the HIP compiler cannot handle early-return code well: it still compiles the code after the return. Also, I don't see a performance improvement with real models like llama3-8b-fp16 and deepseek-r1-8b-fp16 at batch sizes 1~16; it looks like I need to test with batch 512, right?
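(Assuming the usual llama.cpp tooling, the larger-batch case can be checked with something like `llama-bench -m <model> -p 512 -n 0`, which times prompt processing in batches of 512 tokens.)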
@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp's internal code, I think my schedule would be:
I will start them once this PR is approved. Also, I suggest putting FA on RDNA3 at low priority, as RDNA3 WMMA isn't suitable for GEMM fusion; you need shared memory to rearrange the layout of the D matrix of QK.
@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support, so I'd suggest we do it like this: first I make some changes to the MMA FA kernel, during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist for where in the code it's necessary to make changes due to the different data layout.
I agree, please move forward first and I will build on your changes. Anyway, I still need some time to find a good way to support C = B * A for RDNA4 and CDNA3; maybe adding a new tile class is enough.
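To sketch what such a new tile class could look like, purely hypothetically (the name and per-lane mapping below are made up for illustration and are not existing llama.cpp code):

```cpp
#include <hip/hip_runtime.h>

// Hypothetical transposed tile fragment so that C = B * A could reuse the same
// mma() entry point. The per-lane mapping below is a simplified placeholder;
// the real one must match the operand layout expected by the RDNA4 WMMA and
// CDNA3 MFMA instructions.
template <int I, int J, typename T>
struct tile_transposed {
    static constexpr int ne = I*J/32; // elements per lane, assuming wave32
    T x[ne] = {};

    __device__ void load(const T * src, int stride) {
        const int lane = threadIdx.x % 32;
        for (int e = 0; e < ne; ++e) {
            const int idx = lane + 32*e; // linear element owned by this lane
            const int i   = idx % I;     // swapped i/j derivation ...
            const int j   = idx / I;     // ... yields a transposed fragment
            x[e] = src[j*stride + i];
        }
    }
};
```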
Add RDNA4 tensor core support for MMF; honestly, the performance is lower than expected. The model used for testing is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF
@JohannesGaessler, it looks like #16988 changes mmf.cu to …
With that change, native MMF isn't exercised on my RDNA4; it always takes the hipBLAS path.
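Schematically (and purely as a hypothetical illustration, not the actual code from #16988), the kind of gate being described looks like this: when the capability check rejects the device, the native MMF kernel is never launched and the multiplication falls back to hipBLAS.

```cpp
// Hypothetical illustration only, not code from #16988 or this PR.
#include <cstdint>

static bool use_native_mmf_example(bool device_has_mmf_support, int64_t ncols_dst) {
    if (!device_has_mmf_support) {
        return false;          // -> caller takes the hipBLAS/cuBLAS fallback
    }
    return ncols_dst <= 16;    // MMF targets small batch sizes (illustrative threshold)
}
```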