HIP: RDNA4 tensor core support for MMF #17077
base: master
Conversation
On RDNA, the WMMA instructions do not, to my knowledge, increase peak FLOPS; they only reduce I/O and register usage.
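For context, a minimal sketch of what one such WMMA issue looks like at the intrinsic level, assuming the gfx11/RDNA3 wave32 builtin (RDNA4/gfx12 has a separate `_gfx12` variant with a different, non-duplicated A/B layout); this is not code from this PR:

```cpp
#include <hip/hip_runtime.h>

// Per-lane operand types for the wave32 WMMA builtin on gfx11:
// 16 halves of A, 16 halves of B (duplicated across the two wave halves)
// and 8 floats of the accumulator per lane.
typedef _Float16 half16 __attribute__((ext_vector_type(16)));
typedef float    float8 __attribute__((ext_vector_type(8)));

// One instruction computes a full 16x16x16 tile: D = A*B + C.
__device__ float8 wmma_16x16x16_f16(half16 a, half16 b, float8 c) {
#if defined(__gfx1100__) || defined(__gfx1101__) || defined(__gfx1102__)
    return __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, c);
#else
    return c; // keep the sketch compilable on other targets
#endif
}
```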
Yes, sorry, that was a bug that I introduced.
Thank you for the tip. AFAIK the tensor core on RDNA3 uses the same silicon as the vector instructions, while RDNA4 redesigns the tensor core and makes it more like CDNA. But at the very least it shouldn't be slower than hipBLAS, so I'll spend some time finding the root cause; one thing I do know is that the HIP compiler doesn't allocate registers very well.
Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts, or whatever you would call it for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated …
Thank you for the tips. There is little info on the AMD bank layout; based on the limited documentation I have, RDNA3 has 32 banks in CU mode and 64 banks in WGP mode (WGP mode is the default), and the bank width is a DWORD. I don't have any doc for RDNA4, so I assume it's similar, and I didn't change any code logic in mmf, just adapted the WMMA instruction. Based on the WMMA layout of RDNA4, I kept the old ldmatrix logic and use vectorized loads in load_generic. Honestly, I'm not sure that shared memory bank conflicts are the root cause.
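To illustrate the padding point with a minimal sketch (the tile sizes and padding below are illustrative assumptions, not the values used in mmf.cuh):

```cpp
#include <hip/hip_fp16.h>
#include <hip/hip_runtime.h>

constexpr int TILE_ROWS = 16;
constexpr int TILE_COLS = 64;  // 64 halves = 128 bytes per row
constexpr int PAD       = 8;   // 8 halves = 16 bytes of padding per row

__global__ void fill_tile(const __half * src) {
    // Without PAD, consecutive rows are 128 bytes apart, so with 4-byte banks
    // many rows of a column map to the same bank and a column-wise read
    // serializes. The padding shifts each row's start into a different bank.
    __shared__ __half tile[TILE_ROWS][TILE_COLS + PAD];

    const int r = threadIdx.x % TILE_ROWS;
    const int c = threadIdx.x / TILE_ROWS;
    tile[r][c]  = src[r*TILE_COLS + c];
    __syncthreads();
}
```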
Are you aware of the AMD ISA documentation?
Honestly, not very much, as it isn't very friendly for software developers. Based on the GEMM benchmarks of my modified NVIDIA CuTe port for RDNA3, the bank layout is the same as NVIDIA Ampere. I just checked the RDNA4 ISA: it's 32 banks in CU mode and 64 banks in WGP mode, and the bank width is 4 bytes.
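To make those numbers concrete, a tiny helper (illustrative only) for where a byte address lands, given 4-byte banks and the 32/64 bank counts from the ISA doc:

```cpp
// Which LDS bank a byte address maps to, for 4-byte banks.
constexpr int lds_bank(long long byte_addr, int n_banks) {
    return int((byte_addr / 4) % n_banks);
}

// Two elements 256 bytes apart share a bank in both CU mode (32 banks)
// and WGP mode (64 banks), so a row stride that is a multiple of 256 bytes
// is the worst case for column-wise access.
static_assert(lds_bank(0, 32) == lds_bank(256, 32), "CU mode: same bank");
static_assert(lds_bank(0, 64) == lds_bank(256, 64), "WGP mode: same bank");
```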
I think I've found the root cause: the MMF path for MUL_MAT is slower than hipBLAS, while the MMF path for MUL_MAT_ID is faster than hipBLAS.
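(For reference, and assuming the standard llama.cpp tooling, per-op numbers like these are usually collected with `test-backend-ops perf -o MUL_MAT` and `test-backend-ops perf -o MUL_MAT_ID`.)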
How about this: for now we move towards merging this PR but only enable it for …
Thank you for the support, this is also what I'm thinking: just disable mul_mat_f on RDNA4 first and try to write an RDNA4-optimized version in the future. I also presume that the HIP compiler generates better code on RDNA3 than on RDNA4; I will test on my 7900 XTX next week. Anyway, could you please review it first? One thing to note is that the HIP compiler cannot handle early-return code well: it still compiles the code after the return. Also, I don't see a performance improvement with real models like llama3-8b-fp16 and deepseek-r1-8b-fp16 at batch sizes 1~16; it looks like I need to test with batch 512, right?
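(Assuming the usual llama.cpp tooling, the larger-batch case can be checked with something like `llama-bench -m <model> -p 512 -n 0`, which times prompt processing in batches of 512 tokens.)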
@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp's internal code, I think my schedule would be:
I will start them once this PR is approved. Also, I suggest putting FA on RDNA3 at low priority, as RDNA3 WMMA isn't suitable for GEMM fusion; you need shared memory to rearrange the layout of the D matrix of QK.
@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support, so I'd suggest we do it like this: first I make some changes to the MMA FA kernel, during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist for where in the code it's necessary to make changes due to the different data layout.
I agree, please move forward first and I will build on your changes. Anyway, I still need some time to find a good way to support C = B * A for RDNA4 and CDNA3; maybe adding a new tile class is enough.
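To sketch what such a new tile class could look like, purely hypothetically (the name and per-lane mapping below are made up for illustration and are not existing llama.cpp code):

```cpp
#include <hip/hip_runtime.h>

// Hypothetical transposed tile fragment so that C = B * A could reuse the same
// mma() entry point. The per-lane mapping below is a simplified placeholder;
// the real one must match the operand layout expected by the RDNA4 WMMA and
// CDNA3 MFMA instructions.
template <int I, int J, typename T>
struct tile_transposed {
    static constexpr int ne = I*J/32; // elements per lane, assuming wave32
    T x[ne] = {};

    __device__ void load(const T * src, int stride) {
        const int lane = threadIdx.x % 32;
        for (int e = 0; e < ne; ++e) {
            const int idx = lane + 32*e; // linear element owned by this lane
            const int i   = idx % I;     // swapped i/j derivation ...
            const int j   = idx / I;     // ... yields a transposed fragment
            x[e] = src[j*stride + i];
        }
    }
};
```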
Add RDNA4 tensor core support for MMF; honestly, the performance is lower than expected. The model used for testing is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF
@JohannesGaessler, it looks like #16988 changes mmf.cu to …
With that change, native MMF isn't exercised on my RDNA4; it always takes the hipBLAS path.
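Schematically (and purely as a hypothetical illustration, not the actual code from #16988), the kind of gate being described looks like this: when the capability check rejects the device, the native MMF kernel is never launched and the multiplication falls back to hipBLAS.

```cpp
// Hypothetical illustration only, not code from #16988 or this PR.
#include <cstdint>

static bool use_native_mmf_example(bool device_has_mmf_support, int64_t ncols_dst) {
    if (!device_has_mmf_support) {
        return false;          // -> caller takes the hipBLAS/cuBLAS fallback
    }
    return ncols_dst <= 16;    // MMF targets small batch sizes (illustrative threshold)
}
```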