Hi team,
Thanks for your great implementation of the new MXFP8 MoE! I have integrated it and am considering using it for production training.
However, I have a concern about how to handle inference.
MXFP8 is only available on B200. What is the expected inference solution on H100, or even on non-Nvidia GPUs, after training with MXFP8? Other quantization schemes, even other FP8 variants, are not guaranteed to work well with a model trained with MXFP8, since MXFP8 uses per-block power-of-two scales while conventional FP8 recipes typically use per-tensor scales.
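To make the concern concrete, here is a minimal fake-quantization sketch comparing the two scaling schemes. This is my own simulation of the numerics, not the torchao implementation, and `quantize_mxfp8` / `quantize_fp8_per_tensor` are hypothetical helpers:

```python
import torch

def quantize_mxfp8(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    # Simulated MXFP8 (OCP MX): every block of 32 consecutive elements
    # shares one power-of-two (E8M0) scale; elements are FP8 E4M3.
    # Assumes x.numel() is divisible by block_size.
    blocks = x.reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    emax_e4m3 = 8  # exponent of the E4M3 max normal value, 448 = 1.75 * 2**8
    scale = torch.exp2(torch.floor(torch.log2(amax)) - emax_e4m3)
    q = (blocks / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(x.dtype)
    return (q * scale).reshape(x.shape)

def quantize_fp8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    # Conventional per-tensor FP8: a single FP32 scale for the whole tensor.
    amax = x.abs().amax().clamp(min=1e-12)
    scale = 448.0 / amax
    q = (x * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(x.dtype)
    return q / scale

# A weight with wide dynamic range across channels: block scales can track
# the small-magnitude regions, a single per-tensor scale cannot.
w = torch.randn(128, 256) * torch.logspace(-3, 1, 256)
err_mx = (w - quantize_mxfp8(w)).abs().mean()
err_pt = (w - quantize_fp8_per_tensor(w)).abs().mean()
print(f"mean abs error  MXFP8: {err_mx:.6g}  per-tensor FP8: {err_pt:.6g}")
```

A model whose weights settled into MXFP8's per-block numerics during training may therefore lose accuracy when those same weights are re-quantized under per-tensor scaling at inference time.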
Is QAT finetuning with another quantization method the expected path?
Or should we just run inference with another quantization method, without any finetuning?
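If QAT is the answer, I imagine something like the following: a short finetune where the forward pass fake-quantizes weights into the deployment format (per-tensor FP8 here), while gradients reach the full-precision master weights through a straight-through estimator. This is only my sketch of the idea, not torchao's actual QAT API, and `FakeQuantFP8Linear` is a hypothetical module:

```python
import torch

class FakeQuantFP8Linear(torch.nn.Linear):
    # Hypothetical QAT layer: the forward pass sees per-tensor FP8
    # fake-quantized weights, i.e. the numerics we would deploy with,
    # while the optimizer keeps updating full-precision master weights.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        with torch.no_grad():
            amax = w.abs().amax().clamp(min=1e-12)
            scale = 448.0 / amax  # map the weight max to the E4M3 max (448)
            w_q = (w * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(w.dtype) / scale
        # Straight-through estimator: forward computes with w_q, but the
        # gradient w.r.t. w is passed through unchanged.
        w_ste = w + (w_q - w).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)

# Usage: replace nn.Linear modules in the MXFP8-trained model with this
# layer, then finetune briefly so weights adapt to per-tensor FP8 numerics.
```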
I would guess FP4 training raises the same question.
I think this question is not only for the TorchAO team; anyone who would like to, please share your ideas and insights.
Thanks in advance!