
[MXFP8 MoE] What's the expected inference solution on H100s, after training with TorchAO MXFP8 MoE? #3305

@goldhuang

Description

Hi team,

Thanks for your great implementation of the new MXFP8 MoE! I have integrated it and am considering using it for production training.
But I have a concern about how to do inference.

MXFP8 is only available on B200. What is the expected inference solution on H100s, or even non-NVIDIA GPUs, after training with MXFP8? Other quantization schemes, even a different FP8 quantization, are not guaranteed to work well with a model trained in MXFP8.
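For the no-finetune route, the option I'm looking at is simply re-quantizing the trained weights to plain rowwise FP8 for H100. A minimal sketch of what I mean, assuming a recent torchao build where `quantize_`, `Float8DynamicActivationFloat8WeightConfig`, and `PerRow` are available (the toy model is illustrative, not our actual MoE):

```python
import torch
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Float8DynamicActivationFloat8WeightConfig,
    PerRow,
)

# Toy stand-in for one trained expert; in practice you would load the
# MXFP8-trained checkpoint into a bf16 model first.
model = nn.Sequential(
    nn.Linear(4096, 14336),
    nn.SiLU(),
    nn.Linear(14336, 4096),
).cuda().to(torch.bfloat16)

# Re-quantize to plain FP8 (dynamic FP8 activations + FP8 weights with
# per-row scales), which H100 supports natively. Note the scaling
# granularity (per row) differs from MXFP8's per-32-element block
# scales, which is exactly why accuracy may not carry over.
quantize_(model, Float8DynamicActivationFloat8WeightConfig(granularity=PerRow()))

with torch.inference_mode():
    y = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
```

But as far as I can tell, nothing guarantees that weights trained under MXFP8's blockwise scaling quantize cleanly under a coarser rowwise scheme.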

- Is QAT finetuning with another quantization method expected? (A rough fake-quant sketch of what I mean is below.)
- Or should we just run inference with another quantization method without finetuning, like the re-quantization snippet above?
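For the QAT option, here is a minimal sketch of what I have in mind, using a plain-PyTorch straight-through fake-quant (this is not a torchao API; `FakeQuantE4M3` is a name I made up for illustration):

```python
import torch

class FakeQuantE4M3(torch.autograd.Function):
    """Per-tensor fake quantization to float8_e4m3fn with a
    straight-through estimator, so gradients pass through unchanged."""

    @staticmethod
    def forward(ctx, x: torch.Tensor) -> torch.Tensor:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = x.abs().amax().clamp(min=1e-12) / fp8_max
        # Round-trip through FP8 to simulate deployment-time precision.
        return (x / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        # Straight-through: treat the quantizer as identity in backward.
        return grad_out

# During the finetune, weight uses would be wrapped with the fake-quant op:
w = torch.randn(14336, 4096, requires_grad=True)
w_q = FakeQuantE4M3.apply(w)   # forward sees FP8-rounded weights
loss = w_q.square().mean()
loss.backward()                # w.grad flows via the STE
```

The idea would be a short finetune with the target (H100-friendly) quantization faked in the forward pass, so the weights adapt to the new scheme before export.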

I guess FP4 training is a similar case.

I think this question is not only for the TorchAO team; anyone, please share your ideas/insights if you'd like.

Thanks in advance!
