Hi team,
Thanks for your great implementation of the new MXFP8 MoE! I have integrated it and am considering using it for production training.
However, I have a concern about how to handle inference.
MXFP8 is only available on B200. What is the expected inference solution on H100, or even on non-Nvidia GPUs, after training with MXFP8? Other quantization schemes, even other FP8 variants, are not guaranteed to work well with a model trained with MXFP8, since MXFP8 uses per-block power-of-two scales while conventional FP8 recipes typically use per-tensor scales.
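To make the concern concrete, here is a minimal fake-quantization sketch comparing the two scaling schemes. This is my own simulation of the numerics, not the torchao implementation, and `quantize_mxfp8` / `quantize_fp8_per_tensor` are hypothetical helpers:

```python
import torch

def quantize_mxfp8(x: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    # Simulated MXFP8 (OCP MX): every block of 32 consecutive elements
    # shares one power-of-two (E8M0) scale; elements are FP8 E4M3.
    # Assumes x.numel() is divisible by block_size.
    blocks = x.reshape(-1, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    emax_e4m3 = 8  # exponent of the E4M3 max normal value, 448 = 1.75 * 2**8
    scale = torch.exp2(torch.floor(torch.log2(amax)) - emax_e4m3)
    q = (blocks / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(x.dtype)
    return (q * scale).reshape(x.shape)

def quantize_fp8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    # Conventional per-tensor FP8: a single FP32 scale for the whole tensor.
    amax = x.abs().amax().clamp(min=1e-12)
    scale = 448.0 / amax
    q = (x * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(x.dtype)
    return q / scale

# A weight with wide dynamic range across channels: block scales can track
# the small-magnitude regions, a single per-tensor scale cannot.
w = torch.randn(128, 256) * torch.logspace(-3, 1, 256)
err_mx = (w - quantize_mxfp8(w)).abs().mean()
err_pt = (w - quantize_fp8_per_tensor(w)).abs().mean()
print(f"mean abs error  MXFP8: {err_mx:.6g}  per-tensor FP8: {err_pt:.6g}")
```

A model whose weights settled into MXFP8's per-block numerics during training may therefore lose accuracy when those same weights are re-quantized under per-tensor scaling at inference time.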
Is QAT finetuning with another quantization method the expected path?
Or should we just run inference with another quantization method, without any finetuning?
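If QAT is the answer, I imagine something like the following: a short finetune where the forward pass fake-quantizes weights into the deployment format (per-tensor FP8 here), while gradients reach the full-precision master weights through a straight-through estimator. This is only my sketch of the idea, not torchao's actual QAT API, and `FakeQuantFP8Linear` is a hypothetical module:

```python
import torch

class FakeQuantFP8Linear(torch.nn.Linear):
    # Hypothetical QAT layer: the forward pass sees per-tensor FP8
    # fake-quantized weights, i.e. the numerics we would deploy with,
    # while the optimizer keeps updating full-precision master weights.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        with torch.no_grad():
            amax = w.abs().amax().clamp(min=1e-12)
            scale = 448.0 / amax  # map the weight max to the E4M3 max (448)
            w_q = (w * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn).to(w.dtype) / scale
        # Straight-through estimator: forward computes with w_q, but the
        # gradient w.r.t. w is passed through unchanged.
        w_ste = w + (w_q - w).detach()
        return torch.nn.functional.linear(x, w_ste, self.bias)

# Usage: replace nn.Linear modules in the MXFP8-trained model with this
# layer, then finetune briefly so weights adapt to per-tensor FP8 numerics.
```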
I would guess FP4 training raises the same question.
I think this question is not only for the TorchAO team; anyone who would like to, please share your ideas and insights.
Thanks in advance!