Today, Float8WeightOnlyConfig maps to a reference implementation of weight-only quantization, which dequantizes the weight tensor and then runs a high-precision gemm:
```python
return torch.nn.functional.linear(
```
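For illustration, here is a minimal sketch of what such a reference fallback does (hypothetical function and parameter names, not the actual torchao code, which dispatches through its tensor subclasses):

```python
import torch

# Hypothetical sketch of a weight-only "reference" linear: the float8 weight is
# dequantized back to the activation dtype and a regular high-precision gemm runs.
def reference_weight_only_linear(input_tensor, fp8_weight, weight_scale, bias=None):
    # Dequantize: upcast the float8 weight and undo the (per-tensor) scale.
    hp_weight = fp8_weight.to(input_tensor.dtype) * weight_scale
    # No fp8 gemm is invoked, so no speedup over the bf16/fp16 baseline is expected;
    # the benefit is only the reduced memory footprint of the stored weight.
    return torch.nn.functional.linear(input_tensor, hp_weight, bias)
```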
Users have reported confusion about this; we should either clearly document that no speedup is expected from this config, or map it to a fast weight-only fp8 kernel.
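For context, this is roughly how users hit the path today (a sketch assuming the current torchao quantize_ API; exact import paths may vary across versions):

```python
import torch
from torchao.quantization import quantize_, Float8WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()
# Applies weight-only float8 quantization; with the current reference path this
# saves weight memory but is not expected to be faster than the bf16 baseline.
quantize_(model, Float8WeightOnlyConfig())
```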