[audio] weight_norm standardization

> [!IMPORTANT]
> **DRAFT:** This issue is provided for visibility; the recommendations below will evolve.

This issue serves as a tracker to standardise the usage of `weight_norm` throughout the library for our audio models and to establish good practices.

Different approaches:
1. conversion time: remove weight norm once when converting the weights
2. inference time: remove weight norm at init, meaning the loaded `state_dict` is the one with weight norm
3. do not remove weight norm, meaning the correct weight is recomputed on each forward pass
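
To make the trade-off concrete, here is a minimal, dependency-free sketch of the arithmetic behind the three approaches. It assumes the usual weight-norm reparametrization `w = g * v / ||v||` for a single output row. The helper names (`effective_weight`, `linear`, `forward_with_weight_norm`) and the toy values are hypothetical and not taken from any model in the library. In PyTorch the corresponding utilities are `torch.nn.utils.weight_norm` and `torch.nn.utils.remove_weight_norm`.

```python
import math


def effective_weight(g, v):
    """Weight-norm reparametrization for one output row: w = g * v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def linear(w, x):
    """Dot product of one weight row with the input."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical checkpoint values: magnitude g and direction v for one row.
g, v = 2.0, [3.0, 4.0]
x = [1.0, 1.0]

# Approaches 1 and 2: fold (g, v) into a plain weight once,
# either at conversion time or when the model is initialised.
w_folded = effective_weight(g, v)
y_folded = linear(w_folded, x)


# Approach 3: keep (g, v) and recompute the weight on every forward call.
def forward_with_weight_norm(x):
    return linear(effective_weight(g, v), x)


y_recomputed = forward_with_weight_norm(x)
```

Both paths produce the same output. The difference is which tensors end up in the checkpoint (`w` alone versus `g` and `v`) and whether the normalisation is paid once or on every call.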

A summary of how it is done currently throughout the lib:

| **Model**             | **In 🤗 Transformers**             | **Original Codebase / Source Project**                       |
| --------------------- | ---------------------------------- | ------------------------------------------------------------ |
| dac                   | conversion time                    | inference time (not removed)                                 |
| encodec               | inference time (not removed)       | inference time (not removed)                                 |
| fastspeech2_conformer | NA (copied from)                   | NA (copied from)                                             |
| hubert                | inference time (not removed)       | inference time (not removed)                                 |
| mimi                  | likely weights have been converted | likely weights have been converted for the inference version |
| seamless_m4t          | NA (copied from)                   | NA (copied from)                                             |
| seamless_m4t_v2       | NA (copied from)                   | NA (copied from)                                             |
| sew                   | inference time (not removed)       |                                                              |
| sew_d                 | inference time (not removed)       |                                                              |
| speecht5              | inference time (not removed)       |                                                              |
| unispeech             | inference time (not removed)       |                                                              |
| unispeech_sat         | inference time (not removed)       |                                                              |
| univnet               | conversion time                    |                                                              |
| vits                  | inference time (not removed)       |                                                              |
| wav2vec2              | inference time (not removed)       |                                                              |
| wav2vec2_conformer    | inference time (not removed)       |                                                              |
| wavlm                 | inference time (not removed)       |                                                              |


