
[audio] weight_norm standardization #42064

@eustlb


Important

DRAFT: This issue is provided for visibility; the recommendations below will evolve.

This issue serves as a tracker to standardise the usage of weight_norm across the library's audio models and to establish good practices. There are three possible approaches:

  1. Conversion time: remove weight norm once, when converting the weights.
  2. Inference time: the loaded state_dict still contains the weight-norm parameters, and weight norm is removed at init.
  3. Never removed: weight norm is kept, meaning the effective weight is recomputed on every forward pass.

A summary of how weight norm is currently handled for each model in the library:

| Model | In 🤗 Transformers | Original Codebase / Source Project |
|---|---|---|
| dac | conversion time | inference time (not removed) |
| encodec | inference time (not removed) | inference time (not removed) |
| fastspeech2_conformer | NA (copied from) | NA (copied from) |
| hubert | inference time (not removed) | inference time (not removed) |
| mimi | likely weights have been converted | likely weights have been converted for the inference version |
| seamless_m4t | NA (copied from) | NA (copied from) |
| seamless_m4t_v2 | NA (copied from) | NA (copied from) |
| sew | inference time (not removed) | |
| sew_d | inference time (not removed) | |
| speecht5 | inference time (not removed) | |
| unispeech | inference time (not removed) | |
| unispeech_sat | inference time (not removed) | |
| univnet | conversion time | |
| vits | inference time (not removed) | |
| wav2vec2 | inference time (not removed) | |
| wav2vec2_conformer | inference time (not removed) | |
| wavlm | inference time (not removed) | |
