Commit beb1d17

Add support for Voxtral (#1373)
1 parent 9407ed8 commit beb1d17

6 files changed (+66 −1 lines)


README.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -441,6 +441,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://huggingface.co/papers/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
 1. **[ViTPose](https://huggingface.co/docs/transformers/model_doc/vitpose)** (from The University of Sydney) released with the paper [ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation](https://huggingface.co/papers/2204.12484) by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao.
 1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://huggingface.co/papers/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
+1. **[Voxtral](https://huggingface.co/docs/transformers/model_doc/voxtral)** (from Mistral AI) released with the paper [Voxtral](https://huggingface.co/papers/2507.13264) by Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devendra Singh Chaplot, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gabrielle Berrada, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jason Rute, Jean-Hadrien Chabran, Jessica Chudnovsky, Joachim Studnia, Joep Barmentlo, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Lélio Renard Lavaud, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Matthieu Dinot, Maxime Darrin, Maximilian Augustin, Mickaël Seznec, Neha Gupta, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Rémi Delacourt, Romain Sauvestre, Roman Soletskyi, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Shashwat Dalal, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Tom Bewley, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yihan Wan, Yunhao Tang.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/main/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://huggingface.co/papers/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
```

docs/snippets/6_supported-models.snippet

Lines changed: 1 addition & 0 deletions
```diff
@@ -155,6 +155,7 @@
 1. **[ViTMSN](https://huggingface.co/docs/transformers/model_doc/vit_msn)** (from Meta AI) released with the paper [Masked Siamese Networks for Label-Efficient Learning](https://huggingface.co/papers/2204.07141) by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.
 1. **[ViTPose](https://huggingface.co/docs/transformers/model_doc/vitpose)** (from The University of Sydney) released with the paper [ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation](https://huggingface.co/papers/2204.12484) by Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao.
 1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://huggingface.co/papers/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
+1. **[Voxtral](https://huggingface.co/docs/transformers/model_doc/voxtral)** (from Mistral AI) released with the paper [Voxtral](https://huggingface.co/papers/2507.13264) by Alexander H. Liu, Andy Ehrenberg, Andy Lo, Clément Denoix, Corentin Barreau, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Sanchit Gandhi, Soham Ghosh, Srijan Mishra, Thomas Foubert, Abhinav Rastogi, Adam Yang, Albert Q. Jiang, Alexandre Sablayrolles, Amélie Héliou, Amélie Martin, Anmol Agarwal, Antoine Roux, Arthur Darcet, Arthur Mensch, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Chris Bamford, Christian Wallenwein, Christophe Renaudin, Clémence Lanfranchi, Darius Dabert, Devendra Singh Chaplot, Devon Mizelle, Diego de las Casas, Elliot Chane-Sane, Emilien Fugier, Emma Bou Hanna, Gabrielle Berrada, Gauthier Delerce, Gauthier Guinet, Georgii Novikov, Guillaume Martin, Himanshu Jaju, Jan Ludziejewski, Jason Rute, Jean-Hadrien Chabran, Jessica Chudnovsky, Joachim Studnia, Joep Barmentlo, Jonas Amar, Josselin Somerville Roberts, Julien Denize, Karan Saxena, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Lélio Renard Lavaud, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Marie Pellat, Mathilde Guillaumin, Mathis Felardos, Matthieu Dinot, Maxime Darrin, Maximilian Augustin, Mickaël Seznec, Neha Gupta, Nikhil Raghuraman, Olivier Duchenne, Patricia Wang, Patryk Saffer, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Pravesh Agrawal, Rémi Delacourt, Romain Sauvestre, Roman Soletskyi, Sagar Vaze, Sandeep Subramanian, Saurabh Garg, Shashwat Dalal, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Thibault Schueller, Thibaut Lavril, Thomas Robert, Thomas Wang, Timothée Lacroix, Tom Bewley, Valeriia Nemychnikova, Victor Paltz, Virgile Richard, Wen-Ding Li, William Marshall, Xuanyu Zhang, Yihan Wan, Yunhao Tang.
 1. **[Wav2Vec2](https://huggingface.co/docs/transformers/model_doc/wav2vec2)** (from Facebook AI) released with the paper [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://huggingface.co/papers/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
 1. **[Wav2Vec2-BERT](https://huggingface.co/docs/transformers/main/model_doc/wav2vec2-bert)** (from Meta AI) released with the paper [Seamless: Multilingual Expressive and Streaming Speech Translation](https://ai.meta.com/research/publications/seamless-multilingual-expressive-and-streaming-speech-translation/) by the Seamless Communication team.
 1. **[WavLM](https://huggingface.co/docs/transformers/model_doc/wavlm)** (from Microsoft Research) released with the paper [WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing](https://huggingface.co/papers/2110.13900) by Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Furu Wei.
```

src/configs.js

Lines changed: 2 additions & 0 deletions
```diff
@@ -72,6 +72,7 @@ function getNormalizedConfig(config) {
         case 'llava_onevision':
         case 'idefics3':
         case 'ultravox':
+        case 'voxtral':
         case 'smolvlm':
         case 'gemma3n':
             // @ts-expect-error TS2339
@@ -129,6 +130,7 @@ function getNormalizedConfig(config) {
             mapping['num_layers'] = 'num_hidden_layers';
             mapping['hidden_size'] = 'hidden_size';
             mapping['num_attention_heads'] = 'num_attention_heads';
+            mapping['dim_kv'] = 'head_dim';
             break;
         case 'qwen3':
         case 'gemma':
```
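Two things happen in these hunks: `voxtral` joins the switch branch that `ultravox` already uses, so its nested text config is normalized the same way, and the new `dim_kv` entry maps the normalized key onto the checkpoint's `head_dim` field. A minimal sketch of why that mapping matters, with made-up values (not from a real Voxtral checkpoint); the fallback derivation mentioned in the comment is my reading of the cache-sizing logic, not something shown in this diff:

```js
// Illustrative text config — values are invented for the example.
const text_config = {
    num_hidden_layers: 32,
    hidden_size: 3072,
    num_attention_heads: 32,
    head_dim: 128,
};

// With mapping['dim_kv'] = 'head_dim', the normalized config reports
// dim_kv = 128 directly. Without it, a per-head dimension would presumably
// be derived as hidden_size / num_attention_heads = 96 — the wrong shape
// for this model's key/value cache buffers.
```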

src/models.js

Lines changed: 4 additions & 1 deletion
```diff
@@ -7407,14 +7407,16 @@ export class UltravoxModel extends UltravoxPreTrainedModel {
 
         return default_merge_input_ids_with_audio_features({
             // @ts-ignore
-            audio_token_id: this.config.ignore_index,
+            audio_token_id: this.config.ignore_index ?? this.config.audio_token_id,
             ...kwargs,
             audio_features: reshaped_audio_features,
         })
     }
 }
 //////////////////////////////////////////////////
 
+export class VoxtralForConditionalGeneration extends UltravoxModel { }
+
 //////////////////////////////////////////////////
 // Mimi models
 export class MimiPreTrainedModel extends PreTrainedModel {
@@ -8024,6 +8026,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
 
 const MODEL_FOR_AUDIO_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
     ['ultravox', ['UltravoxModel', UltravoxModel]],
+    ['voxtral', ['VoxtralForConditionalGeneration', VoxtralForConditionalGeneration]],
 ]);
 
```
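`VoxtralForConditionalGeneration` inherits `UltravoxModel`'s audio-feature merging wholesale; the `??` fallback above is needed presumably because Ultravox configs store the audio placeholder id in `ignore_index` while Voxtral configs expose it as `audio_token_id`. A minimal usage sketch, assuming an ONNX export of Voxtral and its processor exist on the Hub — the checkpoint id below is hypothetical:

```js
import { AutoProcessor, VoxtralForConditionalGeneration, read_audio } from '@huggingface/transformers';

// Hypothetical checkpoint id — substitute a real ONNX export of Voxtral.
const model_id = 'onnx-community/Voxtral-Mini-3B-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await VoxtralForConditionalGeneration.from_pretrained(model_id);

// Exactly one [AUDIO] placeholder per audio input, as VoxtralProcessor requires.
const audio = await read_audio('https://example.com/sample.wav', 16000);
const inputs = await processor('Transcribe this clip: [AUDIO]', audio);

// Generate and decode; the output includes the prompt tokens.
const output_ids = await model.generate({ ...inputs, max_new_tokens: 128 });
console.log(processor.tokenizer.batch_decode(output_ids, { skip_special_tokens: true }));
```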
src/models/processors.js

Lines changed: 1 addition & 0 deletions
```diff
@@ -16,6 +16,7 @@ export * from './sam/processing_sam.js';
 export * from './smolvlm/processing_smolvlm.js';
 export * from './speecht5/processing_speecht5.js';
 export * from './ultravox/processing_ultravox.js';
+export * from './voxtral/processing_voxtral.js';
 export * from './wav2vec2/processing_wav2vec2.js';
 export * from './wav2vec2_with_lm/processing_wav2vec2_with_lm.js';
 export * from './whisper/processing_whisper.js';
```
src/models/voxtral/processing_voxtral.js

Lines changed: 57 additions & 0 deletions (new file)

```js
import { AutoFeatureExtractor } from "../auto/feature_extraction_auto.js"
import { AutoTokenizer } from "../../tokenizers.js"
import { Processor } from "../../base/processing_utils.js"
import { cat } from "../../utils/tensor.js";

const AUDIO_TOKEN = "[AUDIO]";
const BEGIN_AUDIO_TOKEN = "[BEGIN_AUDIO]";
const NUM_AUDIO_TOKENS = 375;

/**
 * Represents a VoxtralProcessor that extracts features from an audio input.
 */
export class VoxtralProcessor extends Processor {
    static tokenizer_class = AutoTokenizer
    static feature_extractor_class = AutoFeatureExtractor
    static uses_processor_config = false;

    /**
     * @param {string} text The text input to process.
     * @param {Float32Array|Float32Array[]} audio The audio input(s) to process.
     */
    async _call(text, audio = null, kwargs = {}) {
        if (Array.isArray(text)) {
            throw new Error("Batched inputs are not supported yet.");
        }

        const audio_inputs = {};
        if (audio) {
            if (!text.includes(AUDIO_TOKEN)) {
                throw new Error(`The input text does not contain the audio token ${AUDIO_TOKEN}.`);
            }
            if (!Array.isArray(audio)) {
                audio = [audio];
            }
            const num_audio_tokens = text.split(AUDIO_TOKEN).length - 1;
            if (num_audio_tokens !== audio.length) {
                throw new Error(`The number of audio inputs (${audio.length}) does not match the number of audio tokens in the text (${num_audio_tokens}).`);
            }
            const features = (await Promise.all(
                audio.map((audio_input) => this.feature_extractor(audio_input, kwargs))
            )).map(x => x.input_features);
            audio_inputs["audio_values"] = features.length > 1 ? cat(features, 0) : features[0];

            text = text.replaceAll(AUDIO_TOKEN, BEGIN_AUDIO_TOKEN + AUDIO_TOKEN.repeat(NUM_AUDIO_TOKENS));
        }

        const text_inputs = this.tokenizer(text, {
            add_special_tokens: false,
            ...kwargs,
        });

        return {
            ...text_inputs,
            ...audio_inputs,
        }
    }
}
```
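For reference, a small sketch of the processor in isolation (same hypothetical checkpoint id as above, and assuming the class is re-exported from the package root via the `processors.js` barrel shown earlier). Each `[AUDIO]` placeholder is checked against the number of audio arrays, expanded to `[BEGIN_AUDIO]` plus 375 repeated `[AUDIO]` tokens before tokenization, and the extracted features are concatenated into a single `audio_values` tensor:

```js
import { VoxtralProcessor } from '@huggingface/transformers';

// Hypothetical checkpoint id.
const processor = await VoxtralProcessor.from_pretrained('onnx-community/Voxtral-Mini-3B-ONNX');

// One second of silence at 16 kHz, just to exercise the code path.
const audio = new Float32Array(16000);
const { input_ids, attention_mask, audio_values } = await processor(
    'Describe this sound: [AUDIO]',
    audio,
);
// input_ids now spans '[BEGIN_AUDIO]' followed by 375 '[AUDIO]' tokens in
// place of the single placeholder; audio_values holds the extracted features.
```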
