
Commit 1709ed9

lashahub, ebezzam, and eustlb authored
[models] Add AudioFlamingo3 integration (#40290)
* Audio Flamingo 3 initial integration
* Added local Qwen
* Moving to AF3
* Loading directly from HF
* Formatting
* add snapshot_download
* Loading from hub
* Import gating
* Pass audio arrays directly
* Remove requires_backend
* Move constants to config.json
* Remove redundancies
* Separate tokenizer, cleaner from_pretrained
* Remove LlavaMetaModel
* Remove sound tower wrapper
* Merged BasicSoundEncoder
* Some improvements
* Towards AudioFlamingo3
* Migrate LlavaConfig
* Merge LlavaMetaForCausalLM into AudioFlamingo3ForConditionalGeneration
* Remove redundant lines
* Add AudioFlamingo3PreTrainedModel
* Unified model.safetensors
* Inline MM projector
* Tokenizer in root dir
* Default processor from_pretrained
* Remove tokenizer from modeling
* Added types
* Cleanup
* Docs & license
* device handling
* Change year
* Remove redundant methods
* Use BatchFeature
* Streamline audio feature handling
* Batch inference
* Reorder alphabetically
* Make style check
* Make fixup
* Avoid calls to separate functions
* Remove forward_tower()
* Rename encode_sound to get_audio_features for clarity
* Add batch decoding method to AudioFlamingo3Processor
* Use tensors instead of lists
* Move end embed token eval
* Prepare audio_features_mask in the processor
* No hardcoded 750 and 3000
* Remove _load_sound_mask completely and use WhisperFeatureExtractor
* Compute embeddings separately
* MM Projector is audio adaptor
* Simplify AudioFlamingo3Config initialization with default encoder_config
* Add modular
* Clean up
* make fixup
* Cleanup processing, add params to encoder config
* Remove redundant methods
* update config references, improve method names, and enhance logging in processor
* processor: move FE args to audio_kwargs, use common_kwargs for return_tensors
* Qwen-like processor
* Simplified AudioFlamingo3Processor
* Extract common code from generate() and forward()
* Add conversion script for AudioFlamingo3 to Hugging Face format
* Use save_pretrained()
* Don't overwrite gen config
* Use AutoTokenizer and FE to convert the processor
* minor formatting
* Finalize processor, do token expansion inside
* AudioFlamingo3: refactor docs, types, and audio–text feature merge
* AudioFlamingo3 Docs
* Add AudioFlamingo3Processor to AutoProcessor
* Processor tests
* Use audio_config instead of encoder_config
* Add audio_token_id to config
* Cleanup & new keys
* Add links
* Improved processor
* Handle conversational input
* Make processing consistent.
* Add fallback for no sound token, default left padding.
* Cleanup
* Replace manual 4D mask with masking_utils; dtype/device from inputs
* Text only mode
* Finalize processor
* Export processor directly
* Add push_to_hub to converter
* Add model_input_names property to AudioFlamingo3Processor to pass tests
* Processor chat template support
* Added Jinja processor chat template with audio support
* Processor tests
* Model tests
* Added docs
* Don't use common_kwargs in __call__
* Pass 'test_left_padding_compatibility' by never treating padding as content
* Updated docs
* Cleanup docs
* Standardization
* Update conversion script weight mapping.
* Flatten _build_square_attn_mask
* Make style
* Small dim and attn mask fix
* Fix processor padding side bug
* Error handling in converter
* Use position_ids
* Cleanup generation config
* Use precomputed position embeddings in AudioFlamingo3 encoder
* Added usage examples
* Fix generation config
* Integration tests
* Simplify modeling and shift part of mask preparation to processor. And update tests.
* Updated docs
* ASR convenience method
* Fixed tests
* make fixup
* Shift encoder mask preparation to the encoder's forward.
* Change to HF profiles.
* Integration test standardization.
* Clean up before integration test setup.
* Remove strict float32, more similar to Qwen2Audio.
* Use HF dataset links
* Keep weights in BF16
* New audio in tests
* Processor conventions.
* Standardize audio token expansion in processor.
* Add 'strip_prefix' to batch_decode
* Batch decode nits.
* Remove dtype casting.
* Read token ids from tokenizer
* diverse changes according to review
* add training example
* Add missing docstring.
* Fix typos.
* Add audio token docstring.
* Fix fill type.
* Fix docs
* Save converted weights in bf16
* Fix tests
* Keep model in bf16 for tests.
* Update expected results for single.
* Fix integration tests from runner.
* Update reproducer, and dtype nits.

---------

Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
1 parent fd36275 commit 1709ed9

19 files changed: +2741 -2 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -1008,6 +1008,8 @@
   title: AltCLIP
 - local: model_doc/aria
   title: Aria
+- local: model_doc/audioflamingo3
+  title: AudioFlamingo3
 - local: model_doc/aya_vision
   title: AyaVision
 - local: model_doc/blip
