Commit 1709ed9
[models] Add AudioFlamingo3 integration (#40290)
* Audio Flamingo 3 initial integration
* Added local Qwen
* Moving to AF3
* Loading directly from HF
* Formatting
* add snapshot_download
* Loading from hub
* Import gating
* Pass audio arrays directly
* Remove requires_backend
* Move constants to config.json
* Remove redundancies
* Separate tokenizer, cleaner from_pretrained
* Remove LlavaMetaModel
* Remove sound tower wrapper
* Merged BasicSoundEncoder
* Some improvements
* Towards AudioFlamingo3
* Migrate LlavaConfig
* Merge LlavaMetaForCausalLM into AudioFlamingo3ForConditionalGeneration
* Remove redundant lines
* Add AudioFlamingo3PreTrainedModel
* Unified model.safetensors
* Inline MM projector
* Tokenizer in root dir
* Default processor from_pretrained
* Remove tokenizer from modeling
* Added types
* Cleanup
* Docs & license
* device handling
* Change year
* Remove redundant methods
* Use BatchFeature
* Streamline audio feature handling
* Batch inference
* Reorder alphabetically
* Make style check
* Make fixup
* Avoid calls to separate functions
* Remove forward_tower()
* Rename encode_sound to get_audio_features for clarity
* Add batch decoding method to AudioFlamingo3Processor
* Use tensors instead of lists
* Move end embed token eval
* Prepare audio_features_mask in the processor
* No hardcoded 750 and 3000
* Remove _load_sound_mask completely and use WhisperFeatureExtractor
* Compute embeddings separately
* MM Projector is audio adaptor
* Simplify AudioFlamingo3Config initialization with default encoder_config
* Add modular
* Clean up
* make fixup
* Cleanup processing, add params to encoder config
* Remove redundant methods
* update config references, improve method names, and enhance logging in processor
* processor: move FE args to audio_kwargs, use common_kwargs for return_tensors
* Qwen-like processor
* Simplified AudioFlamingo3Processor
* Extract common code from generate() and forward()
* Add conversion script for AudioFlamingo3 to Hugging Face format
* Use save_pretrained()
* Don't overwrite gen config
* Use AutoTokenizer and FE to convert the processor
* minor formatting
* Finalize processor, do token expansion inside
* AudioFlamingo3: refactor docs, types, and audio–text feature merge
* AudioFlamingo3 Docs
* Add AudioFlamingo3Processor to AutoProcessor
* Processor tests
* Use audio_config instead of encoder_config
* Add audio_token_id to config
* Cleanup & new keys
* Add links
* Improved processor
* Handle conversational input
* Make processing consistent.
* Add fallback for no sound token, default left padding.
* Cleanup
* Replace manual 4D mask with masking_utils; dtype/device from inputs
* Text only mode
* Finalize processor
* Export processor directly
* Add push_to_hub to converter
* Add model_input_names property to AudioFlamingo3Processor to pass tests
* Processor chat template support
* Added Jinja processor chat template with audio support
* Processor tests
* Model tests
* Added docs
* Don't use common_kwargs in __call__
* Pass 'test_left_padding_compatibility' by never treating padding as content
* Updated docs
* Cleanup docs
* Standardization
* Update conversion script weight mapping.
* Flatten _build_square_attn_mask
* Make style
* Small dim and attn mask fix
* Fix processor padding side bug
* Error handling in converter
* Use position_ids
* Cleanup generation config
* Use precomputed position embeddings in AudioFlamingo3 encoder
* Added usage examples
* Fix generation config
* Integration tests
* Simplify modeling and shift part of mask preparation to processor. And update tests.
* Updated docs
* ASR convenience method
* Fixed tests
* make fixup
* Shift encoder mask preparation to the encoder's forward.
* Change to HF profiles.
* Integration test standardization.
* Clean up before integration test setup.
* Remove strict float32, more similar to Qwen2Audio.
* Use HF dataset links
* Keep weights in BF16
* New audio in tests
* Processor conventions.
* Standardize audio token expansion in processor.
* Add 'strip_prefix' to batch_decode
* Batch decode nits.
* Remove dtype casting.
* Read token ids from tokenizer
* diverse changes according to review
* add training example
* Add missing docstring.
* Fix typos.
* Add audio token docstring.
* Fix fill type.
* Fix docs
* Save converted weights in bf16
* Fix tests
* Keep model in bf16 for tests.
* Update expected results for single.
* Fix integration tests from runner.
* Update reproducer, and dtype nits.
---------
Co-authored-by: Eric B <ebezzam@gmail.com>
Co-authored-by: Eustache Le Bihan <eulebihan@gmail.com>
1 parent fd36275 commit 1709ed9
File tree
19 files changed, +2741 -2 lines changed
- docs/source/en
  - model_doc
- src/transformers/models
  - audioflamingo3
  - auto
  - voxtral
- tests
  - fixtures/audioflamingo3
  - models/audioflamingo3