Update deserialize of added tokens #1891
Merged: ArthurZucker merged 36 commits into main from codex/optimize-addedvocabulary-deserialization-and-add-token-metho on Nov 27, 2025.
Commits (36, all by ArthurZucker):

- 7cf18aa Add benchmark for deserializing large added vocab
- 47c2e9f revert dumb stuff, isolate changes
- a8f6a71 try to only normalize once
- 35a9427 Merge branch 'main' into codex/optimize-addedvocabulary-deserializati…
- 8ba1d20 small improvement?
- a71555c Merge branch 'codex/optimize-addedvocabulary-deserialization-and-add-…
- 6714ceb some updates
- e07ecfc nit
- 8849d71 fmt
- 5da668a normalized string are a fucking waste of time when you just want to a…
- 8e7ce86 more attempts
- 948eead works
- 43cef92 let's fucking go, parity
- ae8a7b4 update
- 44beeb7 hahahhahaha
- e7f8954 revert changes that are not actually even needed
- 8d49849 add a python test!
- d8f07fa use normalizer before come on
- f6df603 nit
- 96a9563 Merge branch 'main' of github.com:huggingface/tokenizers into codex/o…
- bd671d1 update to a more concrete usecase
- 236f8ce fix build
- 2f12a63 style
- ae4b990 reduce sample size
- e20d5c7 --allow unmaintained
- 8423fc8 clippy happy
- a6b0a4d up
- 1bf0820 Merge branch 'main' of github.com:huggingface/tokenizers into codex/o…
- 7c69aea up
- 756c014 derive impl
- 669f78d revert unrelated
- c998b42 fmt
- 80de975 ignore
- 3c24367 Merge branch 'main' into codex/optimize-addedvocabulary-deserializati…
- ed86d8b remove stupid file
- 5f9db87 Merge branch 'codex/optimize-addedvocabulary-deserialization-and-add-…
New benchmark file (86 added lines in the diff):

```rust
#[macro_use]
extern crate criterion;
use criterion::Criterion;
use std::hint::black_box;
use std::str::FromStr;
use tokenizers::{normalizers::*, AddedToken, Normalizer, Tokenizer};

fn serialized_tokenizer<N: Normalizer + Into<NormalizerWrapper>>(
    size: i64,
    normalizer: Option<N>,
    special_tokens: bool,
) -> String {
    let mut tokenizer = Tokenizer::from_pretrained("t5-small", None).unwrap();

    if let Some(norm) = normalizer {
        tokenizer.with_normalizer(Some(norm));
    }

    let tokens: Vec<_> = (0..size)
        .map(|i| AddedToken::from(format!("tok{i}"), special_tokens))
        .collect();
    tokenizer.add_tokens(&tokens);

    serde_json::to_string(&tokenizer).unwrap()
}

#[allow(clippy::type_complexity)]
fn bench_deserialize(c: &mut Criterion) {
    let normalizers: Vec<(&str, Option<fn() -> NormalizerWrapper>)> = vec![
        ("none", None),
        ("byte_level", Some(|| ByteLevel.into())),
        ("lowercase", Some(|| Lowercase.into())),
        ("nfc", Some(|| NFC.into())),
        ("nfd", Some(|| NFD.into())),
        ("nfkc", Some(|| NFKC.into())),
        ("nfkd", Some(|| NFKD.into())),
        ("nmt", Some(|| Nmt.into())),
        ("strip", Some(|| Strip::new(true, true).into())),
        ("replace", Some(|| Replace::new("a", "b").unwrap().into())),
        ("prepend", Some(|| Prepend::new("pre_".to_string()).into())),
        ("bert", Some(|| BertNormalizer::default().into())),
    ];

    for &size in &[100_000, 400_000] {
        for (norm_name, maybe_factory) in &normalizers {
            let label = format!(
                "special tokens deserialize_added_vocab_{}_norm_{}",
                size, norm_name
            );

            let json = match maybe_factory {
                Some(factory) => serialized_tokenizer(size, Some(factory()), true),
                None => serialized_tokenizer::<NormalizerWrapper>(size, None, true),
            };
            c.bench_function(&label, |b| {
                b.iter(|| {
                    let tok: Tokenizer = black_box(Tokenizer::from_str(&json).unwrap());
                    black_box(tok);
                })
            });

            let label = format!(
                "non special deserialize_added_vocab_{}_norm_{}",
                size, norm_name
            );

            let json = match maybe_factory {
                Some(factory) => serialized_tokenizer(size, Some(factory()), false),
                None => serialized_tokenizer::<NormalizerWrapper>(size, None, false),
            };
            c.bench_function(&label, |b| {
                b.iter(|| {
                    let tok: Tokenizer = black_box(Tokenizer::from_str(&json).unwrap());
                    black_box(tok);
                })
            });
        }
    }
}

criterion_group! {
    name = benches;
    config = Criterion::default().significance_level(0.1).sample_size(10);
    targets = bench_deserialize
}
criterion_main!(benches);
```
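Several commit messages above ("try to only normalize once", "use normalizer before come on") suggest the optimization being benchmarked: normalize each added token a single time while rebuilding the vocabulary, rather than repeatedly afterwards. As a rough, hypothetical sketch of that idea in plain Rust (no `tokenizers` types; simple lowercasing stands in for a real normalizer, and `build_added_vocab` is an illustrative name, not the library's API):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a normalizer: here, plain lowercasing.
fn normalize(content: &str) -> String {
    content.to_lowercase()
}

// Rebuild the lookup maps for added tokens in one pass over the
// serialized entries, normalizing each token exactly once at load
// time instead of on every subsequent lookup.
fn build_added_vocab(
    tokens: &[(u32, &str)],
) -> (HashMap<String, u32>, HashMap<u32, String>) {
    let mut token_to_id = HashMap::with_capacity(tokens.len());
    let mut id_to_token = HashMap::with_capacity(tokens.len());
    for &(id, content) in tokens {
        let normalized = normalize(content); // done once, here
        token_to_id.insert(normalized.clone(), id);
        id_to_token.insert(id, normalized);
    }
    (token_to_id, id_to_token)
}

fn main() {
    let tokens = [(32_000, "TOK0"), (32_001, "Tok1")];
    let (token_to_id, id_to_token) = build_added_vocab(&tokens);
    assert_eq!(token_to_id.get("tok0"), Some(&32_000));
    assert_eq!(id_to_token.get(&32_001).map(String::as_str), Some("tok1"));
    println!("added vocab rebuilt: {} tokens", token_to_id.len());
}
```

This is only the shape of the change; the real deserialization path in the crate also has to handle special vs. non-special tokens and the full set of normalizers exercised by the benchmark above.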
This could be an opportunity to do some cleaning within our deps, but I don't know the context that well.
Yeah, it's just unmaintained...