
Conversation

@ArthurZucker (Collaborator) commented Nov 27, 2025

8-9x speedup on special tokens and about 4x on the non-special case for deserialization.

Fixes #1635 and supersedes #1782
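
For context, #1635 is about loading (e.g. `Tokenizer.from_file`) becoming extremely slow once a tokenizer carries many added tokens. A minimal reproduction sketch using the Python bindings (the token strings, counts, and file name are illustrative, not taken from the PR):

```python
import time

from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

# Build an empty tokenizer and register a large number of added tokens,
# most of them special, which is the pattern that made loading slow.
tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(
    [AddedToken(f"<extra_{i}>", special=True) for i in range(140_000)]
)
tokenizer.add_tokens([AddedToken(f"[w{i}]") for i in range(10_000)])
tokenizer.save("big_tokenizer.json")

# Time the deserialization path this PR optimizes.
start = time.perf_counter()
Tokenizer.from_file("big_tokenizer.json")
print(f"load: {time.perf_counter() - start:.2f}s")
```

With this PR, the load step above is the part that should get roughly an order of magnitude faster in the special-token-heavy case.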

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator, Author)

@jannisborn ty for testing 😉

ArthurZucker and others added 3 commits November 27, 2025 23:02
Merge branch 'codex/optimize-addedvocabulary-deserialization-and-add-token-metho' of github.com:huggingface/tokenizers into codex/optimize-addedvocabulary-deserialization-and-add-token-metho
@ArthurZucker merged commit d6a4acc into main Nov 27, 2025
30 checks passed
@ArthurZucker deleted the codex/optimize-addedvocabulary-deserialization-and-add-token-metho branch November 27, 2025 22:07

jannisborn commented Nov 28, 2025

Amazing @ArthurZucker, it's still lightspeed-fast: loading a tokenizer with 150K tokens, 140K of which are special tokens, takes 3.5s.
I re-tokenized 200K samples, compared against the previous tip of main, and the output is still identical down to the last token 🚀
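
A check along those lines can be scripted roughly as follows (a sketch only; the file names and the JSON baseline format are assumptions, not jannisborn's actual setup):

```python
import json
import time

from tokenizers import Tokenizer

# Time deserialization of the tokenizer with ~150K added tokens.
start = time.perf_counter()
tokenizer = Tokenizer.from_file("tokenizer.json")
print(f"loaded in {time.perf_counter() - start:.2f}s")

# Compare against token ids recorded with the previous tip of main.
with open("samples.txt", encoding="utf-8") as f:
    samples = f.read().splitlines()
with open("baseline_ids.json") as f:
    baseline = json.load(f)  # one list of ids per sample

for i, (text, expected) in enumerate(zip(samples, baseline)):
    assert tokenizer.encode(text).ids == expected, f"mismatch at sample {i}"
print(f"all {len(samples)} samples identical to baseline")
```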

@ArthurZucker (Collaborator, Author)

Damnnn amazing! 🤗

@ArthurZucker changed the title from "Update serialization" to "Update deserialize of added tokens" Nov 28, 2025
```diff
 with:
   command: audit
-  args: -D warnings -f ./tokenizers/Cargo.lock --ignore RUSTSEC-2024-0436 --ignore RUSTSEC-2025-0014
+  args: -D warnings -f ./tokenizers/Cargo.lock --ignore RUSTSEC-2024-0436 --ignore RUSTSEC-2025-0014 --ignore RUSTSEC-2025-0119
```
Member


This could be an opportunity to do some cleaning of our deps, but I don't know the context that well.

@ArthurZucker (Collaborator, Author)


yeah just unmaintained...

Development

Successfully merging this pull request may close these issues.

Adding many AddedTokens makes loading a tokenizer extremely slow.
