
Conversation

@ArthurZucker
Collaborator

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jannisborn

@ArthurZucker this PR is fantastic! I was facing the same issue of extremely slow loading of tokenizers with many AddedTokens. The PR has been open for half a year now, though; is there any chance it could be merged soon? It would be tremendously helpful. I can try to help make it happen, please let me know.

@ArthurZucker
Collaborator Author

Hey! I am actually not sure it made anything faster, which might be why I did not pursue it further!

@ArthurZucker
Collaborator Author

does it help?

@jannisborn

Hi @ArthurZucker, I'm sure it makes things much faster! I have a tokenizer with 160K tokens, of which 149K are special tokens, and loading it with tokenizers==0.22.1 takes around 3 minutes. With this PR, the same tokenizer loads in about 3 seconds. To verify, I tokenized a test dataset of 200K samples with both versions and the results are identical down to the last token.

It would be amazing to have this included in the next release
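For reference, a rough sketch of how a setup like this can be reproduced and timed. The sizes, token names, and file name below are illustrative assumptions, not my exact tokenizer; the reload via `Tokenizer.from_file` is the deserialization path this PR optimizes.

```python
import time
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Small base vocabulary; the bulk of the vocabulary comes from added special tokens.
base_vocab = {"[UNK]": 0}
base_vocab.update({f"tok_{i}": i + 1 for i in range(11_000)})
tokenizer = Tokenizer(WordLevel(base_vocab, unk_token="[UNK]"))

# Add a large number of special tokens (roughly mirroring the 149K case above).
tokenizer.add_special_tokens([f"<extra_{i}>" for i in range(149_000)])
tokenizer.save("big_tokenizer.json")

# Time the reload; this is where the AddedVocabulary deserialization cost shows up.
start = time.perf_counter()
reloaded = Tokenizer.from_file("big_tokenizer.json")
print(f"load time: {time.perf_counter() - start:.2f}s")

# To compare versions, encode the same samples with each install and diff the ids.
print(reloaded.encode("tok_1 <extra_42> tok_2").ids)
```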

@ArthurZucker
Collaborator Author

OK, I will finish it then! My benchmark might just have needed a bigger vocab!
On it!

