
Conversation

@ArthurZucker (Collaborator) commented Nov 27, 2025

8-9x speedup on special tokens and about 4x on the non-special case for deserialization.

Fixes #1635 and supersedes #1782
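
For context, #1635 is about loading (e.g. `Tokenizer.from_file`) becoming extremely slow once a tokenizer carries many added tokens. A minimal reproduction sketch using the Python bindings (the token strings, counts, and file name are illustrative, not taken from the PR):

```python
import time

from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

# Build an empty tokenizer and register a large number of added tokens,
# most of them special, which is the pattern that made loading slow.
tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(
    [AddedToken(f"<extra_{i}>", special=True) for i in range(140_000)]
)
tokenizer.add_tokens([AddedToken(f"[w{i}]") for i in range(10_000)])
tokenizer.save("big_tokenizer.json")

# Time the deserialization path this PR optimizes.
start = time.perf_counter()
Tokenizer.from_file("big_tokenizer.json")
print(f"load: {time.perf_counter() - start:.2f}s")
```

With this PR, the load step above is the part that should get roughly an order of magnitude faster in the special-token-heavy case.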

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator, Author)

@jannisborn ty for testing 😉

ArthurZucker and others added 3 commits November 27, 2025 23:02
Merge branch 'codex/optimize-addedvocabulary-deserialization-and-add-token-metho' of github.com:huggingface/tokenizers into codex/optimize-addedvocabulary-deserialization-and-add-token-metho
@ArthurZucker merged commit d6a4acc into main Nov 27, 2025
30 checks passed
@ArthurZucker deleted the codex/optimize-addedvocabulary-deserialization-and-add-token-metho branch November 27, 2025 22:07

jannisborn commented Nov 28, 2025

Amazing @ArthurZucker, it's still lightspeed-fast: loading a tokenizer with 150K tokens, 140K of which are special tokens, takes 3.5s.
I re-tokenized 200K samples, compared against the previous tip of main, and the output is still identical down to the last token 🚀
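
A check along those lines can be scripted roughly as follows (a sketch only; the file names and the JSON baseline format are assumptions, not jannisborn's actual setup):

```python
import json
import time

from tokenizers import Tokenizer

# Time deserialization of the tokenizer with ~150K added tokens.
start = time.perf_counter()
tokenizer = Tokenizer.from_file("tokenizer.json")
print(f"loaded in {time.perf_counter() - start:.2f}s")

# Compare against token ids recorded with the previous tip of main.
with open("samples.txt", encoding="utf-8") as f:
    samples = f.read().splitlines()
with open("baseline_ids.json") as f:
    baseline = json.load(f)  # one list of ids per sample

for i, (text, expected) in enumerate(zip(samples, baseline)):
    assert tokenizer.encode(text).ids == expected, f"mismatch at sample {i}"
print(f"all {len(samples)} samples identical to baseline")
```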

@ArthurZucker (Collaborator, Author)

Damnnn amazing! 🤗

@ArthurZucker changed the title from "Update serialization" to "Update deserialize of added tokens" Nov 28, 2025
```diff
 with:
   command: audit
-  args: -D warnings -f ./tokenizers/Cargo.lock --ignore RUSTSEC-2024-0436 --ignore RUSTSEC-2025-0014
+  args: -D warnings -f ./tokenizers/Cargo.lock --ignore RUSTSEC-2024-0436 --ignore RUSTSEC-2025-0014 --ignore RUSTSEC-2025-0119
```
Member


This could be an opportunity to do some cleaning of our deps, but I don't know the context that well.

@ArthurZucker (Collaborator, Author)


yeah just unmaintained...

Development

Successfully merging this pull request may close these issues.

Adding many AddedTokens makes loading a tokenizer extremely slow.
