
Conversation

@ArthurZucker
Collaborator

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jannisborn

@ArthurZucker this PR is fantastic! I was facing the same issue of extremely slow loading of tokenizers with many AddedTokens. The PR has been open for half a year now, though; is there any chance it could be merged soon? It would be tremendously helpful. I can try to help make it happen, please let me know.

@ArthurZucker
Collaborator Author

Hey! I am actually not sure it made anything faster, which might be why I did not pursue it further!

@ArthurZucker
Collaborator Author

does it help?

@jannisborn

Hi @ArthurZucker, I'm sure it makes things much faster! I have a tokenizer with 160K tokens, of which 149K are special tokens, and loading it with tokenizers==0.22.1 takes around 3 minutes. With this PR, the same tokenizer loads in about 3 seconds. To verify, I tokenized a test dataset of 200K samples with both versions and the results are identical down to the last token.

It would be amazing to have this included in the next release
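For reference, a rough sketch of how a setup like this can be reproduced and timed. The sizes, token names, and file name below are illustrative assumptions, not my exact tokenizer; the reload via `Tokenizer.from_file` is the deserialization path this PR optimizes.

```python
import time
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Small base vocabulary; the bulk of the vocabulary comes from added special tokens.
base_vocab = {"[UNK]": 0}
base_vocab.update({f"tok_{i}": i + 1 for i in range(11_000)})
tokenizer = Tokenizer(WordLevel(base_vocab, unk_token="[UNK]"))

# Add a large number of special tokens (roughly mirroring the 149K case above).
tokenizer.add_special_tokens([f"<extra_{i}>" for i in range(149_000)])
tokenizer.save("big_tokenizer.json")

# Time the reload; this is where the AddedVocabulary deserialization cost shows up.
start = time.perf_counter()
reloaded = Tokenizer.from_file("big_tokenizer.json")
print(f"load time: {time.perf_counter() - start:.2f}s")

# To compare versions, encode the same samples with each install and diff the ids.
print(reloaded.encode("tok_1 <extra_42> tok_2").ids)
```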

@ArthurZucker
Collaborator Author

OK, I will finish it then! My benchmark might just have needed a bigger vocab!
On it!

