Add benchmark for deserializing large added vocab + optimizations #1782
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
…on-and-add-token-metho
…token-metho' of github.com:ArthurZucker/tokenizers into codex/optimize-addedvocabulary-deserialization-and-add-token-metho
…dd tokens to the vocab man....
@ArthurZucker this PR is fantastic! I was facing the same issue of extremely slow loading of tokenizers with many AddedTokens. But the PR has been open for half a year; is there any chance it could be merged soon? It would be tremendously helpful. I can try to help make it happen, please let me know.
Hey! I am actually not sure it made anything faster, which might be why I did not pursue it further!
Does it help?
Hi @ArthurZucker, I'm sure it makes things much faster! I have a tokenizer with 160K tokens, of which 149K are special tokens, and loading it is dramatically faster with this change. It would be amazing to have this included in the next release.
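For readers who want to reproduce the scenario described above, here is a minimal sketch using the tokenizers Python bindings. It is not the benchmark added by this PR; the token counts, file name, and the WordLevel base model are illustrative assumptions. The idea is simply to serialize a tokenizer with a very large added vocabulary and time how long deserialization takes.

```python
import time
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Start from a tiny base model; the cost being measured comes from the added vocab.
tokenizer = Tokenizer(WordLevel({"[UNK]": 0}, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Register a large number of special tokens (count mirrors the comment above;
# the "<extra_i>" names are made up for illustration).
tokenizer.add_special_tokens([f"<extra_{i}>" for i in range(149_000)])

# Serialize, then time how long it takes to load the file back, which is the
# deserialization path this PR is about.
tokenizer.save("big_added_vocab.json")

start = time.perf_counter()
Tokenizer.from_file("big_added_vocab.json")
print(f"Loaded in {time.perf_counter() - start:.2f}s")
```

Comparing the printed timing on `main` versus this branch should show whether the added-vocabulary deserialization speedup applies to a given setup.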
Ok, I will finish it then! My benchmark might just have needed a bigger vocab!
…ptimize-addedvocabulary-deserialization-and-add-token-metho
No description provided.