@@ -7,6 +7,9 @@ OpenAI's models.
 import tiktoken
 enc = tiktoken.get_encoding("gpt2")
 assert enc.decode(enc.encode("hello world")) == "hello world"
+
+# To get the tokeniser corresponding to a specific model in the OpenAI API:
+enc = tiktoken.encoding_for_model("text-davinci-003")
 ```
 
 The open source version of `tiktoken` can be installed from PyPI:
@@ -16,7 +19,9 @@ pip install tiktoken
 
 The tokeniser API is documented in `tiktoken/core.py`.
 
-Example code using `tiktoken` can be found in the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+Example code using `tiktoken` can be found in the
+[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
+
 
 ## Performance
 
@@ -28,3 +33,72 @@ Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2Tokeni
 `tokenizers==0.13.2` and `transformers==4.24.0`.
 
 
+## Getting help
+
+Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
+
+If you work at OpenAI, make sure to check the internal documentation or feel free to contact
+@shantanu.
+
+
+## Extending tiktoken
+
+You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
+
+
+**Create your `Encoding` object exactly the way you want and simply pass it around.**
+
+```python
+cl100k_base = tiktoken.get_encoding("cl100k_base")
+
+# In production, load the arguments directly instead of accessing private attributes
+# See openai_public.py for examples of arguments for specific encodings
+enc = tiktoken.Encoding(
+    # If you're changing the set of special tokens, make sure to use a different name
+    # It should be clear from the name what behaviour to expect.
+    name="cl100k_im",
+    pat_str=cl100k_base._pat_str,
+    mergeable_ranks=cl100k_base._mergeable_ranks,
+    special_tokens={
+        **cl100k_base._special_tokens,
+        "<|im_start|>": 100264,
+        "<|im_end|>": 100265,
+    }
+)
+```
+
+**Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
+
+This is only useful if you need `tiktoken.get_encoding` to find your encoding; otherwise, prefer
+option 1.
+
+To do this, you'll need to create a namespace package under `tiktoken_ext`.
+
+Lay out your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
+```
+my_tiktoken_extension
+├── tiktoken_ext
+│   └── my_encodings.py
+└── setup.py
+```
+
+`my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
+This is a dictionary from an encoding name to a function that takes no arguments and returns
+arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
+`tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
+
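A minimal sketch of what such a `my_encodings.py` module might contain is shown below. The encoding name `my_toy_encoding`, the byte-level vocabulary, and the split pattern are hypothetical placeholders; a real plugin would load its own merge ranks and pattern.

```python
# tiktoken_ext/my_encodings.py (hypothetical contents, for illustration only)


def my_toy_encoding():
    """Take no arguments and return keyword arguments for tiktoken.Encoding."""
    return {
        "name": "my_toy_encoding",
        # Simplistic split pattern (assumption), not the one used by OpenAI encodings.
        "pat_str": r"\s+|\S+",
        # Toy vocabulary: one token per byte value, no merges.
        "mergeable_ranks": {bytes([i]): i for i in range(256)},
        "special_tokens": {"<|endoftext|>": 256},
    }


# Maps each encoding name to its zero-argument constructor function.
ENCODING_CONSTRUCTORS = {"my_toy_encoding": my_toy_encoding}
```

Once the package is installed, `tiktoken.get_encoding("my_toy_encoding")` should be able to find this entry through the plugin mechanism.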
+Your `setup.py` should look something like this:
+```python
+from setuptools import setup, find_namespace_packages
+
+setup(
+    name="my_tiktoken_extension",
+    # Match the tiktoken_ext namespace package itself, not only its subpackages
+    packages=find_namespace_packages(include=['tiktoken_ext*']),
+    install_requires=["tiktoken"],
+    ...
+)
+```
+
+Then simply `pip install ./my_tiktoken_extension` and you should be able to use your custom encodings!
+Make sure **not** to use an editable install.
+