You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
GPT-2 and GPT-3 use byte pair encoding to turn text into a series of integers to feed into the model. This is a javascript implementation of OpenAI's original python encoder/decoder which can be found [here](https://github.com/openai/gpt-2)
6
+
7
+
This specific encoder is used in one of my [WordPress plugins](https://coderevolution.ro), to count the number of tokens a string will use when sent to OpenAI API.
8
+
9
+
10
+
## Usage
11
+
12
+
The mbstring PHP extension is needed for this tool to work correctly (in case non-ASCII characters are present in the tokenized text: [details here on how to install mbstring](https://www.php.net/manual/en/mbstring.installation.php)
13
+
14
+
15
+
```php
16
+
17
+
$prompt = "Many words map to one token, but some don't: indivisible. Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾 Sequences of characters commonly found next to each other may be grouped together: 1234567890";
0 commit comments