# GPT-Tokenizer

An implementation of the GPT-3 tokenizer, created by porting the [`GPT-3-Encoder`](https://www.npmjs.com/package/gpt-3-encoder)
JavaScript package to Rust (with the help of ChatGPT-4). You can use it to estimate how many
tokens a prompt will consume. You can also build your own custom `encoding` and
`decoding` functions by providing your own `encoder.json` and `vocab.bpe` files.
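
If you want to supply your own vocabulary instead of the bundled GPT-2/GPT-3 one, the general shape is to read `encoder.json` and `vocab.bpe` from disk and hand them to a tokenizer constructor. The sketch below only shows the file loading; the constructor call is commented out because its exact name and signature are an assumption here, so check the crate documentation for the real API (the `DefaultTokenizer` used in the example further down ships with these files built in). The file paths are placeholders.

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // The same two artifacts that OpenAI publishes for the GPT-2/GPT-3 BPE:
    // a token-to-id map (encoder.json) and the BPE merge ranks (vocab.bpe).
    // Replace the placeholder paths with the location of your own files.
    let encoder_json = fs::read_to_string("path/to/encoder.json")?;
    let vocab_bpe = fs::read_to_string("path/to/vocab.bpe")?;

    // Hypothetical constructor taking the raw file contents; the real API may
    // instead take file paths or pre-parsed structures, so consult the docs.
    // let tokenizer = gpt_tokenizer::Tokenizer::new(&encoder_json, &vocab_bpe);

    println!(
        "loaded {} bytes of encoder.json and {} bytes of vocab.bpe",
        encoder_json.len(),
        vocab_bpe.len()
    );
    Ok(())
}
```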

> As a rule of thumb, OpenAI suggests that 100 tokens correspond to about 75 words.

You can compare its output with the tokenizer published by OpenAI:

[https://platform.openai.com/tokenizer](https://platform.openai.com/tokenizer)

```rust
use gpt_tokenizer::DefaultTokenizer;

fn main() {
    // Uses the bundled GPT-2/GPT-3 `encoder.json` and `vocab.bpe` files.
    let tokenizer = DefaultTokenizer::new();

    let text = r#"Many words map to one token, but some don't: indivisible.

Unicode characters like emojis may be split into many tokens containing the underlying bytes: 🤚🏾

Sequences of characters commonly found next to each other may be grouped together: 1234567890"#;

    // Encode the text into BPE token ids, then decode them back into a string.
    let encoded = &tokenizer.encode(text);
    let decoded = &tokenizer.decode(encoded);

    println!("Original text: {}", text);
    println!("Encoded text: {:#?}", encoded);
    println!("Decoded text: {}", decoded);

    println!("Text size: {}", text.len());
    println!("Words: {}", text.split_whitespace().count());
    // Rule of thumb: ~100 tokens per 75 words, i.e. words * 4 / 3.
    println!("Rule of Thumb: {}", text.split_whitespace().count() * 4 / 3);
    println!("Tokens: {}", encoded.len());
}
```

See the [./examples](./examples) directory for more usage examples.