kitoken 0.11.0

Fast tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
# kitoken

[![Crates.io](https://img.shields.io/crates/v/kitoken)](https://crates.io/crates/kitoken)
[![NPM](https://img.shields.io/npm/v/kitoken)](https://www.npmjs.com/package/kitoken)
[![PyPI](https://img.shields.io/pypi/v/kitoken)](https://pypi.org/project/kitoken)
[![Tests & Checks](https://img.shields.io/github/actions/workflow/status/Systemcluster/kitoken/tests.yml?label=tests%20%26%20checks)](https://github.com/Systemcluster/kitoken/actions/workflows/tests.yml)

**Tokenizer for language models.**

<sup>**Tokenize text for Llama, Gemini, GPT-5, DeepSeek, Mistral and many others; on the web, on the client and on any platform.**</sup>

```rust
use kitoken::Kitoken;
let encoder = Kitoken::from_web("hf:Qwen/Qwen3.5-9B")?;

let tokens = encoder.encode("Your future belongs to me.", true)?;
let string = String::from_utf8(encoder.decode(&tokens, true)?)?;

assert!(string == "Your future belongs to me.");
```

## Overview

Kitoken is a fast and versatile tokenizer for language models compatible with [SentencePiece](https://github.com/google/sentencepiece), [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers), [OpenAI Tiktoken](https://github.com/openai/tiktoken) and [Mistral Tekken](https://docs.mistral.ai/guides/tokenization), supporting BPE, Unigram and WordPiece tokenization.

- **Fast and efficient tokenization**\
  Faster than most other tokenizers in both common and uncommon scenarios; see the [benchmarks](#benchmarks) for comparisons with different datasets.
- **Runs in all environments**\
  Native in Rust and with bindings for [Web](./packages/javascript), [Node](./packages/javascript) and [Python](./packages/python); see [kitoken.dev](https://kitoken.dev) for a web demo.
- **Supports input and output processing**\
  Including unicode-aware normalization, pre-tokenization and post-processing options.
- **Compact data encoding**\
  Definitions are stored in an efficient binary format and without a merge list.

See also [`kitoken-cli`](./packages/cli) for Kitoken on the command line.

## Compatibility

Kitoken can load and convert most existing tokenizer formats. Every supported format is [tested](./tests) against the original implementation across a wide variety of inputs to ensure correctness and compatibility.

> [!NOTE]
> Most models on [Hugging Face](https://huggingface.co) are supported. Just take the `tokenizer.json` or `spiece.model` and load it into Kitoken.

Kitoken aims to be output-identical with existing implementations for all models. <sup>See the notes below for differences in specific cases.</sup>

### SentencePiece

```rust
let encoder = Kitoken::from_file("models/gemma.model")?;
```

Kitoken can convert and initialize with SentencePiece models in `BPE` and `Unigram` format.

- `BPE` models are converted to `BytePair` definitions in character mode. A merge list is generated and sorted using the token scores, which is then used to sort the vocabulary by merge priority. The scores and the merge list are then discarded.
- `Unigram` models are converted to `Unigram` definitions retaining the token scores.
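
The ordering step for `BPE` models can be pictured with a small sketch. This is not Kitoken's conversion code: `Piece` and `sort_by_priority` are hypothetical names, the intermediate merge list is skipped, and the sketch assumes that a higher score means a higher merge priority.

```rust
/// Hypothetical piece record: the token bytes plus the score from the model.
#[derive(Debug, Clone, PartialEq)]
struct Piece {
    bytes: Vec<u8>,
    score: f32,
}

/// Orders pieces so that index 0 has the highest merge priority,
/// then drops the scores and keeps only the token bytes.
/// Assumption: a higher score means the piece is merged earlier.
fn sort_by_priority(mut pieces: Vec<Piece>) -> Vec<Vec<u8>> {
    // Stable sort, descending by score; equal scores keep their input order.
    pieces.sort_by(|a, b| b.score.total_cmp(&a.score));
    pieces.into_iter().map(|piece| piece.bytes).collect()
}
```

After sorting, the position of a token in the returned list is all that is needed to decide which merge wins, which is why the scores no longer have to be stored.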

If the model does not contain a trainer definition, `Unigram` is assumed as the default encoding mode. Normalization options and the unicode normalization scheme are taken from the contained normalizer definition and converted to the respective Kitoken configurations.

<details>
<summary>Notes</summary>

- <sup>SentencePiece uses [different `nfkc` normalization rules in the `nmt_nfkc` and `nmt_nfkc_cf` schemes](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) than during regular `nfkc` normalization, preventing the normalization of `～` to `~`. Kitoken uses the regular `nfkc` normalization rules for `nmt_nfkc` and `nmt_nfkc_cf`.</sup>
- <sup>SentencePiece's implementation of Unigram merges pieces with the same merge priority in a different order depending on preceding non-encodable pieces. Kitoken always merges pieces with the same merge priority in the same order, matching the behavior of Tokenizers.</sup>

</details>

### Tokenizers

```rust
let encoder = Kitoken::from_file("models/llama4.json")?;
```

Kitoken can convert and initialize with HuggingFace Tokenizers definitions for `BPE`, `Unigram` and `WordPiece` models.

- `BPE` models are converted to `BytePair` definitions. The included merge list is used to sort the vocabulary by merge priority and is then discarded.
- `Unigram` models are converted to `Unigram` definitions retaining the token scores.
- `WordPiece` models are converted to `WordPiece` definitions.
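
As a rough illustration of how a merge list can be folded into the vocabulary order, the sketch below assigns ranks to base tokens first and then to each merge result in list order. `ranks_from_merges` is a hypothetical helper, not part of Kitoken's API, and the real conversion may order tokens differently.

```rust
use std::collections::HashMap;

/// Builds a rank table from base tokens and an ordered merge list.
/// A lower rank means a higher merge priority. The merge list is only
/// read here and does not need to be kept afterwards.
fn ranks_from_merges(base: &[&str], merges: &[(&str, &str)]) -> HashMap<Vec<u8>, u32> {
    let mut ranks = HashMap::new();
    let mut next = 0u32;
    // Base tokens come first so that every input can fall back to them.
    for token in base {
        let key = token.as_bytes().to_vec();
        if !ranks.contains_key(&key) {
            ranks.insert(key, next);
            next += 1;
        }
    }
    // Each merge result is ranked by its position in the merge list.
    for (left, right) in merges {
        let key = [left.as_bytes(), right.as_bytes()].concat();
        if !ranks.contains_key(&key) {
            ranks.insert(key, next);
            next += 1;
        }
    }
    ranks
}
```

With base tokens `l`, `o`, `w` and merges `(l, o)`, `(lo, w)`, the token `lo` receives rank 3 and `low` receives rank 4.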

Normalization, pre-tokenization, post-processing and decoding options contained in the definition are converted to the respective Kitoken configurations.

Tokenizers uses some normalization, post-processing and decoding options to convert alternative token-byte representations during encoding and decoding. Kitoken always stores and operates on tokens as byte sequences, and uses these options to pre-normalize the vocabulary during conversion.
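
One common alternative representation is the GPT-2 style byte-level alphabet, where every byte is written as a printable character (for example `Ġ` for a space). The sketch below converts such a token back to raw bytes. It assumes that specific alphabet; `byte_level_char_to_byte` and `normalize_token` are hypothetical names and not Kitoken functions.

```rust
/// Maps one character of the GPT-2 style byte-level alphabet back to the
/// byte it stands for, or returns `None` if it is not part of that alphabet.
fn byte_level_char_to_byte(c: char) -> Option<u8> {
    let cp = c as u32;
    // Printable bytes are represented by the code point with the same value.
    let is_direct = |b: u32| {
        (33..=126).contains(&b) || (161..=172).contains(&b) || (174..=255).contains(&b)
    };
    if is_direct(cp) {
        return Some(cp as u8);
    }
    // All remaining bytes are assigned code points 256, 257, ... in ascending byte order.
    if cp >= 256 {
        let mut next = 256;
        for b in 0u32..=255 {
            if !is_direct(b) {
                if next == cp {
                    return Some(b as u8);
                }
                next += 1;
            }
        }
    }
    None
}

/// Converts a whole byte-level token to raw bytes; `None` if any character is unknown.
fn normalize_token(token: &str) -> Option<Vec<u8>> {
    token.chars().map(byte_level_char_to_byte).collect()
}
```

For example, `normalize_token("Ġhello")` yields the bytes of `" hello"`. Doing this once during conversion means no such mapping is needed while encoding or decoding.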

<details>
<summary>Notes</summary>

- <sup>Tokenizers skips over non-encodable pieces and attempts to merge the surrounding ones when using an incomplete vocabulary without an `unk` token. Kitoken always considers non-encodable pieces as un-mergeable and encodes the surrounding pieces individually. This can affect models that exploit the behavior of Tokenizers with a deliberately restricted vocabulary.</sup>
- <sup>Tokenizers normalizes inputs character-by-character, while Kitoken normalizes inputs as one. This can result in differences during case-folding in some cases. For example, the Greek letter `Σ` has two lowercase forms, `σ` for within-word and `ς` for end-of-word use. Tokenizers will always lowercase `Σ` to `σ`, while Kitoken will lowercase it to either depending on the context.</sup>
- <sup>Tokenizers doesn't merge Metaspace replacement characters in inputs with spaces during encoding. Kitoken merges both as the same, matching the behavior of SentencePiece.</sup>

</details>
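
The case-folding difference described in the notes above can be reproduced with the Rust standard library alone. This is only an illustration of per-character versus whole-string lowercasing, not the code used by either library; both function names are made up for the example.

```rust
/// Lowercases each character on its own, without looking at its neighbours.
fn lowercase_per_char(input: &str) -> String {
    input.chars().flat_map(char::to_lowercase).collect()
}

/// Lowercases the input as a whole, so context-dependent rules can apply.
fn lowercase_whole(input: &str) -> String {
    input.to_lowercase()
}
```

For the input `ΟΔΟΣ`, `lowercase_per_char` returns `οδοσ`, while `lowercase_whole` returns `οδος` with the word-final form of sigma.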

### Tiktoken

```rust
let encoder = Kitoken::from_file("models/o200k_base.tiktoken")?;
```

Tiktoken is a `BPE` tokenizer used by OpenAI for GPT-3 and newer models. It uses `BytePair` tokenization in byte mode.

Tiktoken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids without any additional metadata. Special tokens and the split regex are expected to be provided separately, but will be inferred from the data for common models including GPT-3, GPT-4, GPT-4o, GPT-5 and others including Kimi and Llama 4.
For other models, or depending on the data and requirements, these values can be adjusted manually.
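
To make the file layout concrete, the following sketch parses a single definition line, assuming each line holds a base64 encoded token and its id separated by one space. `decode_base64` and `parse_line` are hypothetical helpers written for this example using only the standard library; they are not Kitoken's loader.

```rust
/// Decodes standard base64 with optional `=` padding; `None` on invalid characters.
fn decode_base64(input: &str) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut buffer = 0u32; // bit accumulator
    let mut bits = 0u32; // number of pending bits in the accumulator
    for c in input.trim_end_matches('=').bytes() {
        let value = match c {
            b'A'..=b'Z' => c - b'A',
            b'a'..=b'z' => c - b'a' + 26,
            b'0'..=b'9' => c - b'0' + 52,
            b'+' => 62,
            b'/' => 63,
            _ => return None,
        };
        buffer = (buffer << 6) | u32::from(value);
        bits += 6;
        if bits >= 8 {
            // Emit the top eight pending bits and keep the remainder.
            bits -= 8;
            out.push((buffer >> bits) as u8);
            buffer &= (1 << bits) - 1;
        }
    }
    Some(out)
}

/// Parses one line of the assumed form `<base64 token> <id>`.
fn parse_line(line: &str) -> Option<(Vec<u8>, u32)> {
    let (token, id) = line.trim().split_once(' ')?;
    Some((decode_base64(token)?, id.parse().ok()?))
}
```

Parsing the line `IQ== 0` with this sketch gives the single byte `!` with id `0`. Special tokens and the split regex are not part of such a line, which is why they have to be inferred or supplied separately.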

### Tekken

```rust
let encoder = Kitoken::from_file("models/mistral.json")?;
```

Tekken is a `BPE` tokenizer based on Tiktoken, used by Mistral for NeMo and newer models. It uses `BytePair` tokenization in byte mode.

Tekken definitions contain a sorted vocabulary of base64 encoded bytes and corresponding token ids, as well as metadata including the split regex and special tokens.

## Performance

Kitoken uses merge-list-free variations of the BPE algorithm and a reversed variation of the Unigram algorithm. The basis for the merge-list-free BPE algorithm was inspired by [Tiktoken](https://github.com/openai/tiktoken), which has similarly good performance characteristics with common tokenization inputs. However, Kitoken can be much faster with inputs that fail to split during pre-tokenization by falling back to a priority-queue-based implementation when optimal.
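
The general idea of merge-list-free BPE can be sketched as follows: instead of consulting a list of merge pairs, the encoder looks up the concatenation of two neighbouring parts in a rank table and merges the pair with the lowest rank. This is a deliberately simple linear-scan version under that assumption, without the priority-queue fallback or any of Kitoken's optimizations; `bpe_encode` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Encodes `input` with rank-based byte pair merging. `ranks` maps token
/// bytes to their rank, where a lower rank merges first.
/// Returns `None` if a remaining part is not in the table.
fn bpe_encode(input: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Option<Vec<u32>> {
    // Start with one part per byte, stored as half-open ranges into `input`.
    let mut parts: Vec<(usize, usize)> = (0..input.len()).map(|i| (i, i + 1)).collect();
    loop {
        // Find the adjacent pair whose concatenation has the lowest rank.
        let mut best: Option<(usize, u32)> = None;
        for i in 0..parts.len().saturating_sub(1) {
            let merged = &input[parts[i].0..parts[i + 1].1];
            if let Some(&rank) = ranks.get(merged) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            // Merge the winning pair into a single part and scan again.
            Some((i, _)) => {
                parts[i].1 = parts[i + 1].1;
                parts.remove(i + 1);
            }
            // No mergeable pair is left.
            None => break,
        }
    }
    // Look up the id of every remaining part.
    parts.iter().map(|&(start, end)| ranks.get(&input[start..end]).copied()).collect()
}
```

The repeated scan makes this version quadratic in the worst case, which is the situation a priority queue is meant to avoid for long inputs that are not split during pre-tokenization.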

The core tokenization functions are optimized for multiple CPU architectures and make use of SIMD instructions where available. Kitoken also avoids memory allocations and copying of data to a great extent: most operations are performed in-place, and buffers are reused where possible.

### Benchmarks

Benchmarks were performed on a MacBook Pro M4 Max using each library's Python bindings with [tokenizer-bench](https://github.com/Systemcluster/tokenizer-bench).

#### Gemma 3

<img src="./benches/encode_gemma3_time.svg" width="100%" alt="Encoding Benchmark: Gemma 3, time in seconds, 1000 iterations"/>

<details>
<summary>Encoding throughput</summary>
<img src="./benches/encode_gemma3_throughput.svg" width="100%" alt="Encoding Benchmark: Gemma 3, throughput in MB/s"/>
</details>

#### Llama 4

<img src="./benches/encode_llama4_time.svg" width="100%" alt="Encoding Benchmark: Llama 4, time in seconds, 1000 iterations"/>

<details>
<summary>Encoding throughput</summary>
<img src="./benches/encode_llama4_throughput.svg" width="100%" alt="Encoding Benchmark: Llama 4, throughput in MB/s"/>
</details>

#### Datasets

- **Pride and Prejudice**: A text document containing *Pride and Prejudice* by Jane Austen. This data is a good representation of common English-language inputs containing a mix of short and long paragraphs.

- **UTF-8 Sequence**: A text document containing a single-line UTF-8 sequence. This data is a good representation of inputs that stress pre-tokenization.

- **Wagahai**: A text document containing *Wagahai wa Neko de Aru* by Natsume Sōseki. This data is a good representation of Japanese-language inputs containing many long paragraphs.