
A high-performance BPE tokenizer built in Rust with Python bindings, focused on speed, safety, and resource optimization.
## The Problem
Tokenization is everywhere in modern AI. Whether you're building LLM applications, training models, or processing data pipelines, you're tokenizing text constantly. But existing tokenizers have a problem: they're slow.
When you need to tokenize batches of prompts, documents, or training data, you're stuck waiting. Python-based tokenizers can't fully leverage modern multi-core CPUs. You need something faster.
## The Solution
Splintr brings Rust performance to Python. Built from the ground up for speed and efficiency:

| Batch size | Splintr | Tiktoken | HuggingFace | TokenDagger |
|---|---|---|---|---|
| 1,000 texts | 111 MB/s | 9 MB/s | 28 MB/s | 9 MB/s |
| 500 texts | 107 MB/s | 10 MB/s | 27 MB/s | 8 MB/s |
| 100 texts | 69 MB/s | 7 MB/s | 20 MB/s | 6 MB/s |
10-12x faster than tiktoken. 4x faster than HuggingFace. Built in Rust, accessible from Python.
## Quick Start

### Python
```python
from splintr import Tokenizer

# Load a pretrained vocabulary (OpenAI)
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Or load Llama 3 tokenizer (Meta) - supports all versions up to Llama 3.3
# tokenizer = Tokenizer.from_pretrained("llama3")

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
# [9906, 11, 1917, 0]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
# "Hello, world!"

# Batch encode multiple texts in parallel (this is where it shines)
texts = ["Hello, world!", "How are you?", "Splintr is fast."]
batch_tokens = tokenizer.encode_batch(texts)
# [[9906, 11, 1917, 0], [4438, 527, 499, 30], ...]
```
### Rust
```toml
[dependencies]
splintr = "0.4.0"
```

```rust
use splintr::{load_tiktoken_bpe_file, Tokenizer, CL100K_BASE_PATTERN};
use rustc_hash::FxHashMap;

// NOTE: exact import paths and constructor arguments may differ; see the Rust API docs
// Load vocabulary and create tokenizer
let encoder = load_tiktoken_bpe_file("cl100k_base.tiktoken")?;
let special_tokens = FxHashMap::default();
let tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;

// Encode text
let tokens = tokenizer.encode("Hello, world!");
println!("{:?}", tokens);

// Batch encode
let texts = vec!["Hello, world!".to_string(), "How are you?".to_string()];
let batch_tokens = tokenizer.encode_batch(&texts);
```
## Key Features
Performance where it matters:
- 12x faster batch encoding - Parallel processing across multiple texts using Rayon
- 3-4x faster single text encoding - Optimized sequential algorithm for typical use cases
- Smart parallelization - Sequential for small texts (<1MB), parallel for large datasets
- LRU caching - Avoid redundant encoding of frequently seen text chunks
Built for production:
- Compatible vocabularies - Supports cl100k_base, o200k_base (OpenAI), and Llama 3 family (Meta), with a familiar API
- Streaming decoder - Real-time LLM output display with proper UTF-8 handling
- 54 agent tokens - Built-in support for chat, CoT reasoning, ReAct agents, tool calling, RAG citations
- Battle-tested algorithms - PCRE2 with JIT, Aho-Corasick for special tokens, linked-list BPE
Cross-platform:
- Python bindings via PyO3 (Linux, macOS, Windows)
- Native Rust library for maximum performance
## Performance Deep Dive
All benchmarks performed on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing against tiktoken (reference Python implementation), Hugging Face tokenizers, and TokenDagger.
### Single Text Encoding
For single texts, splintr achieves 3-4x faster encoding across various text sizes:

Latency by content type:

Consistent low latency across Python code, JSON, English prose, and Chinese text makes splintr ideal for interactive applications and real-time processing.
### Batch Encoding
The real magic happens with batches. Splintr parallelizes across texts to achieve 10-12x speedup:

Higher speedups on larger batches where parallelization overhead is amortized. Perfect for:
- Training data preprocessing
- Bulk document tokenization
- API batch processing
- Data pipeline throughput
### Design Decision: Sequential by Default
Splintr uses sequential encoding for single texts and parallel encoding across batches based on empirical benchmarking:

Key findings:
- Sequential is faster for texts up to ~1MB (typical LLM prompts and documents)
- Rayon's parallelization overhead only pays off at ~1MB+ text sizes
- Most real-world inputs are well under 1MB
- `encode()` uses sequential processing for optimal single-text performance
- `encode_batch()` parallelizes across multiple texts for maximum throughput
- `encode_rayon()` available for the rare cases where you have >1MB single texts
This architecture ensures splintr is optimized for the most common tokenization patterns in LLM applications.
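In practice this maps to a simple selection rule. The sketch below uses the documented methods; `load_texts()` is a hypothetical stand-in for your own data loading:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
texts = load_texts()  # hypothetical helper returning list[str]

if len(texts) > 1:
    ids = tokenizer.encode_batch(texts)      # parallel across texts
elif len(texts[0]) > 1_000_000:              # a single very large (~1MB+) text
    ids = tokenizer.encode_rayon(texts[0])   # parallel within that text
else:
    ids = tokenizer.encode(texts[0])         # sequential: lowest overhead
```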
### Running Benchmarks Yourself
```bash
# Clone and install

# Run the benchmark suite

# View results
```
The benchmark suite tests single text encoding, batch encoding, streaming decoder performance, and special token handling across various content types.
## Streaming Decoder
The streaming decoder is essential for real-time LLM applications where tokens arrive one at a time:
```python
# Create a streaming decoder (constructor name is illustrative; see the API reference)
decoder = tokenizer.streaming_decoder()

# Process tokens one at a time (typical LLM streaming scenario);
# add_token() returns text only when complete UTF-8 characters are available
for token_id in token_ids:  # token IDs arriving from your model
    if (text := decoder.add_token(token_id)) is not None:
        print(text, end="", flush=True)

# Flush any remaining buffered bytes at the end
print(decoder.flush(), end="")
```
### Why You Need This
BPE tokens don't align with UTF-8 character boundaries. A multi-byte Unicode character like "世" (3 bytes: 0xE4 0xB8 0x96) might split across tokens. The streaming decoder:
- Buffers incomplete byte sequences across token boundaries
- Only outputs text when complete UTF-8 characters are available
- Prevents display corruption in streaming LLM output
- Handles edge cases automatically
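A minimal sketch of that buffering behavior (the decoder constructor name is illustrative, as in the example above, and whether a particular character actually splits across tokens depends on the vocabulary):

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()  # illustrative constructor; see the API reference

for token_id in tokenizer.encode("Hello, 世界"):
    piece = decoder.add_token(token_id)
    if piece is None:
        # The token ended mid-character: its bytes stay buffered until the character completes
        print(f"(buffering {decoder.pending_bytes} byte(s))")
    else:
        print(repr(piece))

print(repr(decoder.flush()))  # empty string if nothing was left buffered
```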
### Real-World Example
```python
tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()  # illustrative constructor; see the API reference

# Stream tokens from OpenAI API
stream = client.chat.completions.create(model="gpt-4o", messages=messages, stream=True)

# Process each token as it arrives
for token_id in stream:  # pseudo-code: assume the stream yields token IDs
    if (text := decoder.add_token(token_id)) is not None:
        print(text, end="", flush=True)

# Don't forget to flush at the end
print(decoder.flush(), end="")
```
### API Methods
Core operations:
- `add_token(token_id: int) -> str | None`: Add a token, return complete characters or `None` if buffering
- `add_tokens(token_ids: list[int]) -> str | None`: Add multiple tokens at once
- `flush() -> str`: Flush buffered bytes (incomplete sequences become �)
- `reset()`: Clear the buffer and start fresh
Properties:
- `has_pending: bool`: Whether there are buffered bytes waiting
- `pending_bytes: int`: Number of bytes currently buffered
## API Reference

### Python API

#### Tokenizer
Loading:
```python
# Load pretrained model (includes vocabulary and special tokens)
tokenizer = Tokenizer.from_pretrained("cl100k_base")  # or "o200k_base", "llama3"

# Load from a custom vocabulary file (see the loader functions in the API docs)
```
Encoding:
- `encode(text: str) -> list[int]`: Encode text to token IDs (sequential, optimal for most use cases)
- `encode_with_special(text: str) -> list[int]`: Encode text, recognizing special tokens in the input
- `encode_batch(texts: list[str]) -> list[list[int]]`: Encode multiple texts in parallel (uses Rayon)
- `encode_rayon(text: str) -> list[int]`: Encode using Rayon parallelization (only beneficial for texts >1MB)
Decoding:
- `decode(tokens: list[int]) -> str`: Decode token IDs to text (raises error on invalid UTF-8)
- `decode_bytes(tokens: list[int]) -> bytes`: Decode token IDs to raw bytes
- `decode_lossy(tokens: list[int]) -> str`: Decode token IDs, replacing invalid UTF-8 with �
Properties:
- `vocab_size: int`: Total vocabulary size including special tokens
- `cache_len: int`: Number of entries in the LRU cache
Cache management:
- `clear_cache()`: Clear the encoding cache
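Putting the properties and cache management together, a small usage sketch:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
print(tokenizer.vocab_size)   # total vocabulary size, including special tokens

tokenizer.encode("the same chunk of text, encoded repeatedly")
print(tokenizer.cache_len)    # entries currently held in the LRU cache

tokenizer.clear_cache()       # e.g. between unrelated workloads
print(tokenizer.cache_len)    # 0
```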
### Rust API
The Rust API provides similar functionality with strongly-typed interfaces:
Encoding:
- `encode(&self, text: &str) -> Vec<u32>`: Sequential encoding (optimal for texts <1MB)
- `encode_with_special(&self, text: &str) -> Vec<u32>`: Encode with special token recognition
- `encode_batch(&self, texts: &[String]) -> Vec<Vec<u32>>`: Parallel encoding across texts
- `encode_rayon(&self, text: &str) -> Vec<u32>`: Parallel encoding within text (for texts >1MB)
Decoding:
- `decode(&self, tokens: &[u32]) -> Result<String, TokenizerError>`: Decode to UTF-8 string
- `decode_bytes(&self, tokens: &[u32]) -> Vec<u8>`: Decode to raw bytes
- `decode_lossy(&self, tokens: &[u32]) -> String`: Decode with replacement for invalid UTF-8
See the API documentation for complete details.
## Supported Vocabularies
| Vocabulary | Used By | Vocabulary Size | Special Tokens | Import Constant |
|---|---|---|---|---|
| `cl100k_base` | GPT-4, GPT-3.5-turbo | ~100,000 | 5 + 54 agent | `CL100K_BASE_PATTERN` |
| `o200k_base` | GPT-4o | ~200,000 | 2 + 54 agent | `O200K_BASE_PATTERN` |
| `llama3` | Llama 3, 3.1, 3.2, 3.3 (Meta) | ~128,000 | 11 + 54 agent | `LLAMA3_PATTERN` |
OpenAI standard tokens:
- `cl100k_base`: `<|endoftext|>`, `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`, `<|endofprompt|>`
- `o200k_base`: `<|endoftext|>`, `<|endofprompt|>`
Meta Llama 3 standard tokens:
- `llama3`: `<|begin_of_text|>`, `<|end_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`, `<|eot_id|>`, `<|eom_id|>` (3.1+), `<|python_tag|>` (3.1+), `<|step_id|>` (3.2-Vision), `<|image|>` (3.2-Vision)
### Agent Tokens (54 per model)
Splintr extends all vocabularies with tokens for building agent systems. See docs/special_tokens.md for complete documentation.
```python
from splintr import Tokenizer

# OpenAI models
tokenizer = Tokenizer.from_pretrained("cl100k_base")
# Agent tokens occupy IDs above the base vocabulary (e.g. 100282, 100292);
# the exact token strings and IDs are listed in docs/special_tokens.md
print(tokenizer.encode_with_special("<|im_start|>system<|im_end|>"))

# Llama 3 models (vocabulary includes all special tokens up to Llama 3.3)
tokenizer = Tokenizer.from_pretrained("llama3")
# Agent tokens (e.g. 128305, 128315) sit alongside the official Meta tokens:
# <|begin_of_text|> = 128000, and <|image|> = 128256 in 3.2-Vision
print(tokenizer.encode_with_special("<|begin_of_text|>Hello, world!"))
```
| Category | Tokens | Purpose |
|---|---|---|
| Conversation | system, user, assistant, im_start, im_end | ChatML format |
| Thinking | think | Chain-of-Thought reasoning |
| ReAct | plan, step, act, observe | Agent action loops |
| Tools | function, result, error | Function calling |
| Code | code, output, lang | Code execution |
| RAG | context, quote, cite, source | Citations |
| Memory | memory, recall | State persistence |
| Control | pad, stop, sep | Sequence control |
| Multimodal | image, audio, video | Non-text content |
| Document | title, section, summary | Structured docs |
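As an illustration, a ReAct-style prompt can be assembled from these categories and encoded so that each marker becomes a single token. The `<|...|>` spellings below are assumed for the example; docs/special_tokens.md lists the exact strings:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

prompt = (
    "<|im_start|>system You are a research agent.<|im_end|>"
    "<|im_start|>user Find the 2023 GDP of France.<|im_end|>"
    "<|plan|>Search, then cite the source."
    "<|act|>search('France GDP 2023')"
)

# encode_with_special() maps each recognized marker to a single token ID
# instead of splitting it into ordinary BPE pieces
ids = tokenizer.encode_with_special(prompt)
print(len(ids), "tokens")
```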
## How It Works
Splintr implements several optimizations that make tokenization faster:
- PCRE2 with JIT compilation: 2-4x speedup on regex pattern matching
- Rayon parallelism: Leverages multiple CPU cores for batch encoding
- Linked-list BPE algorithm: Avoids O(N²) complexity on pathological inputs
- FxHashMap: Faster lookups than default SipHash for non-adversarial contexts
- Aho-Corasick for special tokens: Fast multi-pattern matching without regex alternation
- LRU cache: Avoids redundant BPE encoding of frequently seen chunks
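The chunk-level LRU cache is the easiest of these to picture. Conceptually (this is an illustrative sketch, not splintr's actual implementation) it works like this:

```python
from collections import OrderedDict
from typing import Callable

class ChunkCache:
    """Conceptual chunk-level LRU cache: BPE-encode each regex-split chunk once."""

    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._cache: OrderedDict[str, list[int]] = OrderedDict()

    def get_or_encode(self, chunk: str, encode_fn: Callable[[str], list[int]]) -> list[int]:
        if chunk in self._cache:
            self._cache.move_to_end(chunk)     # mark as most recently used
            return self._cache[chunk]
        ids = encode_fn(chunk)                 # the expensive BPE merge loop
        self._cache[chunk] = ids
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)    # evict the least recently used chunk
        return ids
```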
## Use Cases
LLM Applications:
- Tokenizing prompts with 3-4x lower latency
- Streaming decoder for real-time output display
- Token counting for API cost estimation
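Token counting for cost estimation, for example, is a one-liner once the tokenizer is loaded:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("o200k_base")  # pick the vocabulary matching your model
prompt = "Summarize the following report in three bullet points: ..."
print(len(tokenizer.encode(prompt)), "prompt tokens")
```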
Agent Systems:
- Building ReAct agents with structured reasoning tokens
- Tool-calling systems with function tokens
- Chain-of-Thought reasoning with thinking tokens
Training Pipelines:
- Fast batch encoding of large datasets (10-12x speedup)
- Preprocessing millions of documents efficiently
- Parallel tokenization across distributed systems
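A typical preprocessing step reduces to one parallel call (the corpus location below is illustrative):

```python
from pathlib import Path
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("llama3")

docs = [p.read_text() for p in Path("corpus/").glob("*.txt")]  # illustrative corpus location
token_ids = tokenizer.encode_batch(docs)                       # parallelized across documents
print(sum(len(ids) for ids in token_ids), "tokens across", len(docs), "documents")
```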
RAG Applications:
- Structured context injection with citation tokens
- Document chunking with section markers
- Source tracking through tokenization
Data Processing:
- Bulk document tokenization
- Multi-language text processing
- Real-time text preprocessing
## Contributing
Contributions are welcome! Here's how you can help:
- Report bugs: Open an issue with a minimal reproduction case
- Suggest features: Describe your use case and why the feature would be helpful
- Submit pull requests:
  - Add tests for new functionality
  - Run `cargo test` and `cargo clippy` before submitting
  - Update documentation as needed
### Development Setup
```bash
# Clone the repository

# Install pre-commit hook (recommended)

# Build the Rust library

# Build Python bindings

# Run tests
```
The pre-commit hook automatically runs formatting, clippy, and tests before each commit.
## Acknowledgments
Splintr builds upon concepts from:
- tiktoken - OpenAI's reference BPE tokenizer
- tokenizers - Hugging Face's tokenization library
The performance optimizations are informed by profiling real-world usage patterns in LLM applications.
## Citation
If you use Splintr in your research, please cite: