![Splintr](images/splntr.png)

[![Crates.io](https://img.shields.io/crates/v/splintr.svg)](https://crates.io/crates/splintr) [![PyPI](https://img.shields.io/pypi/v/splintr-rs.svg)](https://pypi.org/project/splintr-rs/) [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

**A high-performance BPE tokenizer written in Rust with Python bindings, focused on speed, safety, and resource efficiency.**

## The Problem

Tokenization is everywhere in modern AI. Whether you're building LLM applications, training models, or processing data pipelines, you're tokenizing text constantly. But existing tokenizers have a problem: they're slow.

When you need to tokenize batches of prompts, documents, or training data, you're stuck waiting. Python-based tokenizers can't fully leverage modern multi-core CPUs. You need something faster.

## The Solution

Splintr brings Rust performance to Python. Built from the ground up for speed and efficiency:

![Batch Encoding Throughput](images/benchmark_batch.png)

| Batch size    | Splintr      | Tiktoken | HuggingFace | TokenDagger |
| ------------- | ------------ | -------- | ----------- | ----------- |
| 1,000 texts   | **111 MB/s** | 9 MB/s   | 28 MB/s     | 9 MB/s      |
| 500 texts     | **107 MB/s** | 10 MB/s  | 27 MB/s     | 8 MB/s      |
| 100 texts     | **69 MB/s**  | 7 MB/s   | 20 MB/s     | 6 MB/s      |

**10-12x faster than tiktoken. 4x faster than HuggingFace. Built in Rust, accessible from Python.**

## Quick Start

### Python

```bash
pip install splintr-rs
```

```python
from splintr import Tokenizer

# Load a pretrained vocabulary (OpenAI)
tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Or load Llama 3 tokenizer (Meta) - supports all versions up to Llama 3.3
# tokenizer = Tokenizer.from_pretrained("llama3")

# Encode text to token IDs
tokens = tokenizer.encode("Hello, world!")
print(tokens)  # [9906, 11, 1917, 0]

# Decode token IDs back to text
text = tokenizer.decode(tokens)
print(text)  # "Hello, world!"

# Batch encode multiple texts in parallel (this is where it shines)
texts = ["Hello, world!", "How are you?", "Machine learning is fun!"]
batch_tokens = tokenizer.encode_batch(texts)
print(batch_tokens)  # [[9906, 11, 1917, 0], [4438, 527, 499, 30], ...]
```

### Rust

```toml
[dependencies]
splintr = "0.4.0"
```

```rust
use splintr::{Tokenizer, CL100K_BASE_PATTERN};
use rustc_hash::FxHashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load vocabulary and create the tokenizer
    // (load_tiktoken_bpe_file reads a .tiktoken vocabulary file)
    let encoder = load_tiktoken_bpe_file("cl100k_base.tiktoken")?;
    let special_tokens = FxHashMap::default();
    let tokenizer = Tokenizer::new(encoder, special_tokens, CL100K_BASE_PATTERN)?;

    // Encode text
    let tokens = tokenizer.encode("Hello, world!");
    println!("{:?}", tokens);

    // Batch encode
    let texts = vec!["Hello".to_string(), "World".to_string()];
    let batch_tokens = tokenizer.encode_batch(&texts);
    println!("{:?}", batch_tokens);

    Ok(())
}
```

## Key Features

**Performance where it matters:**

- **12x faster batch encoding** - Parallel processing across multiple texts using Rayon
- **3-4x faster single text encoding** - Optimized sequential algorithm for typical use cases
- **Smart parallelization** - Sequential for small texts (<1MB), parallel for large datasets
- **LRU caching** - Avoid redundant encoding of frequently seen text chunks

**Built for production:**

- **Compatible vocabularies** - Supports cl100k_base, o200k_base (OpenAI), and the Llama 3 family (Meta), with a familiar API
- **Streaming decoder** - Real-time LLM output display with proper UTF-8 handling
- **54 agent tokens** - Built-in support for chat, CoT reasoning, ReAct agents, tool calling, RAG citations
- **Battle-tested algorithms** - PCRE2 with JIT, Aho-Corasick for special tokens, linked-list BPE

**Cross-platform:**

- Python bindings via PyO3 (Linux, macOS, Windows)
- Native Rust library for maximum performance

## Performance Deep Dive

All benchmarks were performed on Linux (6.16.8-arch3-1) with 24 CPU cores, comparing splintr against tiktoken (OpenAI's reference tokenizer), Hugging Face tokenizers, and TokenDagger.

### Single Text Encoding

For single texts, splintr achieves **3-4x faster** encoding across various text sizes:

![Single Text Encoding Comparison](images/benchmark_single.png)

**Latency by content type:**

![Latency Comparison](images/benchmark_single_latency.png)

Consistent low latency across Python code, JSON, English prose, and Chinese text makes splintr ideal for interactive applications and real-time processing.

### Batch Encoding

The real magic happens with batches. Splintr parallelizes across texts to achieve **10-12x speedup**:

![Batch Speedup vs Tiktoken](images/benchmark_batch_speedup.png)

Speedups are higher on larger batches, where the parallelization overhead is amortized. Perfect for:

- Training data preprocessing
- Bulk document tokenization
- API batch processing
- Data pipeline throughput

### Design Decision: Sequential by Default

Splintr uses **sequential encoding for single texts** and **parallel encoding across batches** based on empirical benchmarking:

![Sequential vs Rayon Internal Parallelization](images/benchmark_splintr.png)

**Key findings:**

- Sequential is faster for texts up to ~1MB (typical LLM prompts and documents)
- Rayon's parallelization overhead only pays off at ~1MB+ text sizes
- Most real-world inputs are well under 1MB
- `encode()` uses sequential processing for optimal single-text performance
- `encode_batch()` parallelizes across multiple texts for maximum throughput
- `encode_rayon()` available for the rare cases where you have >1MB single texts

This architecture ensures splintr is optimized for the most common tokenization patterns in LLM applications.
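
As a rough illustration of how these entry points fit together, here is a small sketch using the methods described in the API reference below:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

# Typical prompt-sized input: encode() takes the sequential path
prompt_tokens = tokenizer.encode("Summarize the following document...")

# Many independent texts: encode_batch() parallelizes across them with Rayon
docs = [f"document {i}" for i in range(1_000)]
batch_tokens = tokenizer.encode_batch(docs)

# A single very large text (roughly >1MB): encode_rayon() parallelizes within it
huge_text = "lorem ipsum " * 200_000
huge_tokens = tokenizer.encode_rayon(huge_text)
```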

### Running Benchmarks Yourself

```bash
# Clone and install
git clone https://github.com/farhan-syah/splintr.git
cd splintr
pip install -e .
pip install tiktoken

# Run the benchmark suite
cd benchmarks
python benchmark.py --model cl100k_base --output results/my_benchmark.json

# View results
cat results/my_benchmark.md
```

The benchmark suite tests single text encoding, batch encoding, streaming decoder performance, and special token handling across various content types.

## Streaming Decoder

The streaming decoder is essential for real-time LLM applications where tokens arrive one at a time:

```python
# Create a streaming decoder
decoder = tokenizer.streaming_decoder()

# Process tokens one at a time (typical LLM streaming scenario)
for token_id in token_stream:
    # Returns text only when complete UTF-8 characters are available
    if text := decoder.add_token(token_id):
        print(text, end="", flush=True)

# Flush any remaining buffered bytes at the end
print(decoder.flush())
```

### Why You Need This

BPE tokens don't align with UTF-8 character boundaries. A multi-byte Unicode character like "世" (3 bytes: `0xE4 0xB8 0x96`) might split across tokens. The streaming decoder:

1. Buffers incomplete byte sequences across token boundaries
2. Only outputs text when complete UTF-8 characters are available
3. Prevents display corruption in streaming LLM output
4. Handles edge cases automatically
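
A minimal sketch of this buffering behavior, replaying the tokens of a string that contains multi-byte characters one at a time:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()

tokens = tokenizer.encode("Hello 世界")

pieces = []
for token_id in tokens:
    text = decoder.add_token(token_id)
    # add_token() returns None while a character's bytes are still incomplete
    if text is not None:
        pieces.append(text)

pieces.append(decoder.flush())
assert "".join(pieces) == "Hello 世界"
```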

### Real-World Example

```python
import openai
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")
decoder = tokenizer.streaming_decoder()

# Stream tokens from OpenAI API
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        # Process each token as it arrives
        token_ids = get_token_ids(chunk)  # pseudo-code
        for token_id in token_ids:
            if text := decoder.add_token(token_id):
                print(text, end="", flush=True)

# Don't forget to flush at the end
print(decoder.flush())
```

### API Methods

**Core operations:**

- `add_token(token_id: int) -> str | None`: Add a token, return complete characters or None if buffering
- `add_tokens(token_ids: list[int]) -> str | None`: Add multiple tokens at once
- `flush() -> str`: Flush buffered bytes (incomplete sequences become �)
- `reset()`: Clear the buffer and start fresh

**Properties:**

- `has_pending: bool`: Whether there are buffered bytes waiting
- `pending_bytes: int`: Number of bytes currently buffered
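
For example, assuming a `tokenizer` and a list of `token_ids` are already in scope, a caller can check for buffered bytes before flushing:

```python
decoder = tokenizer.streaming_decoder()

for token_id in token_ids:
    if text := decoder.add_token(token_id):
        print(text, end="", flush=True)

# If the stream ended mid-character, some bytes are still buffered
if decoder.has_pending:
    print(f"\n[{decoder.pending_bytes} byte(s) buffered]")
    print(decoder.flush())  # incomplete sequences become �

decoder.reset()  # ready for the next stream
```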

## API Reference

### Python API

#### Tokenizer

**Loading:**

```python
# Load pretrained model (includes vocabulary and special tokens)
tokenizer = Tokenizer.from_pretrained("cl100k_base")  # or "o200k_base", "llama3"

# Load from custom vocabulary file
tokenizer = Tokenizer(
    vocab_path="path/to/vocab.tiktoken",
    pattern=CL100K_BASE_PATTERN,
    special_tokens={"<|endoftext|>": 100257}
)
```

**Encoding:**

- `encode(text: str) -> list[int]`: Encode text to token IDs (sequential, optimal for most use cases)
- `encode_with_special(text: str) -> list[int]`: Encode text, recognizing special tokens in the input
- `encode_batch(texts: list[str]) -> list[list[int]]`: Encode multiple texts in parallel (uses Rayon)
- `encode_rayon(text: str) -> list[int]`: Encode using Rayon parallelization (only beneficial for texts >1MB)

**Decoding:**

- `decode(tokens: list[int]) -> str`: Decode token IDs to text (raises error on invalid UTF-8)
- `decode_bytes(tokens: list[int]) -> bytes`: Decode token IDs to raw bytes
- `decode_lossy(tokens: list[int]) -> str`: Decode token IDs, replacing invalid UTF-8 with �

**Properties:**

- `vocab_size: int`: Total vocabulary size including special tokens
- `cache_len: int`: Number of entries in the LRU cache

**Cache management:**

- `clear_cache()`: Clear the encoding cache
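
A short sketch tying these together, using the methods and properties listed above:

```python
tokens = tokenizer.encode("café ☕")

print(tokenizer.decode(tokens))        # "café ☕" (raises on invalid UTF-8)
print(tokenizer.decode_bytes(tokens))  # b'caf\xc3\xa9 \xe2\x98\x95'
print(tokenizer.decode_lossy(tokens))  # never raises; invalid bytes become �

print(tokenizer.vocab_size)  # vocabulary size including special tokens
print(tokenizer.cache_len)   # entries currently in the LRU cache
tokenizer.clear_cache()
```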

### Rust API

The Rust API provides similar functionality with strongly-typed interfaces:

**Encoding:**

- `encode(&self, text: &str) -> Vec<u32>`: Sequential encoding (optimal for texts <1MB)
- `encode_with_special(&self, text: &str) -> Vec<u32>`: Encode with special token recognition
- `encode_batch(&self, texts: &[String]) -> Vec<Vec<u32>>`: Parallel encoding across texts
- `encode_rayon(&self, text: &str) -> Vec<u32>`: Parallel encoding within text (for texts >1MB)

**Decoding:**

- `decode(&self, tokens: &[u32]) -> Result<String, TokenizerError>`: Decode to UTF-8 string
- `decode_bytes(&self, tokens: &[u32]) -> Vec<u8>`: Decode to raw bytes
- `decode_lossy(&self, tokens: &[u32]) -> String`: Decode with replacement for invalid UTF-8

See the [API documentation](https://docs.rs/splintr) for complete details.

## Supported Vocabularies

| Vocabulary    | Used By                       | Vocabulary Size | Special Tokens | Import Constant       |
| ------------- | ----------------------------- | --------------- | -------------- | --------------------- |
| `cl100k_base` | GPT-4, GPT-3.5-turbo          | ~100,000        | 5 + 54 agent   | `CL100K_BASE_PATTERN` |
| `o200k_base`  | GPT-4o                        | ~200,000        | 2 + 54 agent   | `O200K_BASE_PATTERN`  |
| `llama3`      | Llama 3, 3.1, 3.2, 3.3 (Meta) | ~128,000        | 11 + 54 agent  | `LLAMA3_PATTERN`      |

**OpenAI standard tokens:**

- **cl100k_base**: `<|endoftext|>`, `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`, `<|endofprompt|>`
- **o200k_base**: `<|endoftext|>`, `<|endofprompt|>`

**Meta Llama 3 standard tokens:**

- **llama3**: `<|begin_of_text|>`, `<|end_of_text|>`, `<|start_header_id|>`, `<|end_header_id|>`, `<|eot_id|>`, `<|eom_id|>` (3.1+), `<|python_tag|>` (3.1+), `<|step_id|>` (3.2-Vision), `<|image|>` (3.2-Vision)

### Agent Tokens (54 per model)

Splintr extends all vocabularies with tokens for building agent systems. See [docs/special_tokens.md](docs/special_tokens.md) for complete documentation.

```python
from splintr import Tokenizer, CL100K_AGENT_TOKENS, LLAMA3_AGENT_TOKENS

# OpenAI models
tokenizer = Tokenizer.from_pretrained("cl100k_base")
text = "<|think|>Let me reason...<|/think|>The answer is 42."
tokens = tokenizer.encode_with_special(text)
print(CL100K_AGENT_TOKENS.THINK)      # 100282
print(CL100K_AGENT_TOKENS.FUNCTION)   # 100292

# Llama 3 models (vocabulary includes all special tokens up to Llama 3.3)
tokenizer = Tokenizer.from_pretrained("llama3")
tokens = tokenizer.encode_with_special(text)
print(LLAMA3_AGENT_TOKENS.THINK)           # 128305
print(LLAMA3_AGENT_TOKENS.FUNCTION)        # 128315
print(LLAMA3_AGENT_TOKENS.BEGIN_OF_TEXT)   # 128000 (official Meta token)
print(LLAMA3_AGENT_TOKENS.IMAGE)           # 128256 (official Meta 3.2-Vision token)
```

| Category     | Tokens                                              | Purpose                    |
| ------------ | --------------------------------------------------- | -------------------------- |
| Conversation | `system`, `user`, `assistant`, `im_start`, `im_end` | ChatML format              |
| Thinking     | `think`                                             | Chain-of-Thought reasoning |
| ReAct        | `plan`, `step`, `act`, `observe`                    | Agent action loops         |
| Tools        | `function`, `result`, `error`                       | Function calling           |
| Code         | `code`, `output`, `lang`                            | Code execution             |
| RAG          | `context`, `quote`, `cite`, `source`                | Citations                  |
| Memory       | `memory`, `recall`                                  | State persistence          |
| Control      | `pad`, `stop`, `sep`                                | Sequence control           |
| Multimodal   | `image`, `audio`, `video`                           | Non-text content           |
| Document     | `title`, `section`, `summary`                       | Structured docs            |
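
For example, a ChatML-style prompt with a reasoning span can be assembled from these tokens and encoded with `encode_with_special()`. This is a sketch; the exact `<|im_start|>` / `<|im_end|>` spellings are an assumption here, see [docs/special_tokens.md](docs/special_tokens.md) for the canonical forms:

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nWhat is 6 * 7?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<|think|>6 * 7 = 42<|/think|>The answer is 42."
)

# Each special token is mapped to its single reserved ID instead of being
# split into ordinary BPE pieces
tokens = tokenizer.encode_with_special(prompt)
```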

## How It Works

Splintr implements several optimizations that make tokenization faster:

- **PCRE2 with JIT compilation**: 2-4x speedup on regex pattern matching
- **Rayon parallelism**: Leverages multiple CPU cores for batch encoding
- **Linked-list BPE algorithm**: Avoids O(N²) complexity on pathological inputs
- **FxHashMap**: Faster lookups than default SipHash for non-adversarial contexts
- **Aho-Corasick for special tokens**: Fast multi-pattern matching without regex alternation
- **LRU cache**: Avoids redundant BPE encoding of frequently seen chunks
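
The effect of the LRU cache is observable from Python via `cache_len` (a small sketch using the cache-management API shown earlier):

```python
from splintr import Tokenizer

tokenizer = Tokenizer.from_pretrained("cl100k_base")

tokenizer.clear_cache()
tokenizer.encode("the quick brown fox jumps over the lazy dog")
print(tokenizer.cache_len)  # chunks cached on the first pass

tokenizer.encode("the quick brown fox jumps over the lazy dog")
print(tokenizer.cache_len)  # typically unchanged: repeated chunks hit the cache
```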

## Use Cases

**LLM Applications:**

- Tokenizing prompts with 3-4x lower latency
- Streaming decoder for real-time output display
- Token counting for API cost estimation

**Agent Systems:**

- Building ReAct agents with structured reasoning tokens
- Tool-calling systems with function tokens
- Chain-of-Thought reasoning with thinking tokens

**Training Pipelines:**

- Fast batch encoding of large datasets (10-12x speedup)
- Preprocessing millions of documents efficiently
- Parallel tokenization across distributed systems

**RAG Applications:**

- Structured context injection with citation tokens
- Document chunking with section markers
- Source tracking through tokenization

**Data Processing:**

- Bulk document tokenization
- Multi-language text processing
- Real-time text preprocessing

## Contributing

Contributions are welcome! Here's how you can help:

1. **Report bugs**: Open an issue with a minimal reproduction case
2. **Suggest features**: Describe your use case and why the feature would be helpful
3. **Submit pull requests**:
   - Add tests for new functionality
   - Run `cargo test` and `cargo clippy` before submitting
   - Update documentation as needed

### Development Setup

```bash
# Clone the repository
git clone https://github.com/farhan-syah/splintr.git
cd splintr

# Install pre-commit hook (recommended)
cp hooks/pre-commit .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit

# Build the Rust library
cargo build --release

# Build Python bindings
pip install maturin
maturin develop --release

# Run tests
cargo test                    # Rust tests
cargo clippy --all-targets    # Linting
cargo fmt --all --check       # Format check
```

The pre-commit hook automatically runs formatting, clippy, and tests before each commit.

## Acknowledgments

Splintr builds upon concepts from:

- [tiktoken](https://github.com/openai/tiktoken) - OpenAI's reference BPE tokenizer
- [tokenizers](https://github.com/huggingface/tokenizers) - Hugging Face's tokenization library

The performance optimizations are informed by profiling real-world usage patterns in LLM applications.

## Citation

If you use Splintr in your research, please cite:

```bibtex
@software{splintr,
  author = {Farhan Syah},
  title = {Splintr: High-Performance BPE Tokenizer},
  year = {2025},
  url = {https://github.com/farhan-syah/splintr}
}
```