# tokenx-rs

A Rust port of tokenx by Johann Schopplich: fast token count estimation for LLMs at ~96% accuracy, without a full tokenizer.
## Why?

- No vocabulary files: full tokenizers like tiktoken require 2-4 MB of BPE vocabulary data
- Zero dependencies: no regex, no allocations, just a single-pass character scanner
- Fast: ~8x faster than the original Node.js implementation on Latin text
- Accurate enough: 96% accuracy is sufficient for token budget estimation and streaming display
- Universal: works across all LLM providers (OpenAI, Anthropic, Google, etc.)
## Installation

Add the crate to your `Cargo.toml`:

```toml
[dependencies]
tokenx-rs = "0.1"
```
## Usage

```rust
use tokenx_rs::estimate_token_count;

let text = "Fast token count estimation without a full tokenizer.";
let tokens = estimate_token_count(text);
println!("Estimated tokens: {tokens}");
```
### Check token limits

```rust
use tokenx_rs::is_within_token_limit;

let prompt = "Some prompt text";
if is_within_token_limit(prompt, 4096) {
    // Safe to send without truncation
}
```
### Split text into chunks

```rust
use tokenx_rs::split_by_tokens;

let text = "A long document that needs to be chunked for embedding.";
// Split into chunks of at most 512 estimated tokens each
let chunks = split_by_tokens(text, 512);
```
## How it works
The estimator makes a single pass over the input string, classifying each character and grouping runs of similar characters into segments. Each segment is scored by type:
| Segment type | Token cost |
|---|---|
| Whitespace | 0 tokens |
| CJK characters | 1 token per character |
| Digit sequences | 1 token |
| Short words (≤3 bytes) | 1 token |
| Punctuation runs | `ceil(len / 2)` |
| Alphanumeric words | `ceil(len / chars_per_token)` |
| Other (emoji, mixed scripts) | 1 token per character |
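To make the rules above concrete, here is a minimal single-pass scanner in the same spirit. This is an illustrative sketch, not the crate's actual implementation: the `estimate` function, the `CHARS_PER_TOKEN` default of 4.0, and the specific CJK code-point ranges are all assumptions.

```rust
// Assumed default: roughly 4 characters per token for English text.
const CHARS_PER_TOKEN: f64 = 4.0;

// A small, non-exhaustive set of CJK ranges (Han, Kana, Hangul).
fn is_cjk(c: char) -> bool {
    matches!(c as u32, 0x4E00..=0x9FFF | 0x3040..=0x30FF | 0xAC00..=0xD7AF)
}

fn estimate(text: &str) -> usize {
    let mut tokens = 0usize;
    let mut chars = text.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // Whitespace: 0 tokens
            while chars.peek().map_or(false, |c| c.is_whitespace()) {
                chars.next();
            }
        } else if is_cjk(c) {
            // CJK: 1 token per character
            while chars.peek().map_or(false, |&c| is_cjk(c)) {
                chars.next();
                tokens += 1;
            }
        } else if c.is_ascii_digit() {
            // Digit run: 1 token regardless of length
            while chars.peek().map_or(false, |c| c.is_ascii_digit()) {
                chars.next();
            }
            tokens += 1;
        } else if c.is_alphanumeric() {
            // Word: short words (<= 3 bytes) cost 1; longer words
            // cost ceil(len / CHARS_PER_TOKEN)
            let mut len = 0usize;
            while chars.peek().map_or(false, |&c| c.is_alphanumeric() && !is_cjk(c)) {
                len += chars.next().unwrap().len_utf8();
            }
            tokens += if len <= 3 {
                1
            } else {
                (len as f64 / CHARS_PER_TOKEN).ceil() as usize
            };
        } else if c.is_ascii_punctuation() {
            // Punctuation run: ceil(len / 2)
            let mut len = 0usize;
            while chars.peek().map_or(false, |c| c.is_ascii_punctuation()) {
                chars.next();
                len += 1;
            }
            tokens += (len + 1) / 2;
        } else {
            // Other (emoji, symbols): 1 token per character
            chars.next();
            tokens += 1;
        }
    }
    tokens
}

fn main() {
    println!("{}", estimate("Hello, world!"));
}
```

Note that segment classification happens while iterating, so no intermediate vector of segments is ever allocated.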
Language-specific diacritics (German, French, Spanish) adjust the `chars_per_token` ratio for more accurate estimates on non-English text.
The estimator uses no regex, performs no string-splitting allocations, and has no runtime dependencies.
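The diacritic adjustment could look something like the following. This is a hypothetical sketch; the function name, the diacritic set, and the ratios 3.5/4.0 are illustrative placeholders, not the crate's actual values.

```rust
// Illustrative only: pick a characters-per-token ratio based on whether the
// text contains diacritics common in German, French, or Spanish. BPE tends to
// split such words into more pieces, so assume fewer characters per token.
fn chars_per_token(text: &str) -> f64 {
    let diacritics = "äöüßàâçéèêëîïôùûñáíóú";
    if text.to_lowercase().chars().any(|c| diacritics.contains(c)) {
        3.5 // assumed ratio for diacritic-heavy text
    } else {
        4.0 // assumed English default
    }
}
```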
## Performance
Benchmarked against the original tokenx Node.js library on the same machine (Apple Silicon, Node v20, rustc 1.91):
| Input | tokenx (Node.js) | tokenx-rs (Rust) | Speedup |
|---|---|---|---|
| Short text (~9 words) | 830 ns | 107 ns | 7.8x |
| Medium text (~900 words) | 89.6 µs | 10.9 µs | 8.2x |
| Long text (~27k words) | 2.91 ms | 332 µs | 8.8x |
| CJK text (1200 chars) | 6.34 µs | 4.04 µs | 1.6x |
| Code (~100 fn blocks) | 443 µs | 52.9 µs | 8.4x |
Run the benchmarks yourself with `cargo bench`.
## Accuracy
Accuracy benchmarks from the original tokenx project, measured against tiktoken's `cl100k_base` encoding:
| Content | Actual Tokens | Estimated | Deviation |
|---|---|---|---|
| Short English text | 19 | 19 | 0.00% |
| German text with umlauts | 48 | 49 | 2.08% |
| Kafka - Metamorphosis (English) | 31,796 | 32,325 | 1.66% |
| Kafka - Die Verwandlung (German) | 35,309 | 33,970 | 3.79% |
| 道德經 - Laozi (Chinese) | 11,712 | 11,427 | 2.43% |
| 羅生門 - Akutagawa (Japanese) | 9,517 | 10,535 | 10.70% |
| TypeScript ES5 declarations | 49,293 | 51,599 | 4.68% |
## License
MIT