tokenx-rs 0.1.0

Fast token count estimation for LLMs at 96% accuracy without a full tokenizer.


A Rust port of tokenx by Johann Schopplich.



Why?

  • No vocabulary files: Full tokenizers like tiktoken require 2-4MB of BPE vocabulary data
  • Zero dependencies: No regex, no allocations — just a single-pass character scanner
  • Fast: ~8x faster than the original Node.js implementation on Latin text
  • Accurate enough: 96% accuracy is sufficient for token budget estimation and streaming display
  • Universal: Works across all LLM providers (OpenAI, Anthropic, Google, etc.)

Installation

[dependencies]
tokenx-rs = "0.1"

Usage

use tokenx_rs::estimate_token_count;

let tokens = estimate_token_count("Hello, world!");
println!("Estimated tokens: {}", tokens);

Check token limits

use tokenx_rs::is_within_token_limit;

let prompt = "Explain the borrow checker in one paragraph.";
if is_within_token_limit(prompt, 4096) {
    // safe to send
}

Split text into chunks

use tokenx_rs::split_by_tokens;

let long_text = "…"; // any large document
let chunks = split_by_tokens(long_text, 1000);
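To illustrate the general idea behind token-budget chunking, here is a minimal, self-contained sketch of a greedy splitter. It is not the crate's actual split_by_tokens implementation: the function name split_by_word_budget and the rough one-token-per-word estimate are illustrative assumptions.

```rust
// Illustrative sketch of greedy token-budget chunking, NOT the crate's
// split_by_tokens implementation. For simplicity, each whitespace-separated
// word is assumed to cost roughly one token (an assumption for this sketch).
fn split_by_word_budget(text: &str, max_tokens: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current: Vec<&str> = Vec::new();

    for word in text.split_whitespace() {
        // Flush the current chunk once it reaches the budget.
        if current.len() >= max_tokens && !current.is_empty() {
            chunks.push(current.join(" "));
            current.clear();
        }
        current.push(word);
    }
    if !current.is_empty() {
        chunks.push(current.join(" "));
    }
    chunks
}
```

The real function estimates tokens per segment rather than counting words, but the packing strategy is the same: accumulate text until the budget is reached, then start a new chunk.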

How it works

The estimator makes a single pass over the input string, classifying each character and grouping runs of similar characters into segments. Each segment is scored by type:

| Segment type | Rule |
| --- | --- |
| Whitespace | 0 tokens |
| CJK characters | 1 token per character |
| Digit sequences | 1 token |
| Short words (≤3 bytes) | 1 token |
| Punctuation runs | ceil(len / 2) |
| Alphanumeric words | ceil(len / chars_per_token) |
| Other (emojis, mixed) | 1 token per character |

Language-specific diacritics (German, French, Spanish) adjust the chars_per_token ratio for more accurate estimates on non-English text.

There are no regexes, no string-splitting allocations, and no runtime dependencies.
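The single-pass scanner described above can be approximated in a few lines. This is a simplified sketch, not the crate's actual implementation: the 6-chars-per-token ratio and the collapsed rules (digits treated like words, every non-ASCII character scored as one token) are assumptions for illustration.

```rust
// Simplified sketch of the segment-scoring idea, NOT the crate's actual
// implementation. The chars-per-token ratio below is an assumed value.
fn estimate(text: &str) -> usize {
    const CHARS_PER_TOKEN: f64 = 6.0; // assumed ratio for Latin-script words

    // Score one alphanumeric run: short words are 1 token,
    // longer words cost ceil(len / chars_per_token).
    let flush = |len: usize, tokens: &mut usize| {
        if len == 0 {
            return;
        }
        if len <= 3 {
            *tokens += 1;
        } else {
            *tokens += (len as f64 / CHARS_PER_TOKEN).ceil() as usize;
        }
    };

    let mut tokens = 0usize;
    let mut word_len = 0usize; // length of the current ASCII alphanumeric run

    for c in text.chars() {
        if c.is_ascii() && c.is_alphanumeric() {
            word_len += 1;
        } else {
            flush(word_len, &mut tokens);
            word_len = 0;
            if !c.is_whitespace() {
                // Punctuation, CJK, emoji, etc.: roughly one token each
                // in this sketch (the real scanner is more granular).
                tokens += 1;
            }
        }
    }
    flush(word_len, &mut tokens);
    tokens
}
```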

Performance

Benchmarked against the original tokenx Node.js library on the same machine (Apple Silicon, Node v20, rustc 1.91):

| Input | tokenx (Node.js) | tokenx-rs (Rust) | Speedup |
| --- | --- | --- | --- |
| Short text (~9 words) | 830 ns | 107 ns | 7.8x |
| Medium text (~900 words) | 89.6 µs | 10.9 µs | 8.2x |
| Long text (~27k words) | 2.91 ms | 332 µs | 8.8x |
| CJK text (1200 chars) | 6.34 µs | 4.04 µs | 1.6x |
| Code (~100 fn blocks) | 443 µs | 52.9 µs | 8.4x |

Run benchmarks yourself with cargo bench.

Accuracy

Accuracy benchmarks from the original tokenx project (compared against tiktoken cl100k_base):

| Content | Actual tokens | Estimated | Deviation |
| --- | --- | --- | --- |
| Short English text | 19 | 19 | 0.00% |
| German text with umlauts | 48 | 49 | 2.08% |
| Kafka - Metamorphosis (English) | 31,796 | 32,325 | 1.66% |
| Kafka - Die Verwandlung (German) | 35,309 | 33,970 | 3.79% |
| 道德經 - Laozi (Chinese) | 11,712 | 11,427 | 2.43% |
| 羅生門 - Akutagawa (Japanese) | 9,517 | 10,535 | 10.70% |
| TypeScript ES5 declarations | 49,293 | 51,599 | 4.68% |
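The deviation column is the relative error of the estimate against the tiktoken count, i.e. |estimated − actual| / actual. As a quick check (the function name here is illustrative, not part of the crate):

```rust
// Relative error of an estimate, in percent: |estimated - actual| / actual.
// Helper name is illustrative; it is not part of the tokenx-rs API.
fn deviation_pct(actual: u32, estimated: u32) -> f64 {
    (estimated as f64 - actual as f64).abs() / actual as f64 * 100.0
}
```

For the German sample above: |49 − 48| / 48 ≈ 2.08%, matching the table.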

License

MIT