Crate char_token_est

§char-token-est

Estimate token counts from raw text without invoking a BPE tokenizer.

Real tokenization is fast but pulls in tens of MB of vocab data. For routing, budget gating, log lines, and progress bars, you can get within ~10% of the true token count using a per-model-family chars-per-token constant and no dependencies.
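Because the estimate can be off by roughly 10%, a budget gate may want to pad it before comparing against a limit. The following is a minimal sketch of that idea; `fits_budget` and its `margin` parameter are hypothetical names for illustration and are not part of this crate.

```rust
/// Hypothetical helper (not part of char_token_est): decide whether an
/// estimated token count fits a budget once an error margin is applied.
/// `margin` is a fraction, e.g. 0.10 for the ~10% error mentioned above.
pub fn fits_budget(estimated_tokens: usize, budget_tokens: usize, margin: f64) -> bool {
    // Pad the estimate upward so an under-count is less likely to
    // push the real request over the budget.
    let padded = (estimated_tokens as f64 * (1.0 + margin)).ceil() as usize;
    padded <= budget_tokens
}
```

With a 10% margin, an estimate of 900 tokens passes a 1000-token budget while an estimate of 950 does not.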

§Example

use char_token_est::{estimate, Family};
let text = "The quick brown fox jumps over the lazy dog.";
let n = estimate(text, Family::Gpt);
assert!(n >= 9 && n <= 14, "got {n}");

§Calibration

Constants are derived from the average chars-per-token over a mixed corpus of typical prompts (English + code + JSON). Pure-code or non-Latin inputs deviate more from these averages; for those, call estimate_with_ratio and supply your own ratio.

Family                              chars/token
Gpt (GPT-4/5, o3/o4 cl100k_base)    4.0
Claude                              3.5
Gemini                              4.0
Llama (Llama 3 tiktoken-32k)        3.7
Cohere                              3.8
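
To make the table concrete, here is a rough sketch of how a per-family constant could turn into an estimate. It is not the crate's source: `FamilySketch` and `estimate_sketch` are hypothetical stand-ins, and counting Unicode scalar values and rounding up are assumptions made for illustration.

```rust
/// Hypothetical stand-in for the crate's `Family` enum. Variant names
/// mirror the calibration table; this is not the crate's source.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FamilySketch {
    Gpt,
    Claude,
    Gemini,
    Llama,
    Cohere,
}

impl FamilySketch {
    /// Chars-per-token constants copied from the calibration table.
    pub fn chars_per_token(self) -> f64 {
        match self {
            FamilySketch::Gpt => 4.0,
            FamilySketch::Claude => 3.5,
            FamilySketch::Gemini => 4.0,
            FamilySketch::Llama => 3.7,
            FamilySketch::Cohere => 3.8,
        }
    }
}

/// Assumption: count Unicode scalar values (not bytes) and round up,
/// so any non-empty text is estimated at one token or more.
pub fn estimate_sketch(text: &str, family: FamilySketch) -> usize {
    let chars = text.chars().count() as f64;
    (chars / family.chars_per_token()).ceil() as usize
}
```

Under these assumptions the 44-character pangram from the example comes out at 11 tokens for Gpt and 13 for Claude; the crate's actual rounding may differ.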

Enums§

Family
Model family used to pick a chars-per-token ratio.

Functions§

estimate
Estimate token count for text using the family’s chars-per-token ratio.
estimate_with_ratio
Estimate token count using a caller-supplied chars-per-token ratio.
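
For inputs the built-in constants fit poorly, one option is to measure a representative sample once with a real tokenizer and derive a ratio from it. The sketch below illustrates that workflow under stated assumptions: `calibrate_ratio` and `estimate_with_ratio_sketch` are hypothetical names, and the real estimate_with_ratio may use a different signature or rounding.

```rust
/// Hypothetical helper: derive a chars-per-token ratio from a sample
/// whose true token count was measured once with a real tokenizer.
/// Returns None when the sample or the count is empty.
pub fn calibrate_ratio(sample: &str, measured_tokens: usize) -> Option<f64> {
    let chars = sample.chars().count();
    if chars == 0 || measured_tokens == 0 {
        return None; // no usable ratio from an empty sample
    }
    Some(chars as f64 / measured_tokens as f64)
}

/// Stand-in for `estimate_with_ratio`; the crate's real signature and
/// rounding are not shown in the docs above, so these are assumptions.
pub fn estimate_with_ratio_sketch(text: &str, chars_per_token: f64) -> usize {
    assert!(
        chars_per_token.is_finite() && chars_per_token > 0.0,
        "ratio must be a positive, finite number"
    );
    // Count Unicode scalar values and round up (assumed behavior).
    (text.chars().count() as f64 / chars_per_token).ceil() as usize
}
```

A 10-character sample measured at 4 tokens yields a ratio of 2.5, which estimates a 15-character input at 6 tokens.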