Crate char_token_est

§char-token-est

Estimate token counts from raw text without invoking a BPE tokenizer.

Real tokenization is fast, but it pulls in tens of MB of vocabulary data. For routing, budget gating, log lines, and progress bars, a per-model-family chars-per-token constant gets you within roughly 10% of the true token count with no dependencies.
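
Because the estimate can be off by roughly 10%, a budget gate probably wants some headroom. The following is a minimal sketch of one way to do that, not part of this crate: `fits_budget` and its `margin` parameter are hypothetical names, and the estimated count is assumed to come from `estimate`.

```rust
/// Hypothetical helper (not part of char_token_est): returns true when an
/// estimated token count, inflated by a relative safety margin, still fits
/// inside the budget.
fn fits_budget(estimated_tokens: usize, budget: usize, margin: f64) -> bool {
    // Scale the estimate up by the margin and round up, so the check errs
    // on the side of rejecting borderline inputs.
    let padded = (estimated_tokens as f64 * (1.0 + margin)).ceil() as usize;
    padded <= budget
}
```

With a margin of 0.10, an estimate of 900 tokens passes a 1000-token budget while an estimate of 950 does not. When the gate rejects an input, falling back to a real tokenizer for that one case is a reasonable design choice.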

§Example

use char_token_est::{estimate, Family};
let text = "The quick brown fox jumps over the lazy dog.";
let n = estimate(text, Family::Gpt);
assert!((9..=14).contains(&n), "got {n}");

§Calibration

Constants are derived from the average chars-per-token over a mixed corpus of typical prompts (English + code + JSON). Pure-code or non-Latin inputs deviate further; for those, call estimate_with_ratio and supply your own ratio.

Family	chars/token
`Gpt` (GPT-4/5, o3/o4 cl100k_base)	4.0
`Claude`	3.5
`Gemini`	4.0
`Llama` (Llama 3 tiktoken-32k)	3.7
`Cohere`	3.8
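
To make the table concrete, here is a minimal sketch of how a chars-per-token estimator could be built from these constants. It is not the crate's source: `SketchFamily` and `sketch_estimate` are hypothetical stand-ins, and counting Unicode scalar values and rounding up are assumptions that the crate may not share.

```rust
/// Hypothetical stand-in for the crate's `Family` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SketchFamily {
    Gpt,
    Claude,
    Gemini,
    Llama,
    Cohere,
}

impl SketchFamily {
    /// Chars-per-token constants copied from the calibration table.
    fn chars_per_token(self) -> f64 {
        match self {
            SketchFamily::Gpt => 4.0,
            SketchFamily::Claude => 3.5,
            SketchFamily::Gemini => 4.0,
            SketchFamily::Llama => 3.7,
            SketchFamily::Cohere => 3.8,
        }
    }
}

/// Assumption: count Unicode scalar values (not bytes) and round up,
/// so any non-empty text yields at least one token.
fn sketch_estimate(text: &str, family: SketchFamily) -> usize {
    let chars = text.chars().count() as f64;
    (chars / family.chars_per_token()).ceil() as usize
}
```

Under these assumptions, the 44-character pangram from the example above divides by 4.0 to give 11 for the GPT constant, which sits inside the 9 to 14 range that example asserts.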

Enums§

Family: Model family used to pick a chars-per-token ratio.

Functions§

estimate: Estimate token count for text using the family’s chars-per-token ratio.
estimate_with_ratio: Estimate token count using a caller-supplied chars-per-token ratio.
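
For inputs that the built-in constants fit poorly, a custom ratio can be measured from a handful of samples whose true token counts are already known. The sketch below shows one way to do that; `calibrate_ratio` is a hypothetical helper, not part of this crate, and the exact signature of estimate_with_ratio is not shown on this page, so passing the result along is left to the caller.

```rust
/// Hypothetical helper (not part of char_token_est): derive a
/// chars-per-token ratio from samples that pair a text with the token
/// count a real tokenizer reported for it.
fn calibrate_ratio(samples: &[(&str, usize)]) -> Option<f64> {
    // Pool characters and tokens across all samples so that longer
    // samples weigh more than short ones.
    let total_chars: usize = samples.iter().map(|(text, _)| text.chars().count()).sum();
    let total_tokens: usize = samples.iter().map(|(_, tokens)| *tokens).sum();
    if total_tokens == 0 {
        // No usable data; avoid dividing by zero.
        return None;
    }
    Some(total_chars as f64 / total_tokens as f64)
}
```

Pooling totals rather than averaging per-sample ratios keeps a few very short samples from skewing the result.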