Expand description
§char-token-est
Estimate token counts from raw text without invoking a BPE tokenizer.
Real tokenization is fast but pulls in tens of MB of vocab data. For routing, budget gating, log lines, and progress bars you can get within ~10% accuracy with a per-model-family chars-per-token constant and no dependencies.
§Example
use char_token_est::{estimate, Family};
let text = "The quick brown fox jumps over the lazy dog.";
let n = estimate(text, Family::Gpt);
assert!(n >= 9 && n <= 14, "got {n}");§Calibration
Constants are derived from average chars-per-token over a multilingual
corpus of typical prompts (English + code + JSON). Pure-code or
non-Latin inputs deviate further; pass estimate_with_ratio to
supply your own ratio.
| Family | chars/token |
|---|---|
Gpt (GPT-4/5, o3/o4 cl100k_base) | 4.0 |
Claude | 3.5 |
Gemini | 4.0 |
Llama (Llama 3 tiktoken-32k) | 3.7 |
Cohere | 3.8 |
Enums§
- Family
- Model family used to pick a chars-per-token ratio.
Functions§
- estimate
- Estimate token count for
textusing the family’s chars-per-token. - estimate_
with_ ratio - Estimate token count using a caller-supplied chars-per-token ratio.