Skip to main content

Module token_counter

Module token_counter 

Source
Expand description

Token counting for budget pipeline and savings telemetry.

Two modes are supported:

  • Tokenizer::Heuristic — fast chars / 3.5 approximation, ±20% accuracy. Used for budget enforcement, where over-/under-shoot of a few percent is absorbed by the 20% budget margin.

  • Tokenizer::Cl100kBase / Tokenizer::O200kBase — exact BPE counts via tiktoken-rs. Slower (1–10 µs per call after warm-up, plus ~5 ms one-time table load) but reproducible across host and eval scripts. Required for the §Savings Accounting reporting rule in Paper 2 — quoted savings numbers are tokenizer-dependent and must be computed against a named tokenizer.

The default tokenizer for new code is Tokenizer::Heuristic so that existing callers (budget pipeline, doc examples) keep their fast path. Telemetry / tune should explicitly pick a BPE variant.

Enums§

Tokenizer
Tokenizer choice for token counting.

Functions§

estimate_tokens
Estimate the number of tokens in text via the heuristic.
tokens_to_chars
Estimate the number of characters for a given token budget.