§entelix-tokenizer-tiktoken
Vendor-accurate TokenCounter for OpenAI’s BPE tokenizer family —
cl100k_base, o200k_base, p50k_base, r50k_base. Wraps
tiktoken-rs with eager BPE
preload at construction so the per-call count stays synchronous
per the TokenCounter contract.
§Encoding to model mapping
- TiktokenEncoding::Cl100kBase: GPT-3.5-turbo, GPT-4, GPT-4-turbo, text-embedding-3-*.
- TiktokenEncoding::O200kBase: GPT-4o, GPT-4o-mini, o1, o3, o3-mini, o4.
- TiktokenEncoding::P50kBase: GPT-3 davinci, codex.
- TiktokenEncoding::R50kBase: GPT-3 ada / babbage / curie, GPT-2.
The mapping is deliberately left to operators: OpenAI changes it over time, and a stale pinned mapping would miscount silently, with no build failure to flag it. Pick the encoding for your target model; the wrapper preloads the matching BPE tables.
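As a rough illustration of an operator-owned mapping, the sketch below resolves a model name to an encoding and returns `None` for anything it does not recognise, so an unknown model is surfaced rather than counted with the wrong tables. The `Encoding` enum is a local stand-in that only borrows the variant names of `TiktokenEncoding`, and `encoding_for_model` and the model-name strings are assumptions made for this example, not part of the crate.

```rust
/// Local stand-in for the crate's `TiktokenEncoding` (variant names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Cl100kBase,
    O200kBase,
    P50kBase,
    R50kBase,
}

/// Hypothetical operator-side lookup. Unknown models yield `None` so the
/// caller has to decide explicitly instead of silently miscounting.
fn encoding_for_model(model: &str) -> Option<Encoding> {
    // Order matters: "gpt-4o" must be tested before the broader "gpt-4".
    const PREFIXES: &[(&str, Encoding)] = &[
        ("gpt-4o", Encoding::O200kBase),
        ("o1", Encoding::O200kBase),
        ("o3", Encoding::O200kBase),
        ("o4", Encoding::O200kBase),
        ("gpt-4", Encoding::Cl100kBase),
        ("gpt-3.5-turbo", Encoding::Cl100kBase),
        ("text-embedding-3-", Encoding::Cl100kBase),
        ("text-davinci-", Encoding::P50kBase),
        ("code-davinci-", Encoding::P50kBase),
    ];
    for (prefix, encoding) in PREFIXES {
        if model.starts_with(*prefix) {
            return Some(*encoding);
        }
    }
    // Older base models are matched by exact name only.
    match model {
        "ada" | "babbage" | "curie" | "gpt2" => Some(Encoding::R50kBase),
        _ => None,
    }
}

fn main() {
    for model in ["gpt-4o-mini", "gpt-4-turbo", "curie", "unknown-model"] {
        println!("{model}: {:?}", encoding_for_model(model));
    }
}
```

Keeping this table in application code means it can be updated alongside the models actually deployed, which is the trade-off the paragraph above describes.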
§Why eager preload
The TokenCounter trait is intentionally synchronous — counters
get called from inside hot dispatch paths (pre-flight RunBudget
checks, splitter sizing) where awaiting on a lazy table-load
introduces unbounded latency. TiktokenCounter therefore loads the
BPE tables eagerly inside TiktokenCounter::for_encoding and
caches them behind an Arc. Cloning a TiktokenCounter is
cheap; constructing a fresh one re-parses the embedded tables, so prefer
clone for fan-out.
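The load-once, clone-for-fan-out pattern can be sketched with standard-library types alone. Everything here is a stand-in: `Tables`, `Counter`, `Counter::new`, and `fan_out` are hypothetical names, and the `count` body is a toy word lookup rather than BPE tokenization. Only the shape (eager construction, `Arc`-shared tables, synchronous counting) mirrors what the section describes.

```rust
use std::sync::Arc;
use std::thread;

/// Placeholder for parsed BPE tables (the real ones come from tiktoken-rs).
struct Tables {
    vocab: Vec<String>,
}

/// Stand-in for `TiktokenCounter`: tables are built once and shared.
#[derive(Clone)]
struct Counter {
    tables: Arc<Tables>,
}

impl Counter {
    /// Eager load: the parsing cost is paid here, not on the first `count`.
    fn new() -> Self {
        let vocab = ["hello", "world"].iter().map(|s| s.to_string()).collect();
        Counter { tables: Arc::new(Tables { vocab }) }
    }

    /// Synchronous and read-only. Toy logic, NOT BPE: counts the
    /// whitespace-separated words present in the placeholder vocab.
    fn count(&self, text: &str) -> usize {
        text.split_whitespace()
            .filter(|w| self.tables.vocab.iter().any(|v| v.as_str() == *w))
            .count()
    }
}

/// Fan out over threads, cloning the counter per worker.
fn fan_out(counter: &Counter, inputs: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            // Clone only bumps the Arc refcount; the tables are not re-parsed.
            let worker = counter.clone();
            thread::spawn(move || worker.count(&text))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let counter = Counter::new();
    let inputs = vec!["hello world".to_string(), "hello there".to_string()];
    println!("{:?}", fan_out(&counter, inputs));
}
```

Because every clone points at the same allocation, the per-worker cost is one atomic increment, whereas constructing a new counter per worker would repeat the table parse each time.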
Structs§
- TiktokenCounter - TokenCounter impl backed by tiktoken-rs.
Enums§
- TiktokenEncoding - OpenAI BPE encoding family. Pick the variant matching the target model; see the crate-level docs for the model-to-encoding table.
- TiktokenError - Errors raised when constructing a TiktokenCounter.