Crate entelix_tokenizer_tiktoken


§entelix-tokenizer-tiktoken

Vendor-accurate TokenCounter for OpenAI’s BPE tokenizer family — cl100k_base, o200k_base, p50k_base, r50k_base. Wraps tiktoken-rs with eager BPE preload at construction so the per-call count stays synchronous per the TokenCounter contract.

§Encoding to model mapping

TiktokenEncoding::Cl100kBase: GPT-3.5-turbo, GPT-4, GPT-4-turbo, text-embedding-3-*.
TiktokenEncoding::O200kBase: GPT-4o, GPT-4o-mini, o1, o3, o3-mini, o4-mini.
TiktokenEncoding::P50kBase: text-davinci-002, text-davinci-003, Codex (code-davinci-*).
TiktokenEncoding::R50kBase: GPT-3 base models (ada, babbage, curie, davinci), GPT-2.

The mapping is left to operators by design. OpenAI changes it over time, and a stale mapping pinned by accident would miscount silently instead of surfacing a build failure. Pick the encoding for your target model, and the wrapper preloads the matching BPE tables.
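Because the choice of encoding stays with the operator, one option is to keep the model lookup in application code, where it can be reviewed and updated. The sketch below is illustrative only: `Encoding` is a local stand-in that mirrors the variant names of `TiktokenEncoding` so the example compiles on its own, and `encoding_for_model` is a hypothetical helper, not part of this crate. It covers only a few model names from the table above and returns `None` for anything it does not recognise, so a stale table shows up as an explicit error rather than a silent miscount.

```rust
/// Local stand-in mirroring the variant names of `TiktokenEncoding`
/// (assumption: defined here only so the sketch is self-contained).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Cl100kBase,
    O200kBase,
    P50kBase,
    R50kBase,
}

/// Hypothetical operator-owned lookup based on the table above.
/// Unknown models yield `None` instead of falling back to a default.
pub fn encoding_for_model(model: &str) -> Option<Encoding> {
    let m = model.to_ascii_lowercase();
    // Check `gpt-4o` before `gpt-4`: the shorter prefix matches both.
    if m.starts_with("gpt-4o") || m.starts_with("o1") || m.starts_with("o3") {
        Some(Encoding::O200kBase)
    } else if m.starts_with("gpt-4")
        || m.starts_with("gpt-3.5-turbo")
        || m.starts_with("text-embedding-3-")
    {
        Some(Encoding::Cl100kBase)
    } else if m == "text-davinci-002"
        || m == "text-davinci-003"
        || m.starts_with("code-davinci-")
    {
        Some(Encoding::P50kBase)
    } else if m == "ada" || m == "babbage" || m == "curie" {
        Some(Encoding::R50kBase)
    } else {
        None
    }
}
```

A caller would translate the returned stand-in into the real `TiktokenEncoding` variant of the same name and treat `None` as a configuration error.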

§Why eager preload

The TokenCounter trait is intentionally synchronous: counters are called from inside hot dispatch paths (pre-flight RunBudget checks, splitter sizing), where waiting on a lazy table load would introduce unbounded latency. TiktokenCounter therefore loads the BPE tables eagerly inside TiktokenCounter::for_encoding and caches them behind an Arc. Cloning a TiktokenCounter is cheap; constructing a fresh one re-parses the embedded tables, so prefer clone for fan-out.
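The eager-load-then-share pattern described above can be sketched with the standard library alone. Everything here is a stand-in: `BpeTables`, `EagerCounter`, and the word-matching "count" are hypothetical and do not reflect this crate's internals or real BPE tokenization. The point is only that the expensive parse runs once in the constructor, the counting method stays synchronous, and a clone shares the parsed tables through the `Arc`.

```rust
use std::sync::Arc;

/// Stand-in for parsed BPE tables (assumption: the real tables come
/// from tiktoken-rs and are far larger).
struct BpeTables {
    vocab: Vec<String>,
}

impl BpeTables {
    /// Simulates the expensive parse step that runs once at construction.
    fn parse(raw: &str) -> Self {
        BpeTables {
            vocab: raw.split_whitespace().map(str::to_owned).collect(),
        }
    }
}

/// Hypothetical counter illustrating eager loading behind an `Arc`.
#[derive(Clone)]
pub struct EagerCounter {
    tables: Arc<BpeTables>,
}

impl EagerCounter {
    /// All loading work happens here, before any hot path runs.
    pub fn new(raw_tables: &str) -> Self {
        EagerCounter {
            tables: Arc::new(BpeTables::parse(raw_tables)),
        }
    }

    /// Synchronous count with no I/O and no lazy initialisation.
    /// Toy logic: counts whitespace-separated words present in the vocab.
    pub fn count(&self, text: &str) -> usize {
        text.split_whitespace()
            .filter(|w| self.tables.vocab.iter().any(|v| v == w))
            .count()
    }

    /// True when both counters point at the same parsed tables.
    pub fn shares_tables_with(&self, other: &EagerCounter) -> bool {
        Arc::ptr_eq(&self.tables, &other.tables)
    }
}
```

In this sketch a clone only bumps the reference count, while a second call to `new` parses the tables again, which is why cloning is the better choice when handing a counter to several workers.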

Structs§

TiktokenCounter: TokenCounter impl backed by tiktoken-rs.

Enums§

TiktokenEncoding: OpenAI BPE encoding family. Pick the variant matching the target model — see the crate-level docs for the model-to-encoding table.
TiktokenError: Errors raised when constructing a TiktokenCounter.