Crate entelix_tokenizer_tiktoken

§entelix-tokenizer-tiktoken

Vendor-accurate TokenCounter for OpenAI’s BPE tokenizer family: cl100k_base, o200k_base, p50k_base, and r50k_base. Wraps tiktoken-rs and preloads the BPE tables eagerly at construction, so each count call stays synchronous, as the TokenCounter contract requires.

§Encoding to model mapping

The mapping is left to operators by design: OpenAI changes which encoding each model uses over time, and a stale mapping pinned in this crate would miscount silently instead of causing a build failure. Pick the encoding for your target model, and the wrapper preloads the matching BPE tables.
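
One way an operator might own that mapping is sketched below. The `Encoding` enum is a stand-in for this crate's `TiktokenEncoding` (its variant names are not shown on this page, so the ones here are assumptions), and `encoding_for_model` is a hypothetical helper. The model names are a snapshot and should be re-verified against OpenAI's current documentation before use.

```rust
/// Stand-in for the crate's `TiktokenEncoding`; variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Cl100kBase,
    O200kBase,
    P50kBase,
    R50kBase,
}

/// Hypothetical operator-owned lookup. The entries are a point-in-time
/// snapshot; keep the table in your own code and review it when models change.
pub fn encoding_for_model(model: &str) -> Option<Encoding> {
    match model {
        "gpt-4o" | "gpt-4o-mini" => Some(Encoding::O200kBase),
        "gpt-4" | "gpt-3.5-turbo" => Some(Encoding::Cl100kBase),
        "text-davinci-003" => Some(Encoding::P50kBase),
        "davinci" => Some(Encoding::R50kBase),
        // Unknown models return None so the caller must decide explicitly
        // rather than silently counting with the wrong encoding.
        _ => None,
    }
}
```

Returning `None` for unrecognised models keeps a stale table from miscounting quietly: the caller has to handle the gap at the point of use.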

§Why eager preload

The TokenCounter trait is intentionally synchronous: counters are called from hot dispatch paths (pre-flight RunBudget checks, splitter sizing), where awaiting a lazy table load would introduce unbounded latency. TiktokenCounter therefore loads the BPE tables eagerly inside TiktokenCounter::for_encoding and caches them behind an Arc. Cloning a TiktokenCounter is cheap, whereas constructing a fresh one re-parses the embedded tables, so prefer clone for fan-out.
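
The pattern described above can be illustrated with a standard-library-only sketch. Everything in it is hypothetical: `EagerCounter`, `BpeTables`, `load`, and `fan_out` are illustrative names, not this crate's API, and the "count" is a toy word lookup rather than real BPE. It only shows the shape of the design: pay the load cost once in the constructor, share the result behind an `Arc`, and clone for each worker.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Stand-in for parsed BPE tables (hypothetical; real tables are far larger).
struct BpeTables {
    ranks: HashMap<String, u32>,
}

/// Hypothetical counter illustrating eager load plus `Arc` sharing.
#[derive(Clone)]
pub struct EagerCounter {
    tables: Arc<BpeTables>,
}

impl EagerCounter {
    /// All parsing cost is paid here, once, at construction.
    pub fn load(vocab: &[&str]) -> Self {
        let ranks = vocab
            .iter()
            .enumerate()
            .map(|(i, t)| ((*t).to_string(), i as u32))
            .collect();
        Self { tables: Arc::new(BpeTables { ranks }) }
    }

    /// Synchronous and allocation-free: a toy count of known words, not BPE.
    pub fn count(&self, text: &str) -> usize {
        text.split_whitespace()
            .filter(|w| self.tables.ranks.contains_key(*w))
            .count()
    }

    /// True when both counters point at the same underlying tables.
    pub fn shares_tables_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.tables, &other.tables)
    }
}

/// Fan-out: each worker gets a clone, which copies only the `Arc` pointer.
pub fn fan_out(counter: &EagerCounter, inputs: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let worker = counter.clone();
            thread::spawn(move || worker.count(&text))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

In this sketch a clone shares the tables with its source, while a second call to `load` builds an independent copy, which is the cost the paragraph above recommends avoiding.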

Structs§

TiktokenCounter
TokenCounter impl backed by tiktoken-rs.

Enums§

TiktokenEncoding
OpenAI BPE encoding family. Pick the variant matching the target model; see the “Encoding to model mapping” section of the crate-level docs.
TiktokenError
Errors raised when constructing a TiktokenCounter.