entelix-tokenizer-tiktoken
Vendor-accurate TokenCounter
for OpenAI's BPE tokenizer family — cl100k_base, o200k_base,
p50k_base, r50k_base. Wraps tiktoken-rs
with eager BPE preload at construction so the per-call count stays
synchronous per the TokenCounter contract.
[]
= "0.4"
= "0.4"
use Arc;
use ;
use ;
let counter: =
new;
let config = default.with_token_counter;
assert_eq!;
Encoding to model mapping
| Encoding | Models |
|---|---|
Cl100kBase |
GPT-3.5-turbo, GPT-4, GPT-4-turbo, text-embedding-3-* |
O200kBase |
GPT-4o, GPT-4o-mini, o1, o3, o3-mini, o4 |
P50kBase |
GPT-3 davinci, codex |
R50kBase |
GPT-3 ada / babbage / curie, GPT-2 |
The mapping is left to operators by design — OpenAI changes it over time, and accidentally pinning to a stale mapping silently miscounts without surfacing a build failure. Pick the encoding for your target model from the table above and the wrapper preloads the matching BPE tables.
Position in the workspace
Companion crate to entelix-core — TokenCounter is the trait
surface; this crate provides one of several concrete impls. Sister
crates ship the HuggingFace Tokenizers backend (entelix-tokenizer-hf)
and locale-aware morphological counters (Korean / Japanese).
The zero-dependency entelix_core::ByteCountTokenCounter ships in
core for development scaffolding; production multilingual workloads
should swap in this crate (or a sibling) at
ChatModelConfig::with_token_counter.
License
MIT