//! Token counting using tiktoken (cl100k_base encoding).
//!
//! Uses OpenAI's cl100k_base BPE tokenizer for accurate token estimation.
//! Anthropic uses its own tokenizer internally, but cl100k_base is a much
//! closer approximation than chars/N heuristics (roughly 5-10% error versus
//! roughly 30-50%).
//!
//! The tokenizer is initialized lazily via `once_cell` and reused across all calls.
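use once_cell::sync::Lazy;
use tiktoken_rs::CoreBPE;

/// Global tokenizer instance: initialized once, reused everywhere.
/// cl100k_base is used by GPT-4, GPT-3.5-turbo, and text-embedding-ada-002.
/// It's the closest publicly available tokenizer to what Anthropic uses.
static TOKENIZER: Lazy<CoreBPE> = Lazy::new(|| {
    // `tiktoken_rs::cl100k_base()` returns a `Result`; a failure here would
    // leave token counting unusable, so panic with a descriptive message.
    tiktoken_rs::cl100k_base().expect("failed to initialize cl100k_base tokenizer")
});
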
/// Count tokens in a string using cl100k_base BPE encoding.
///
/// This is the single source of truth for token estimation across the entire
/// codebase. No more chars/3, chars/4, or any other heuristic.
///
/// # Returns
/// Actual BPE token count (minimum 1 for non-empty strings, 0 for empty).
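pub fn count_tokens(text: &str) -> usize {
    // Illustrative sketch only: the function body is not included in this
    // section, so the name `count_tokens` and the choice of `encode_ordinary`
    // (which treats special-token text such as "<|endoftext|>" as plain text)
    // are assumptions, not necessarily what the original code does.
    if text.is_empty() {
        return 0;
    }
    // BPE yields at least one token for non-empty input; `max(1)` only makes
    // the documented minimum explicit.
    TOKENIZER.encode_ordinary(text).len().max(1)
}
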
/// Count tokens for a message with structural overhead.
///
/// Each message has ~4 tokens of overhead for role tags and separators.
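pub fn count_message_tokens(content: &str) -> usize {
    // Illustrative sketch only: the function body is not included in this
    // section. The name `count_message_tokens`, the `content` parameter, and
    // the use of `encode_ordinary` are assumptions. The overhead is added
    // unconditionally, including for empty content.
    TOKENIZER.encode_ordinary(content).len() + MESSAGE_OVERHEAD_TOKENS
}

// Hypothetical constant: the approximate per-message overhead (role tags and
// separators) quoted in the doc comment above. An estimate, not a measured value.
const MESSAGE_OVERHEAD_TOKENS: usize = 4;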