Expand description
Pre-tokenizer program interpreter.
Executes a PreTokProgram against an input string, producing the
same sequence of pieces that the legacy pre_tokenizer_pattern regex
would have produced. Mirror of @codecai/web’s pretok-program.ts
and codecai’s pretok_program.py; see
spec/PRETOKENIZER_PROGRAM.md
for the design rationale and op set.
Why this exists in the Rust client: the regex crate doesn’t support
lookaround (\s+(?!\S)) or ES2025 RegExp Pattern Modifiers
((?i:...)), both of which appear in every GPT-2-family
pre_tokenizer_pattern. Without the program interpreter, the Rust
BPETokenizer constructor fails before encode() runs on every
shipped Qwen / Llama-3 / Phi-4 / cl100k_base map. With the
interpreter, the program path bypasses regex entirely and the same
maps tokenise byte-for-byte against HuggingFace.
Structs§
- PreTok
Program - A compiled pre-tokenizer program. Carried alongside the legacy
pre_tokenizer_patternon v2.1+ maps. Runtimes prefer the program when present; falls back to the regex otherwise.
Enums§
- Cased
Kind - “Title” or “upper” cased-letter shape — see
PreTokOp::LettersCased. - PreTok
Op - One op in a
PreTokProgram. See module-level docs for semantics.
Functions§
- run_
pretok_ program - Execute
programagainsttext, returning the same piece sequence the legacy regex pre-tokenizer would have emitted.