Module pretok_program

Expand description

Pre-tokenizer program interpreter.

Executes a PreTokProgram against an input string, producing the same sequence of pieces that the legacy pre_tokenizer_pattern regex would have produced. Mirror of @codecai/web’s pretok-program.ts and codecai’s pretok_program.py; see spec/PRETOKENIZER_PROGRAM.md for the design rationale and op set.

Why this exists in the Rust client: the regex crate doesn’t support lookaround (\s+(?!\S)) or ES2025 RegExp Pattern Modifiers ((?i:...)), both of which appear in every GPT-2-family pre_tokenizer_pattern. Without the program interpreter, the Rust BPETokenizer constructor fails before encode() runs on every shipped Qwen / Llama-3 / Phi-4 / cl100k_base map. With the interpreter, the program path bypasses regex entirely and the same maps tokenise byte-for-byte against HuggingFace.

Structs§

PreTokProgram: A compiled pre-tokenizer program. Carried alongside the legacy pre_tokenizer_pattern on v2.1+ maps. Runtimes prefer the program when present; falls back to the regex otherwise.

Enums§

CasedKind: “Title” or “upper” cased-letter shape — see PreTokOp::LettersCased.
PreTokOp: One op in a PreTokProgram. See module-level docs for semantics.

Functions§

run_pretok_program: Execute program against text, returning the same piece sequence the legacy regex pre-tokenizer would have emitted.