pub enum PreTokOp {
LiteralsCi {
patterns: Vec<String>,
},
Literals {
patterns: Vec<String>,
},
Letters {
lead_other: Option<bool>,
lead_space: Option<bool>,
},
Numbers {
max_run: Option<u32>,
lead_space: Option<bool>,
},
PunctRun {
lead_space: Option<bool>,
trailing_newlines: Option<bool>,
trailing_chars: Option<String>,
},
LettersCased {
kind: CasedKind,
lead_other: Option<bool>,
trailing_ci: Option<Vec<String>>,
},
NewlineBlock {},
TrailingWs {},
WsRun {},
MetaspaceSplit {
prefix_first: Option<bool>,
},
}Expand description
One op in a PreTokProgram. See module-level docs for semantics.
Variants§
LiteralsCi
(?i:p1|p2|...) — match the longest case-insensitive literal.
Literals
Case-sensitive literal alternatives — like LiteralsCi but matches
case-exact. Used by older OpenAI tokenizers (p50k_base, r50k_base).
Letters
\p{L}+, [^\r\n\p{L}\p{N}]?\p{L}+ when lead_other, or
?\p{L}+ when lead_space. The two lead flags are mutually
exclusive — lead_space is the older-OpenAI shape, lead_other
is the GPT-2 / Qwen / Llama-3 shape.
Numbers
\p{N}+ (unbounded) or \p{N}{1,K} when max_run > 0; with optional
? literal-space lead for older OpenAI tokenizers.
PunctRun
[ ?][^\s\p{L}\p{N}]+[\r\n]* with toggleable lead-space and
trailing-newlines.
Fields
LettersCased
Cased-letter run with optional trailing case-insensitive contractions.
Used by o200k_base / mistral-nemo, which split on case boundaries.
kind: "title" matches [Lu Lt Lm Lo M]* [Ll Lm Lo M]+,
kind: "upper" matches [Lu Lt Lm Lo M]+ [Ll Lm Lo M]*.
NewlineBlock
\s*[\r\n]+ — paragraph break with leading indentation.
TrailingWs
\s+(?!\S) — whitespace at end of input (or with only more ws after).
WsRun
\s+ — generic whitespace catchall (always last in GPT-2 programs).
MetaspaceSplit
SentencePiece-style splitter — single-op programs only.