Expand description
Structural text tokenizer trait and whitespace implementation. Tokenization primitives used across chunking, sampling, and BM25 indexing.
§Structural tokenizers vs. model tokenizers
The Tokenizer trait and its default implementation, WhitespaceTokenizer,
are structural tokenizers — their token counts drive window sizing, prefix
budget arithmetic, and BM25 term-frequency scoring. They are not the
subword tokenizers used by embedding or language models, which include:
- BPE (Byte-Pair Encoding) — GPT-series, RoBERTa, most OpenAI encoders.
- WordPiece — BERT-family models.
- SentencePiece / Unigram — T5, LLaMA, Mistral, and most instruction-tuned LLMs.
Subword tokenizers operate on a learned vocabulary and routinely split a single word into multiple tokens. Whitespace token counts are a structural estimate, running roughly 0.75–1.3× the equivalent BPE token count depending on vocabulary and language. Exact model token counts are unnecessary and prohibitively expensive to compute without a loaded tokenizer binary.
Structs§
- Whitespace
Tokenizer - Unicode-scalar whitespace tokenizer.
Traits§
- Tokenizer
- Tokenizer over text slices.