Module tokenizer

Expand description

Structural text tokenizer trait and whitespace implementation. Tokenization primitives used across chunking, sampling, and BM25 indexing.

§Structural tokenizers vs. model tokenizers

The Tokenizer trait and its default implementation, WhitespaceTokenizer, are structural tokenizers — their token counts drive window sizing, prefix budget arithmetic, and BM25 term-frequency scoring. They are not the subword tokenizers used by embedding or language models, which include:

BPE (Byte-Pair Encoding) — GPT-series, RoBERTa, most OpenAI encoders.
WordPiece — BERT-family models.
SentencePiece / Unigram — T5, LLaMA, Mistral, and most instruction-tuned LLMs.

Subword tokenizers operate on a learned vocabulary and routinely split a single word into multiple tokens. Whitespace token counts are a structural estimate, running roughly 0.75–1.3× the equivalent BPE token count depending on vocabulary and language. Exact model token counts are unnecessary and prohibitively expensive to compute without a loaded tokenizer binary.

Structs§

WhitespaceTokenizer: Unicode-scalar whitespace tokenizer.

Traits§

Tokenizer: Tokenizer over text slices.