Module tokenizer

Structural text tokenizer trait and whitespace implementation. Tokenization primitives used across chunking, sampling, and BM25 indexing.

§Structural tokenizers vs. model tokenizers

The Tokenizer trait and its default implementation, WhitespaceTokenizer, are structural tokenizers — their token counts drive window sizing, prefix budget arithmetic, and BM25 term-frequency scoring. They are not the subword tokenizers used by embedding or language models, which include:

  • BPE (Byte-Pair Encoding) — GPT-series, RoBERTa, most OpenAI encoders.
  • WordPiece — BERT-family models.
  • SentencePiece / Unigram — T5, LLaMA, Mistral, and most instruction-tuned LLMs.

Subword tokenizers operate on a learned vocabulary and routinely split a single word into multiple tokens. Whitespace token counts are therefore only a structural estimate, running roughly 0.75–1.3× the equivalent BPE token count depending on vocabulary and language. Exact model token counts are not needed for window sizing or budget arithmetic, and computing them would require loading the model's tokenizer, which is prohibitively expensive for these purposes.
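To make the idea of a structural token count concrete, here is a minimal sketch of a whitespace tokenizer and a window splitter built on it. The names `SketchTokenizer`, `SketchWhitespace`, and `windows` are hypothetical and are not this module's API; the real `Tokenizer` trait and `WhitespaceTokenizer` may expose different methods and signatures. The sketch relies only on the standard library's `str::split_whitespace`.

```rust
/// Hypothetical stand-in for the module's `Tokenizer` trait.
/// The real trait's method names and signatures may differ.
pub trait SketchTokenizer {
    /// Split `text` into borrowed token slices.
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;

    /// Number of structural tokens in `text`.
    fn count(&self, text: &str) -> usize {
        self.tokenize(text).len()
    }
}

/// Hypothetical whitespace tokenizer (an assumption, not the module's
/// `WhitespaceTokenizer`). `str::split_whitespace` splits on Unicode
/// `White_Space` characters and never yields empty tokens.
pub struct SketchWhitespace;

impl SketchTokenizer for SketchWhitespace {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }
}

/// Group the tokens of `text` into consecutive windows of at most
/// `max_tokens` structural tokens. The last window may be shorter.
pub fn windows<'a, T: SketchTokenizer>(
    tok: &T,
    text: &'a str,
    max_tokens: usize,
) -> Vec<Vec<&'a str>> {
    assert!(max_tokens > 0, "window size must be non-zero");
    tok.tokenize(text)
        .chunks(max_tokens)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```

With this sketch, `SketchWhitespace.count("the quick brown fox")` returns 4 no matter how a model's subword vocabulary would split those words, which is why such counts serve as a cheap sizing estimate and not as a model token budget.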

Structs§

WhitespaceTokenizer
Unicode-scalar whitespace tokenizer.

Traits§

Tokenizer
Tokenizer over text slices.