Module tokenizer

Token-count utilities for embedding input sizing and chunking.

v1.0.76: the tokenizers crate was removed. Token counts are now approximated from whitespace-split word counts, calibrated by the WORDS_TO_TOKENS factor (default 0.75, a conservative value for English text plus the multilingual-e5 prefix that the headless LLM invocation prepends).

For passages shorter than EMBEDDING_MAX_TOKENS words, the count is exact. For longer passages, the count is approximate but still useful for the chunking decision in src/embedder.rs::embed_passages_controlled.
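The approximation described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the function name approx_token_count is hypothetical, and the text does not say whether WORDS_TO_TOKENS multiplies or divides the word count. The sketch assumes division (0.75 words per token), since that overestimates and so fits the "conservative" description.

```rust
/// Default factor taken from the module docs. How it is applied is an
/// assumption of this sketch, not something the docs confirm.
const WORDS_TO_TOKENS: f64 = 0.75;

/// Hypothetical helper: approximate token count for `prefix` followed by `text`.
fn approx_token_count(prefix: &str, text: &str) -> usize {
    // Count whitespace-delimited words in both the prefix and the text.
    let words = prefix.split_whitespace().count() + text.split_whitespace().count();
    // Assumption: tokens ~= words / factor, rounded up to stay conservative.
    (words as f64 / WORDS_TO_TOKENS).ceil() as usize
}
```

Under these assumptions, approx_token_count("passage: ", "hello world") counts three words and returns 4.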

Functions§

count_passage_tokens: Returns the approximate token count for text when prefixed with prefix (e.g. passage: for embed_passage).
get_model_max_length: Returns the model’s maximum input length. Because there is no longer a tokenizer config, this returns the constant from constants.rs. Operators who need a different ceiling should set SQLITE_GRAPHRAG_EMBEDDING_MAX_TOKENS in the environment.
passage_token_offsets: Returns the byte-offset pairs (start, end) for each whitespace-delimited word in text. The tokenizers crate used to return true sub-word offsets; the LLM headless path doesn’t need that granularity, so we return word boundaries.
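
As a rough illustration of two of the behaviours listed above, the sketch below computes byte-offset pairs for whitespace-delimited words and resolves a maximum length from an optional override string. The names word_byte_offsets and resolve_max_tokens are hypothetical, as is the default value passed in by the caller; the real signatures and the constant in constants.rs are not shown here and may differ.

```rust
/// Hypothetical sketch: byte-offset pairs (start, end) for each
/// whitespace-delimited word in `text`. `end` is exclusive.
fn word_byte_offsets(text: &str) -> Vec<(usize, usize)> {
    let mut offsets = Vec::new();
    let mut start: Option<usize> = None;
    // char_indices yields byte positions, so multi-byte characters are handled.
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A whitespace character closes the word in progress, if any.
            if let Some(s) = start.take() {
                offsets.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Close a word that runs to the end of the input.
    if let Some(s) = start {
        offsets.push((s, text.len()));
    }
    offsets
}

/// Hypothetical sketch: use the override if it parses as an integer,
/// otherwise fall back to `default_max` (standing in for the constant).
fn resolve_max_tokens(override_value: Option<&str>, default_max: usize) -> usize {
    override_value
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(default_max)
}
```

A caller could pass std::env::var("SQLITE_GRAPHRAG_EMBEDDING_MAX_TOKENS").ok().as_deref() as the override. Each offset pair can be used to slice the original string, for example &text[start..end], which is what makes word boundaries usable for chunking.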