Token-count utilities for embedding input sizing and chunking, replacing the embedding model's real tokenizer.
As of v1.0.76 the tokenizers crate has been removed. Token counts are now
approximated from whitespace-split word counts, calibrated by a
WORDS_TO_TOKENS factor (default 0.75, a conservative value for English text plus
the multilingual-e5 prefix that the LLM headless invocation prepends).
For passages shorter than EMBEDDING_MAX_TOKENS words, the count
is exact. For longer passages, the count is approximate but still
useful for the chunking decision in src/embedder.rs::embed_passages_controlled.
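The word-count approximation described above might be sketched as follows. This is a minimal illustration, not the crate's implementation: the function name `approx_passage_tokens` is hypothetical, and the sketch assumes that WORDS_TO_TOKENS means "words per token", so the word count is divided by 0.75 and rounded up. The source does not state which direction the factor is applied in; if the crate multiplies instead, the arithmetic changes accordingly.

```rust
/// Assumed meaning: average number of whitespace-delimited words per token.
/// The value 0.75 comes from the module description; the interpretation
/// (divide rather than multiply) is an assumption of this sketch.
const WORDS_TO_TOKENS: f64 = 0.75;

/// Hypothetical helper: approximates the token count of `text` once
/// `prefix` is prepended, using only whitespace-split word counts.
fn approx_passage_tokens(prefix: &str, text: &str) -> usize {
    // Count words in the prefix and the passage separately, so the
    // estimate does not depend on how the two strings are joined.
    let words = prefix.split_whitespace().count() + text.split_whitespace().count();
    // Round up so that a fractional token never shrinks the estimate.
    (words as f64 / WORDS_TO_TOKENS).ceil() as usize
}
```

Under these assumptions, `approx_passage_tokens("passage: ", "the quick brown fox jumps over")` counts 7 words and returns 10, and a chunking routine would compare that estimate against the configured maximum before deciding whether to split the passage.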
Functions
- count_passage_tokens - Returns the approximate token count for text when prefixed with prefix (e.g. passage: for embed_passage).
- get_model_max_length - Returns the model's max input length. Since we no longer have a tokenizer config, this returns the constant from constants.rs. Operators that need a different ceiling should set SQLITE_GRAPHRAG_EMBEDDING_MAX_TOKENS in the environment.
- passage_token_offsets - Returns the byte-offset pairs (start, end) for each whitespace-delimited word in text. The tokenizers crate used to return true sub-word offsets; the LLM headless path doesn't need that granularity, so we return word boundaries.
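The last two entries can be illustrated with a short sketch. Everything named here is hypothetical: `word_byte_offsets` and `max_tokens_from_env` are illustrative helpers rather than the crate's functions, the fallback value 512 is a placeholder for the constant in constants.rs (whose actual value is not given here), and reading the environment variable inside the helper is an assumption about how the override is applied.

```rust
use std::env;

/// Placeholder fallback; the real constant lives in constants.rs and
/// its value is not stated in this documentation.
const DEFAULT_MAX_TOKENS: usize = 512;

/// Hypothetical helper: returns the ceiling from
/// SQLITE_GRAPHRAG_EMBEDDING_MAX_TOKENS when it parses as a positive
/// integer, otherwise the fallback constant.
fn max_tokens_from_env() -> usize {
    env::var("SQLITE_GRAPHRAG_EMBEDDING_MAX_TOKENS")
        .ok()
        .and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_TOKENS)
}

/// Hypothetical helper: returns (start, end) byte offsets, end exclusive,
/// for each whitespace-delimited word in `text`.
fn word_byte_offsets(text: &str) -> Vec<(usize, usize)> {
    let mut offsets = Vec::new();
    let mut start: Option<usize> = None;
    // char_indices yields byte positions, so multi-byte characters
    // are handled without slicing inside a code point.
    for (i, ch) in text.char_indices() {
        if ch.is_whitespace() {
            // A whitespace character closes the word in progress, if any.
            if let Some(s) = start.take() {
                offsets.push((s, i));
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Close a word that runs to the end of the input.
    if let Some(s) = start {
        offsets.push((s, text.len()));
    }
    offsets
}
```

Each returned pair can be used directly as a slice range, for example `&text[start..end]`, which is enough for chunking on word boundaries without sub-word granularity.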