Expand description
Real tokenizer of the embedding model for accurate token counting and chunking. Token-count utilities for embedding input sizing.
Provides fast approximate token counting used to decide whether a body fits in a single chunk or requires the multi-chunk splitter.
Functions§
- count_
passage_ tokens - Counts the tokens produced by encoding
textwith the passage prefix. - get_
model_ max_ length - Returns the model’s
model_max_lengthfromtokenizer_config.json. - get_
tokenizer - Returns the process-wide
Tokenizersingleton, initializing it on first call. - passage_
token_ offsets - Returns the byte-offset pairs
(start, end)for each token intext.