pub enum TokenCounterKind {
    HuggingFace {
        model_id: String,
    },
    HuggingFaceFile {
        path: PathBuf,
    },
    TikToken,
    Word,
}
Selects which token counting implementation to use.
from_env() picks the best available counter based on env vars and the current
embedding provider setting. WordCounter is the last-resort fallback.
Variants
HuggingFace
Accurate BPE/WordPiece via a HuggingFace tokenizer model ID (requires network or cache).
HuggingFaceFile
Accurate BPE/WordPiece from a local tokenizer.json file.
TikToken
TikToken cl100k_base BPE (for OpenAI models).
Word
Whitespace word count. Last-resort fallback.
Implementations
impl TokenCounterKind
pub fn from_env() -> Self
Determine the best available token counter from the environment.
Mirrors Python’s LiteLLMEmbeddingEngine.get_tokenizer() logic, which selects a
tokenizer based on the provider and stores it on the engine instance. Python’s
chunk_by_sentence() calls embedding_engine.tokenizer.count_tokens() directly —
the tokenizer is a property of the engine, not a separate config. The Rust design
decouples them (TokenCounterKind is independent of the engine), but the selection
logic below preserves the same provider → tokenizer mapping.
Priority order (highest wins):
1. COGNEE_TOKEN_COUNTER=tiktoken → TikToken
2. COGNEE_TOKEN_COUNTER=huggingface or COGNEE_TOKEN_COUNTER=hf → check HUGGINGFACE_TOKENIZER
3. HUGGINGFACE_TOKENIZER env var is set → HuggingFace { model_id }
4. EMBEDDING_PROVIDER=onnx or fastembed, and EMBEDDING_TOKENIZER_PATH is set and the file exists → HuggingFaceFile
5. EMBEDDING_PROVIDER=openai or openai_compatible → TikToken
6. EMBEDDING_PROVIDER=ollama and HUGGINGFACE_TOKENIZER set → HuggingFace
7. Fallback → Word
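The priority order above can be sketched as a pure function. This is an illustrative approximation, not the crate's actual from_env() body: the enum is redeclared locally so the example compiles on its own, and the helper names var and select_kind, the explicit variable map, and the file_exists callback are all hypothetical. It also assumes values are matched exactly (lowercase), that empty values count as unset, and that a huggingface/hf override with HUGGINGFACE_TOKENIZER unset simply continues down the list.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Local stand-in for the documented enum, redeclared so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenCounterKind {
    HuggingFace { model_id: String },
    HuggingFaceFile { path: PathBuf },
    TikToken,
    Word,
}

/// Hypothetical helper: look up a variable, treating an empty value as unset.
fn var<'a>(env: &HashMap<&str, &'a str>, key: &str) -> Option<&'a str> {
    env.get(key).copied().filter(|v| !v.is_empty())
}

/// Hypothetical helper: applies the documented priority order to an explicit
/// variable map (instead of the process environment) so it is easy to test.
pub fn select_kind(
    env: &HashMap<&str, &str>,
    file_exists: impl Fn(&Path) -> bool,
) -> TokenCounterKind {
    let provider = var(env, "EMBEDDING_PROVIDER").unwrap_or("");

    // Rule 1: an explicit tiktoken override wins over everything else.
    if var(env, "COGNEE_TOKEN_COUNTER") == Some("tiktoken") {
        return TokenCounterKind::TikToken;
    }

    // Rules 2-3: a huggingface/hf override only says "check HUGGINGFACE_TOKENIZER",
    // and rule 3 checks that same variable anyway, so one test covers both.
    // Assumption: if the variable is unset, selection continues below.
    if let Some(model_id) = var(env, "HUGGINGFACE_TOKENIZER") {
        return TokenCounterKind::HuggingFace { model_id: model_id.to_string() };
    }

    // Rule 4: local tokenizer file for the onnx / fastembed providers.
    if matches!(provider, "onnx" | "fastembed") {
        if let Some(p) = var(env, "EMBEDDING_TOKENIZER_PATH") {
            let path = PathBuf::from(p);
            if file_exists(&path) {
                return TokenCounterKind::HuggingFaceFile { path };
            }
        }
    }

    // Rule 5: OpenAI-style providers use TikToken.
    if matches!(provider, "openai" | "openai_compatible") {
        return TokenCounterKind::TikToken;
    }

    // Rule 6 (ollama + HUGGINGFACE_TOKENIZER) is already covered by the
    // rule 3 check in this sketch. Rule 7: last-resort fallback.
    TokenCounterKind::Word
}

fn main() {
    let env = HashMap::from([("EMBEDDING_PROVIDER", "openai")]);
    let kind = select_kind(&env, |p| p.exists());
    println!("{:?}", kind);
}
```

Passing the variables and the file check as parameters keeps the selection logic deterministic under test; a real from_env() would read the process environment and the filesystem directly.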
pub fn build(self) -> Result<Box<dyn TokenCounter + Send + Sync>, ChunkingError>
Construct a boxed TokenCounter from this kind.
Returns an error if the selected kind cannot be constructed (e.g. file not found,
model download failed). When the relevant Cargo feature is disabled, falls back to
WordCounter and logs a warning instead of returning an error, so the crate compiles
without optional features but users get a visible signal that their configured
tokenizer is inactive.
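The two outcomes described above (an error when construction fails, a logged fallback when support is compiled out) might look roughly like the following. This is a sketch under stated assumptions, not the crate's code: the TokenCounter trait shape, the ChunkingError::TokenizerLoad variant, the Loader alias, and the functions build_file_counter and failing_loader are hypothetical. An Option stands in for the compile-time Cargo feature check, with None meaning "feature disabled", and eprintln! stands in for whatever logging the crate uses.

```rust
use std::path::Path;

/// Assumed trait shape; the real TokenCounter trait may differ.
pub trait TokenCounter {
    fn count_tokens(&self, text: &str) -> usize;
}

/// Whitespace word count, the last-resort fallback.
pub struct WordCounter;

impl TokenCounter for WordCounter {
    fn count_tokens(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Hypothetical error type with a single illustrative variant.
#[derive(Debug, PartialEq)]
pub enum ChunkingError {
    TokenizerLoad(String),
}

pub type BoxedCounter = Box<dyn TokenCounter + Send + Sync>;

/// Hypothetical loader signature; `None` models a disabled Cargo feature.
pub type Loader = fn(&Path) -> Result<BoxedCounter, ChunkingError>;

pub fn build_file_counter(
    path: &Path,
    loader: Option<Loader>,
) -> Result<BoxedCounter, ChunkingError> {
    match loader {
        // Support is available: construction errors are returned to the caller.
        Some(load) => load(path),
        // Support is compiled out: warn and fall back instead of failing.
        None => {
            eprintln!("warning: tokenizer support is disabled; falling back to WordCounter");
            Ok(Box::new(WordCounter))
        }
    }
}

/// A loader that always fails, used only to exercise the error path.
pub fn failing_loader(path: &Path) -> Result<BoxedCounter, ChunkingError> {
    Err(ChunkingError::TokenizerLoad(format!(
        "file not found: {}",
        path.display()
    )))
}

fn main() {
    // Feature "disabled": we still get a usable counter.
    let counter = build_file_counter(Path::new("tokenizer.json"), None)
        .expect("the fallback path does not fail");
    println!("{}", counter.count_tokens("hello token counting world"));

    // Feature "enabled" but loading fails: the error is surfaced.
    if let Err(e) = build_file_counter(Path::new("missing.json"), Some(failing_loader as Loader)) {
        println!("{:?}", e);
    }
}
```

Callers that need exact token counts should not rely on an Ok result alone, since the fallback path also returns Ok; the warning is the only signal that the configured tokenizer is inactive.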
Trait Implementations
impl Clone for TokenCounterKind
fn clone(&self) -> TokenCounterKind
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.