Fast tokenizer using a Trie for longest-match encoding.
This module provides efficient text tokenization using a pre-built vocabulary. The tokenizer uses a singleton pattern with lazy initialization for optimal performance in multithreaded environments.
§Architecture
- Array-based trie: Enables O(1) lookups with direct array indexing
- Singleton pattern: Single global instance shared across threads using OnceLock
- Greedy longest-match: Always selects the longest matching token at each position
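As a rough sketch of how these pieces might fit together (the TrieNode layout, the TOKENIZER static, and the global helper below are illustrative assumptions, not this crate's actual internals):

use std::sync::OnceLock;

// Illustrative trie node: one slot per byte value, so following a child
// is a direct array index rather than a hash or binary-search lookup.
struct TrieNode {
    children: [Option<Box<TrieNode>>; 256],
    // Token index if a vocabulary entry ends exactly at this node.
    token: Option<u32>,
}

// Illustrative singleton: built once on first use, then shared by all threads.
struct SimpleTokenizer {
    root: TrieNode,
}

static TOKENIZER: OnceLock<SimpleTokenizer> = OnceLock::new();

fn global() -> &'static SimpleTokenizer {
    TOKENIZER.get_or_init(|| SimpleTokenizer {
        root: TrieNode { children: std::array::from_fn(|_| None), token: None },
    })
}

fn main() {
    // The first call builds the trie; later calls reuse the same instance.
    let _tokenizer = global();
}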
§Example
use xpatch::tokenizer::{encode, decode, decode_to_string};
// Encode text to tokens
let text = b"Hello world";
let tokens = encode(text)?;
// Decode back to bytes
let decoded = decode(&tokens)?;
assert_eq!(decoded, text);
// Or decode to a string
let string = decode_to_string(&tokens)?;
§Performance Characteristics
- Initialization: O(V*L) where V is vocabulary size, L is average token length
- Encoding: O(n*m) where n is input length, m is the longest token
- Decoding: O(t*l) where t is number of tokens, l is average token length
- Memory: O(256*depth) for the array-based trie structure
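The encoding bound follows from the shape of the greedy loop: at each of the n input positions the encoder probes at most m bytes forward looking for the longest vocabulary hit. A minimal, non-authoritative sketch of that loop, using a plain map in place of the crate's array-based trie (encode_greedy and the sample vocabulary are illustrative, not this crate's API):

use std::collections::HashMap;

// Greedy longest-match over a flat vocabulary map. The real tokenizer walks
// an array-based trie instead, but the loop shape, and hence the O(n*m)
// bound, is the same: n positions, at most m probed bytes per position.
fn encode_greedy(input: &[u8], vocab: &HashMap<Vec<u8>, u32>, max_len: usize) -> Option<Vec<u32>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let end = (pos + max_len).min(input.len());
        // Try the longest candidate first, shrinking until the vocabulary hits.
        let hit = (pos + 1..=end)
            .rev()
            .find_map(|e| vocab.get(&input[pos..e]).map(|&t| (t, e)));
        match hit {
            Some((token, e)) => {
                tokens.push(token);
                pos = e;
            }
            None => return None, // no vocabulary entry covers this byte
        }
    }
    Some(tokens)
}

fn main() {
    let vocab: HashMap<Vec<u8>, u32> =
        [(b"He".to_vec(), 0), (b"llo".to_vec(), 1), (b"l".to_vec(), 2), (b"o".to_vec(), 3)]
            .into_iter()
            .collect();
    // "He" + "llo" wins over the shorter single-byte matches at each position.
    assert_eq!(encode_greedy(b"Hello", &vocab, 3), Some(vec![0, 1]));
}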
Structs§
- SimpleTokenizer - Simple and efficient tokenizer (singleton).
Functions§
- decode - Decode token indices back to UTF-8 bytes using the global tokenizer.
- decode_to_string - Decode token indices to a String using the global tokenizer.
- encode - Encode UTF-8 text to token indices using the global tokenizer.