Module tokenizer

Fast tokenizer using a Trie for longest-match encoding.

This module provides efficient text tokenization using a pre-built vocabulary. The tokenizer is a lazily initialized global singleton, so the trie is built once and then shared read-only across threads, keeping initialization cost and synchronization overhead low in multithreaded environments.
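
The global instance is described below as a OnceLock-backed singleton. A minimal sketch of that pattern, with a placeholder struct body and a hypothetical `global()` accessor rather than the module's real internals, could look like:

use std::sync::OnceLock;

// Placeholder body; the real SimpleTokenizer's fields are not part of this page.
struct SimpleTokenizer { /* trie storage */ }

impl SimpleTokenizer {
    fn build() -> Self {
        // Build the trie from the pre-built vocabulary (omitted here).
        SimpleTokenizer {}
    }
}

// One global instance, created on first use and reused by every thread afterwards.
static TOKENIZER: OnceLock<SimpleTokenizer> = OnceLock::new();

fn global() -> &'static SimpleTokenizer {
    TOKENIZER.get_or_init(SimpleTokenizer::build)
}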

§Architecture

  • Array-based trie: Each node indexes its 256 children directly by byte value, giving O(1) child lookup per input byte
  • Singleton pattern: Single global instance shared across threads using OnceLock
  • Greedy longest-match: Always selects the longest matching token at each position (sketched after this list)
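
A minimal sketch of greedy longest-match encoding over a 256-way array trie is shown below. The node layout, names, and the failure behaviour when no token matches are assumptions for illustration, not the module's actual implementation:

// Hypothetical 256-way trie node: children are indexed directly by byte value.
struct Node {
    children: [Option<Box<Node>>; 256],
    token_id: Option<u32>, // Some(id) if the path from the root to here spells a token
}

fn encode_greedy(root: &Node, input: &[u8]) -> Option<Vec<u32>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let mut node = root;
        let mut best: Option<(usize, u32)> = None; // (match length, token id)
        // Walk the trie as far as the input allows, remembering the last position
        // at which a complete token ended (the longest match so far).
        for (offset, &byte) in input[pos..].iter().enumerate() {
            match node.children[byte as usize].as_deref() {
                Some(child) => {
                    node = child;
                    if let Some(id) = node.token_id {
                        best = Some((offset + 1, id));
                    }
                }
                None => break,
            }
        }
        // Emit the longest token found at `pos`; fail if nothing matched here.
        let (len, id) = best?;
        out.push(id);
        pos += len;
    }
    Some(out)
}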

§Example

use xpatch::tokenizer::{encode, decode, decode_to_string};

// A `main` returning `Result` lets `?` propagate the tokenizer's errors;
// boxing them here assumes the error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Encode text to tokens
    let text = b"Hello world";
    let tokens = encode(text)?;

    // Decode back to bytes
    let decoded = decode(&tokens)?;
    assert_eq!(decoded, text);

    // Or decode to a String
    let string = decode_to_string(&tokens)?;
    assert_eq!(string, "Hello world");

    Ok(())
}

§Performance Characteristics

  • Initialization: O(V*L) where V is vocabulary size, L is average token length
  • Encoding: O(n*m) where n is the input length and m is the length of the longest token (illustrated below this list)
  • Decoding: O(t*l) where t is the number of tokens and l is the average token length
  • Memory: O(256*depth) for the array-based trie structure
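
For example, under the encoding bound above, a 1,024-byte input with a longest token of 16 bytes performs at most 1,024 × 16 = 16,384 trie probes in the worst case.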

Structs§

SimpleTokenizer
Simple and efficient tokenizer (singleton).

Functions§

decode
Decode token indices back to UTF-8 bytes using the global tokenizer.
decode_to_string
Decode token indices to a String using the global tokenizer.
encode
Encode UTF-8 text to token indices using the global tokenizer.