Module tokenizer

Fast tokenizer using a Trie for longest-match encoding.

This module provides efficient text tokenization using a pre-built vocabulary. The tokenizer is a lazily initialized global singleton, so the trie is built once and then shared read-only across threads, keeping initialization cost and synchronization overhead low in multithreaded environments.
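
The global instance is described below as a OnceLock-backed singleton. A minimal sketch of that pattern, with a placeholder struct body and a hypothetical `global()` accessor rather than the module's real internals, could look like:

use std::sync::OnceLock;

// Placeholder body; the real SimpleTokenizer's fields are not part of this page.
struct SimpleTokenizer { /* trie storage */ }

impl SimpleTokenizer {
    fn build() -> Self {
        // Build the trie from the pre-built vocabulary (omitted here).
        SimpleTokenizer {}
    }
}

// One global instance, created on first use and reused by every thread afterwards.
static TOKENIZER: OnceLock<SimpleTokenizer> = OnceLock::new();

fn global() -> &'static SimpleTokenizer {
    TOKENIZER.get_or_init(SimpleTokenizer::build)
}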

§Architecture

  • Array-based trie: Each node indexes its 256 children directly by byte value, giving O(1) child lookup per input byte
  • Singleton pattern: Single global instance shared across threads using OnceLock
  • Greedy longest-match: Always selects the longest matching token at each position (sketched after this list)
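
A minimal sketch of greedy longest-match encoding over a 256-way array trie is shown below. The node layout, names, and the failure behaviour when no token matches are assumptions for illustration, not the module's actual implementation:

// Hypothetical 256-way trie node: children are indexed directly by byte value.
struct Node {
    children: [Option<Box<Node>>; 256],
    token_id: Option<u32>, // Some(id) if the path from the root to here spells a token
}

fn encode_greedy(root: &Node, input: &[u8]) -> Option<Vec<u32>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let mut node = root;
        let mut best: Option<(usize, u32)> = None; // (match length, token id)
        // Walk the trie as far as the input allows, remembering the last position
        // at which a complete token ended (the longest match so far).
        for (offset, &byte) in input[pos..].iter().enumerate() {
            match node.children[byte as usize].as_deref() {
                Some(child) => {
                    node = child;
                    if let Some(id) = node.token_id {
                        best = Some((offset + 1, id));
                    }
                }
                None => break,
            }
        }
        // Emit the longest token found at `pos`; fail if nothing matched here.
        let (len, id) = best?;
        out.push(id);
        pos += len;
    }
    Some(out)
}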

§Example

use xpatch::tokenizer::{encode, decode, decode_to_string};

// A `main` returning `Result` lets `?` propagate the tokenizer's errors;
// boxing them here assumes the error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Encode text to tokens
    let text = b"Hello world";
    let tokens = encode(text)?;

    // Decode back to bytes
    let decoded = decode(&tokens)?;
    assert_eq!(decoded, text);

    // Or decode to a String
    let string = decode_to_string(&tokens)?;
    assert_eq!(string, "Hello world");

    Ok(())
}

§Performance Characteristics

  • Initialization: O(V*L) where V is vocabulary size, L is average token length
  • Encoding: O(n*m) where n is the input length and m is the length of the longest token (illustrated below this list)
  • Decoding: O(t*l) where t is the number of tokens and l is the average token length
  • Memory: O(256*depth) for the array-based trie structure
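
For example, under the encoding bound above, a 1,024-byte input with a longest token of 16 bytes performs at most 1,024 × 16 = 16,384 trie probes in the worst case.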

Structs§

SimpleTokenizer
Simple and efficient tokenizer (singleton).

Functions§

decode
Decode token indices back to UTF-8 bytes using the global tokenizer.
decode_to_string
Decode token indices to a String using the global tokenizer.
encode
Encode UTF-8 text to token indices using the global tokenizer.