Module tokenizer

Subword Tokenization Module (#26)

Just-in-Time tokenization for training pipelines with BPE and WordPiece support. Includes integration with aprender for HuggingFace-compatible tokenizer loading.

§Toyota Principle: Just-in-Time (ジャスト・イン・タイム)

Tokenize on demand during training rather than upfront, which reduces the memory footprint and enables dynamic vocabulary adaptation.
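
The on-demand idea can be sketched with a lazy iterator. This is an illustration only and does not use this module's API: `ToyTokenizer` and `tokenize_on_demand` are hypothetical names, and the whitespace-splitting tokenizer is a stand-in assumed for the example. Each line is encoded only when the consumer pulls it, and the toy vocabulary grows as new words appear.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a real tokenizer (not part of this module):
/// maps whitespace-separated words to integer IDs and grows its
/// vocabulary whenever it meets an unseen word.
struct ToyTokenizer {
    vocab: HashMap<String, u32>,
}

impl ToyTokenizer {
    fn new() -> Self {
        Self { vocab: HashMap::new() }
    }

    fn encode(&mut self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|word| {
                // The next free ID is the current vocabulary size.
                let next_id = self.vocab.len() as u32;
                *self.vocab.entry(word.to_string()).or_insert(next_id)
            })
            .collect()
    }
}

/// Returns a lazy iterator: a line is tokenized only when the caller
/// asks for the next item, so the tokenized corpus is never held in
/// memory all at once.
fn tokenize_on_demand<'a, I>(
    tokenizer: &'a mut ToyTokenizer,
    lines: I,
) -> impl Iterator<Item = Vec<u32>> + 'a
where
    I: Iterator + 'a,
    I::Item: AsRef<str>,
{
    lines.map(move |line| tokenizer.encode(line.as_ref()))
}
```

A training loop would call `next()` on the returned iterator once per step, so the only tokenized data alive at any moment is the current item.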

§Example

use entrenar::tokenizer::{BPETokenizer, Tokenizer, TokenizerConfig};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Create a BPE tokenizer
    let config = TokenizerConfig::bpe().with_vocab_size(1000);
    let mut tokenizer = BPETokenizer::new(config);

    // Train on corpus
    let corpus = vec!["hello world", "hello there"];
    tokenizer.train(&corpus)?;

    // Tokenize text
    let tokens = tokenizer.encode("hello world")?;
    let decoded = tokenizer.decode(&tokens)?;
    Ok(())
}
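
For readers new to BPE, the following sketch shows one training step of the textbook algorithm: count adjacent symbol pairs, pick the most frequent one, and merge it everywhere. It is not taken from `BPETokenizer`'s implementation. The helper names (`to_symbols`, `most_frequent_pair`, `apply_merge`) and the tie-breaking rule are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Splits a word into single-character symbols, the starting point for BPE.
fn to_symbols(word: &str) -> Vec<String> {
    word.chars().map(|c| c.to_string()).collect()
}

/// Counts adjacent symbol pairs across all words and returns the most
/// frequent one. Ties go to the lexicographically smallest pair so the
/// result is deterministic (an assumption of this sketch).
fn most_frequent_pair(words: &[Vec<String>]) -> Option<(String, String)> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts
                .entry((pair[0].clone(), pair[1].clone()))
                .or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
        .map(|(pair, _)| pair)
}

/// Replaces every non-overlapping occurrence of `pair` in `word`
/// (scanning left to right) with the concatenated symbol.
fn apply_merge(word: &[String], pair: &(String, String)) -> Vec<String> {
    let mut out = Vec::with_capacity(word.len());
    let mut i = 0;
    while i < word.len() {
        if i + 1 < word.len() && word[i] == pair.0 && word[i + 1] == pair.1 {
            out.push(format!("{}{}", pair.0, pair.1));
            i += 2;
        } else {
            out.push(word[i].clone());
            i += 1;
        }
    }
    out
}
```

Repeating these two steps until the vocabulary reaches the configured size yields an ordered list of merge rules, which is then replayed in the same order when encoding new text.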

§HuggingFace Integration

Load pre-trained tokenizers from HuggingFace tokenizer.json files:

use entrenar::tokenizer::HfTokenizer;

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Load from HuggingFace tokenizer.json
    let tokenizer = HfTokenizer::from_file("path/to/tokenizer.json")?;
    let tokens = tokenizer.encode("Hello, world!");
    Ok(())
}

Structs§

BPETokenizer: BPE (Byte Pair Encoding) tokenizer
CharTokenizer: Character-level tokenizer (simple baseline)
HfBpeConfig: BPE tokenizer configuration.
HfBpeTokenizer: Byte Pair Encoding tokenizer.
HfTokenizer: HuggingFace-compatible tokenizer wrapper
MergeRule: A BPE merge rule (pair → merged token).
Qwen2BpeTokenizer: Qwen2-specific BPE tokenizer with chat template support.
SpecialTokens: Special tokens
TokenizerConfig: Tokenizer configuration

Enums§

Normalization: Unicode normalization mode applied before byte-level encoding.
TokenizerError: Tokenizer errors
TokenizerType: Tokenizer type

Traits§

Tokenizer: Tokenizer trait

Functions§

bytes_to_unicode: Create byte to Unicode character mapping.
load_hf_from_files: Load tokenizer from vocab.json and merges.txt files.
load_hf_from_json: Load tokenizer from HuggingFace tokenizer.json format.
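
As background for `bytes_to_unicode`, here is a minimal sketch of the byte-to-character table used by GPT-2 style byte-level BPE. It assumes this module follows that convention, which the page does not confirm. The function name `bytes_to_unicode_sketch` and its `HashMap<u8, char>` return type are hypothetical and may differ from the real signature.

```rust
use std::collections::HashMap;

/// Builds a reversible mapping from every byte value to a printable
/// Unicode character, following the GPT-2 convention:
/// - bytes that are already printable (`!`..=`~`, `¡`..=`¬`, `®`..=`ÿ`)
///   map to the code point with the same value;
/// - every other byte maps to U+0100 plus its rank among the
///   non-printable bytes, in ascending byte order.
fn bytes_to_unicode_sketch() -> HashMap<u8, char> {
    let mut map = HashMap::with_capacity(256);
    let mut next_extra = 0u32;
    for b in 0u8..=255 {
        let printable = matches!(b, 33..=126 | 161..=172 | 174..=255);
        let ch = if printable {
            char::from(b)
        } else {
            // Code points from U+0100 upward are valid scalar values here.
            let c = char::from_u32(256 + next_extra).expect("valid code point");
            next_extra += 1;
            c
        };
        map.insert(b, ch);
    }
    map
}
```

Under this convention a space (byte 32) becomes `Ġ` (U+0120) and a newline (byte 10) becomes `Ċ` (U+010A), which is why those characters appear in GPT-2 style vocabulary files.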

Type Aliases§

Result: Result type for tokenizer operations
TokenId: Token ID type
