Module tokenizer

Subword Tokenization Module (#26)

Just-in-Time tokenization for training pipelines with BPE and WordPiece support. Includes integration with aprender for HuggingFace-compatible tokenizer loading.

§Toyota Principle: Just-in-Time (ジャスト・イン・タイム)

Tokenize on demand during training rather than upfront, which reduces the memory footprint and enables dynamic vocabulary adaptation.
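The on-demand idea can be illustrated with a lazy iterator. The following is a minimal, self-contained sketch and does not use this module's API: `toy_encode` and `jit_batches` are hypothetical names, and the "tokenizer" is a stand-in that maps each word to its byte length.

```rust
/// Hypothetical stand-in for a real tokenizer (not part of entrenar):
/// maps each whitespace-separated word to its length in bytes.
pub fn toy_encode(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Returns a lazy iterator over encoded batches. Nothing is tokenized
/// until the consumer (e.g. a training loop) pulls the next batch.
/// `batch_size` must be non-zero, because `chunks` panics on 0.
pub fn jit_batches<'a>(
    corpus: &'a [&'a str],
    batch_size: usize,
) -> impl Iterator<Item = Vec<Vec<u32>>> + 'a {
    corpus.chunks(batch_size).map(|chunk| {
        chunk
            .iter()
            .map(|s| toy_encode(s))
            .collect::<Vec<Vec<u32>>>()
    })
}
```

Because only one batch of token IDs is materialized at a time, peak memory tracks the batch size rather than the corpus size.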

§Example

use entrenar::tokenizer::{BPETokenizer, Tokenizer, TokenizerConfig};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Create a BPE tokenizer
    let config = TokenizerConfig::bpe().with_vocab_size(1000);
    let mut tokenizer = BPETokenizer::new(config);

    // Train on corpus
    let corpus = vec!["hello world", "hello there"];
    tokenizer.train(&corpus)?;

    // Tokenize text into token IDs, then decode the IDs back into a string
    let tokens = tokenizer.encode("hello world")?;
    let decoded = tokenizer.decode(&tokens)?;
    Ok(())
}

§HuggingFace Integration

Load pre-trained tokenizers from HuggingFace tokenizer.json files:

use entrenar::tokenizer::HfTokenizer;

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Load from HuggingFace tokenizer.json
    let tokenizer = HfTokenizer::from_file("path/to/tokenizer.json")?;
    let tokens = tokenizer.encode("Hello, world!");
    Ok(())
}

Structs§

BPETokenizer
BPE (Byte Pair Encoding) tokenizer
CharTokenizer
Character-level tokenizer (simple baseline)
HfBpeConfig
BPE tokenizer configuration.
HfBpeTokenizer
Byte Pair Encoding tokenizer.
HfTokenizer
HuggingFace-compatible tokenizer wrapper
MergeRule
A BPE merge rule (pair → merged token).
Qwen2BpeTokenizer
Qwen2-specific BPE tokenizer with chat template support.
SpecialTokens
Special tokens
TokenizerConfig
Tokenizer configuration
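
To make the role of a merge rule concrete, here is a minimal sketch of applying BPE merges to a single word. It is an illustration under stated assumptions, not this crate's implementation: `ToyMergeRule` and `apply_merges` are hypothetical, and the real `MergeRule` fields may differ. Rules are applied in the order given, which is assumed to be the order in which they were learned.

```rust
/// Hypothetical merge rule: when `left` is immediately followed by
/// `right`, the two symbols are fused into one.
pub struct ToyMergeRule {
    pub left: String,
    pub right: String,
}

/// Splits `word` into characters, then applies each rule in order,
/// scanning left to right and merging every adjacent match.
pub fn apply_merges(word: &str, rules: &[ToyMergeRule]) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    for rule in rules {
        let mut merged = Vec::with_capacity(symbols.len());
        let mut i = 0;
        while i < symbols.len() {
            let is_match = i + 1 < symbols.len()
                && symbols[i] == rule.left
                && symbols[i + 1] == rule.right;
            if is_match {
                merged.push(format!("{}{}", rule.left, rule.right));
                i += 2; // both symbols were consumed by the merge
            } else {
                merged.push(symbols[i].clone());
                i += 1;
            }
        }
        symbols = merged;
    }
    symbols
}
```

With the rules `l+l`, `h+e`, `he+ll` applied in that order, `"hello"` goes from `h e l l o` to `h e ll o`, then `he ll o`, and finally `hell o`.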

Enums§

Normalization
Unicode normalization mode applied before byte-level encoding.
TokenizerError
Tokenizer errors
TokenizerType
Tokenizer type

Traits§

Tokenizer
Tokenizer trait
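
The shape of such a trait can be sketched from the `encode`/`decode` calls in the example above. The following is a hypothetical, self-contained illustration: `ToyTokenizer`, `ToyCharTokenizer`, and `ToyTokenId` are invented names, the error type is simplified to `String`, and the actual `Tokenizer` trait may declare different methods and signatures.

```rust
/// Hypothetical token ID type for this sketch.
pub type ToyTokenId = u32;

/// Hypothetical trait mirroring the encode/decode round trip.
pub trait ToyTokenizer {
    fn encode(&self, text: &str) -> Result<Vec<ToyTokenId>, String>;
    fn decode(&self, ids: &[ToyTokenId]) -> Result<String, String>;
}

/// Character-level baseline: each Unicode scalar value is its own token.
pub struct ToyCharTokenizer;

impl ToyTokenizer for ToyCharTokenizer {
    fn encode(&self, text: &str) -> Result<Vec<ToyTokenId>, String> {
        Ok(text.chars().map(|c| c as u32).collect())
    }

    fn decode(&self, ids: &[ToyTokenId]) -> Result<String, String> {
        // Fails on IDs that are not valid Unicode scalar values.
        ids.iter()
            .map(|&id| char::from_u32(id).ok_or_else(|| format!("invalid token id {id}")))
            .collect()
    }
}
```

A character-level tokenizer needs no training and round-trips any valid string, which is why it is a convenient baseline against subword schemes.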

Functions§

bytes_to_unicode
Create byte to Unicode character mapping.
load_hf_from_files
Load tokenizer from vocab.json and merges.txt files.
load_hf_from_json
Load tokenizer from HuggingFace tokenizer.json format.
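
For context on the byte to Unicode mapping, byte-level BPE tokenizers in the GPT-2 family map every byte to a printable character so that vocabulary files never contain raw control bytes or whitespace. The sketch below reproduces that GPT-2 convention as a standalone function. It is an assumption that `bytes_to_unicode` follows the same scheme, and `gpt2_bytes_to_unicode` is a hypothetical name, not this crate's function.

```rust
use std::collections::HashMap;

/// Builds the GPT-2 style byte-to-character table.
/// Printable Latin-1 bytes map to themselves; every other byte is
/// assigned the next unused code point starting at U+0100.
pub fn gpt2_bytes_to_unicode() -> HashMap<u8, char> {
    let mut map = HashMap::with_capacity(256);
    let mut next_extra = 0u32;
    for b in 0u8..=255 {
        // Ranges '!'..='~', '¡'..='¬', and '®'..='ÿ' are kept as-is.
        let printable = matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        let ch = if printable {
            char::from(b)
        } else {
            let c = char::from_u32(256 + next_extra).expect("valid code point");
            next_extra += 1;
            c
        };
        map.insert(b, ch);
    }
    map
}
```

Under this scheme the space byte (0x20) becomes `Ġ` (U+0120) and the newline byte (0x0A) becomes `Ċ` (U+010A), which is why those characters appear in GPT-2 style vocabularies.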

Type Aliases§

Result
Result type for tokenizer operations
TokenId
Token ID type