Subword Tokenization Module (#26)
Just-in-Time tokenization for training pipelines with BPE and WordPiece support. Includes integration with aprender for HuggingFace-compatible tokenizer loading.
§Toyota Principle: Just-in-Time (ジャスト・イン・タイム)
Tokenize on demand during training rather than upfront, which reduces memory footprint and enables dynamic vocabulary adaptation.
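To make the on-demand idea concrete, here is a minimal, self-contained sketch. It does not use the entrenar API: `ToyTokenizer` and `jit_batches` are hypothetical names invented for illustration, and the toy tokenizer simply assigns ids to whitespace-separated words. The point is only that batches are encoded lazily, when the training loop asks for them, and that the vocabulary can grow while training runs.

```rust
use std::collections::HashMap;

/// Hypothetical toy tokenizer (not part of entrenar): assigns an id to each
/// whitespace-separated word the first time it is seen.
struct ToyTokenizer {
    vocab: HashMap<String, u32>,
}

impl ToyTokenizer {
    fn new() -> Self {
        Self { vocab: HashMap::new() }
    }

    /// Encode one text, growing the vocabulary as new words appear.
    fn encode(&mut self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|word| {
                let next_id = self.vocab.len() as u32;
                *self.vocab.entry(word.to_string()).or_insert(next_id)
            })
            .collect()
    }
}

/// Lazily tokenize `texts` in batches of `batch_size` (must be non-zero).
/// Nothing is encoded until the returned iterator is advanced.
fn jit_batches<'a>(
    tokenizer: &'a mut ToyTokenizer,
    texts: &'a [&'a str],
    batch_size: usize,
) -> impl Iterator<Item = Vec<Vec<u32>>> + 'a {
    texts.chunks(batch_size).map(move |chunk| {
        chunk
            .iter()
            .map(|text| tokenizer.encode(text))
            .collect::<Vec<Vec<u32>>>()
    })
}

fn example() -> Vec<Vec<Vec<u32>>> {
    let corpus = ["hello world", "hello there", "world again"];
    let mut tokenizer = ToyTokenizer::new();
    // In a real training loop each batch would be consumed as it is produced;
    // collecting here is only to make the result easy to inspect.
    let batches: Vec<Vec<Vec<u32>>> = jit_batches(&mut tokenizer, &corpus, 2).collect();
    batches
}
```

Only one batch of token ids needs to be alive at a time when the iterator is consumed inside the loop, which is where the memory saving comes from.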
§Example
use entrenar::tokenizer::{BPETokenizer, Tokenizer, TokenizerConfig};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Create a BPE tokenizer
    let config = TokenizerConfig::bpe().with_vocab_size(1000);
    let mut tokenizer = BPETokenizer::new(config);

    // Train on corpus
    let corpus = vec!["hello world", "hello there"];
    tokenizer.train(&corpus)?;

    // Tokenize text
    let tokens = tokenizer.encode("hello world")?;
    let decoded = tokenizer.decode(&tokens)?;
    Ok(())
}
§HuggingFace Integration
Load pre-trained tokenizers from HuggingFace tokenizer.json files:
use entrenar::tokenizer::HfTokenizer;

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // Load from HuggingFace tokenizer.json
    let tokenizer = HfTokenizer::from_file("path/to/tokenizer.json")?;
    let tokens = tokenizer.encode("Hello, world!");
    Ok(())
}
Structs§
- BPETokenizer
- BPE (Byte Pair Encoding) tokenizer
- CharTokenizer
- Character-level tokenizer (simple baseline)
- HfBpeConfig
- BPE tokenizer configuration.
- HfBpeTokenizer
- Byte Pair Encoding tokenizer.
- HfTokenizer
- HuggingFace-compatible tokenizer wrapper
- MergeRule
- A BPE merge rule (pair → merged token).
- Qwen2BpeTokenizer
- Qwen2-specific BPE tokenizer with chat template support.
- SpecialTokens
- Special tokens
- TokenizerConfig
- Tokenizer configuration
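The MergeRule entry above describes a BPE merge as a pair of symbols mapped to a merged token. As a rough illustration of that idea, the sketch below learns and applies a single merge. It is not the crate's implementation: `ToyMergeRule`, `most_frequent_pair`, `apply_merge`, and `split_chars` are hypothetical names, and the tie-breaking rule (lexicographically smallest pair wins) is an assumption chosen only to keep the result deterministic.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a merge rule: a pair of symbols and the token
/// they merge into.
#[derive(Debug, Clone, PartialEq)]
struct ToyMergeRule {
    pair: (String, String),
    merged: String,
}

/// Split a word into single-character symbols, the starting point for BPE.
fn split_chars(word: &str) -> Vec<String> {
    word.chars().map(|c| c.to_string()).collect()
}

/// Count adjacent symbol pairs and return a rule for the most frequent one.
/// Ties go to the lexicographically smallest pair (an illustrative choice).
fn most_frequent_pair(words: &[Vec<String>]) -> Option<ToyMergeRule> {
    let mut counts: HashMap<(String, String), usize> = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(&a.0)))
        .map(|(pair, _)| ToyMergeRule {
            merged: format!("{}{}", pair.0, pair.1),
            pair,
        })
}

/// Replace every occurrence of the rule's pair in `word` with the merged token.
fn apply_merge(word: &[String], rule: &ToyMergeRule) -> Vec<String> {
    let mut out = Vec::with_capacity(word.len());
    let mut i = 0;
    while i < word.len() {
        if i + 1 < word.len() && word[i] == rule.pair.0 && word[i + 1] == rule.pair.1 {
            out.push(rule.merged.clone());
            i += 2;
        } else {
            out.push(word[i].clone());
            i += 1;
        }
    }
    out
}

fn example() -> Option<Vec<String>> {
    // Words from the corpus "hello world", "hello there".
    let words: Vec<Vec<String>> = ["hello", "world", "hello", "there"]
        .iter()
        .map(|w| split_chars(w))
        .collect();
    let rule = most_frequent_pair(&words)?;
    Some(apply_merge(&words[0], &rule))
}
```

Training a full BPE vocabulary repeats these two steps, re-counting pairs after each merge, until the target vocabulary size is reached.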
Enums§
- Normalization
- Unicode normalization mode applied before byte-level encoding.
- TokenizerError
- Tokenizer errors
- TokenizerType
- Tokenizer type
Traits§
- Tokenizer
- Tokenizer trait
Functions§
- bytes_to_unicode
- Create byte to Unicode character mapping.
- load_hf_from_files
- Load tokenizer from vocab.json and merges.txt files.
- load_hf_from_json
- Load tokenizer from HuggingFace tokenizer.json format.
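For readers unfamiliar with the byte-to-Unicode mapping mentioned for bytes_to_unicode, the sketch below shows the scheme used by GPT-2 style byte-level BPE. That this crate follows the same convention is an assumption, not something the listing confirms, and `bytes_to_unicode_sketch` and `encode_bytes` are hypothetical names. Printable bytes map to themselves, and the remaining bytes are shifted to code points starting at U+0100 so that every byte has a visible, non-whitespace character.

```rust
use std::collections::HashMap;

/// Sketch of a GPT-2 style byte-to-Unicode table (assumed convention).
fn bytes_to_unicode_sketch() -> HashMap<u8, char> {
    let mut map = HashMap::with_capacity(256);
    let mut next_offset: u32 = 0;
    for byte in 0u8..=255 {
        // Bytes that already correspond to printable, non-space characters.
        let printable = matches!(byte, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
        let code_point = if printable {
            u32::from(byte)
        } else {
            // Remaining bytes are assigned 256, 257, ... in ascending byte order.
            let shifted = 256 + next_offset;
            next_offset += 1;
            shifted
        };
        // All code points used here are below U+0144, so this cannot fail.
        map.insert(byte, char::from_u32(code_point).expect("valid code point"));
    }
    map
}

/// Render the UTF-8 bytes of `text` through the table.
fn encode_bytes(text: &str, map: &HashMap<u8, char>) -> String {
    text.bytes().map(|b| map[&b]).collect()
}

fn example() -> String {
    let map = bytes_to_unicode_sketch();
    // The space byte (0x20) becomes 'Ġ' (U+0120) under this scheme.
    encode_bytes("hello world", &map)
}
```

Because the mapping is one-to-one over all 256 byte values, any input can be tokenized without an unknown-token fallback and decoded back to the original bytes.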