tokie — Fast, correct tokenizer for every HuggingFace model.
Supports BPE, WordPiece, SentencePiece, and Unigram. 50x faster than HuggingFace tokenizers, 100% token-accurate.
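As a rough illustration of the byte-pair idea behind the BPE encoders (a toy sketch over a hypothetical merge-rank table, not tokie's implementation), a minimal merge loop might look like:

```rust
use std::collections::HashMap;

/// Toy BPE encode: repeatedly apply the lowest-rank merge from `ranks`
/// until no adjacent pair of pieces has a recorded merge.
/// Illustration only; not tokie's actual algorithm.
fn bpe_encode(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut pieces: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let best = pieces
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                let merged = format!("{}{}", pieces[i], pieces[i + 1]);
                pieces.splice(i..i + 2, [merged]);
            }
            None => return pieces,
        }
    }
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0);
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    println!("{:?}", bpe_encode("low", &ranks)); // ["low"]
}
```

The real encoders (see BytePairEncoder and BacktrackingBytePairEncoder below) avoid this quadratic rescan; the sketch only shows the merge-rank semantics.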
§Quick Start
use tokie::Tokenizer;
let tokenizer = Tokenizer::from_json("tokenizer.json")?;
// Encode returns Encoding with ids, attention_mask, type_ids
let enc = tokenizer.encode("Hello, world!", true);
println!("{:?}", enc.ids);
// Decode back
let text = tokenizer.decode(&enc.ids).unwrap();
// Save/load binary format (~10x smaller, ~5ms load)
tokenizer.to_file("model.tkz")?;
let tokenizer = Tokenizer::from_file("model.tkz")?;

§HuggingFace Hub
Enable the hf feature to load from HuggingFace directly:
tokie = { version = "0.0.4", features = ["hf"] }
let tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;
let tokenizer = Tokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;

§Padding & Truncation
use tokie::{Tokenizer, TruncationParams, PaddingParams, PaddingStrategy};
let mut tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;
tokenizer.enable_truncation(TruncationParams { max_length: 128, ..Default::default() });
tokenizer.enable_padding(PaddingParams {
    strategy: PaddingStrategy::Fixed(128),
    ..Default::default()
});
let results = tokenizer.encode_batch(&["Hello!", "World"], true);
// All results are exactly 128 tokens

§Re-exports
pub use encoder::BacktrackingBytePairEncoder;
pub use encoder::BytePairEncoder;
pub use encoder::EncodeIter;
pub use encoder::Encoder;
pub use encoder::EncoderIter;
pub use encoder::EncoderType;
pub use hf::JsonLoadError;
pub use normalizer::bert_uncased_normalize;
pub use normalizer::clean_text;
pub use normalizer::fnr;
pub use normalizer::metaspace_normalize;
pub use normalizer::strip_accents;
pub use normalizer::FnrFinder;
pub use normalizer::Normalizer;
pub use padding::Encoding;
pub use padding::PaddingParams;
pub use padding::PaddingStrategy;
pub use padding::PaddingDirection;
pub use padding::TruncationParams;
pub use padding::TruncationStrategy;
pub use padding::TruncationDirection;
pub use pretok::Pretok;
pub use pretok::PretokIter;
pub use pretok::PretokType;
pub use pretok::RegexPretok;
§Modules
- diff
- Tokenization diff tool for comparing two encodings.
- encoder
- Token encoders for different tokenization algorithms.
- hf
- HuggingFace tokenizer.json loading support.
- normalizer
- Text normalization for tokenizers.
- padding
- Padding and truncation support for tokenizer output.
- pretok
- Fast pre-tokenization for BPE tokenizers.
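The fixed-length behavior of the padding module (truncate, then pad, as in the Padding & Truncation example above) can be sketched in standalone Rust. This is an illustration of the semantics, not tokie's code, and a pad ID of 0 is an assumption rather than a documented default:

```rust
/// Truncate `ids` to at most `max_len`, then right-pad with `pad_id`
/// until it is exactly `max_len` long.
/// Sketch of fixed-length padding semantics; not tokie's implementation.
fn pad_truncate(mut ids: Vec<u32>, max_len: usize, pad_id: u32) -> Vec<u32> {
    ids.truncate(max_len);       // drop tokens past the limit
    ids.resize(max_len, pad_id); // pad short sequences on the right
    ids
}

fn main() {
    assert_eq!(pad_truncate(vec![5, 6, 7], 5, 0), vec![5, 6, 7, 0, 0]);
    assert_eq!(pad_truncate(vec![1; 200], 128, 0).len(), 128);
}
```

PaddingDirection and TruncationDirection in the real API additionally control which side gets padded or cut; the sketch shows only the right-side case.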
§Structs
- Decoder
- High-level decoder wrapping VocabDecoder (byte lookup) and DecoderType (text post-processing).
- TokenCount
- Lazy token count that supports comparison with usize. Each TokenCount can only be compared once (the iterator is consumed).
- TokenizeIter
- Iterator over tokens from the high-level Tokenizer.
- Tokenizer
- High-level tokenizer combining pre-tokenization, encoding, and decoding.
- VocabDecoder
- Decoder for converting token IDs back to bytes.
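The one-shot comparison behind TokenCount can be illustrated with a generic helper that decides whether an iterator yields more than n items while consuming at most n + 1 of them (a sketch of the technique, not tokie's actual type):

```rust
/// Decide whether `iter` yields more than `n` items, consuming at most
/// `n + 1` of them. Because the iterator is used up, the comparison can
/// happen only once. Illustration only; not tokie's implementation.
fn longer_than<I: Iterator>(iter: I, n: usize) -> bool {
    iter.take(n + 1).count() > n
}

fn main() {
    let tokens = ["Hel", "lo", ",", "world"];
    assert!(longer_than(tokens.iter(), 3));  // 4 tokens > 3
    assert!(!longer_than(tokens.iter(), 4)); // 4 tokens is not > 4
}
```

The payoff is that a "does this text exceed the context window?" check never tokenizes past the limit, which is why TokenCount is lazy rather than an eagerly computed usize.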
§Enums
- DecoderType
- Text-level decoder type, mirroring EncoderType.
- PostProcessor
- Post-processor configuration.
- SerdeError
- Error type for serialization/deserialization.
§Type Aliases
- EncodingPair
- Backward-compatible alias for Encoding.
- TokenId
- Token identifier; corresponds to a position in the vocabulary.