Crate tokie

tokie — Fast, correct tokenizer for every HuggingFace model.

Supports BPE, WordPiece, SentencePiece, and Unigram. 50x faster than HuggingFace tokenizers, 100% token-accurate.

§Quick Start

use tokie::Tokenizer;

let tokenizer = Tokenizer::from_json("tokenizer.json")?;

// Encode returns Encoding with ids, attention_mask, type_ids
let enc = tokenizer.encode("Hello, world!", true);
println!("{:?}", enc.ids);

// Decode back
let text = tokenizer.decode(&enc.ids)?;

// Save/load binary format (~10x smaller, ~5ms load)
tokenizer.to_file("model.tkz")?;
let tokenizer = Tokenizer::from_file("model.tkz")?;

§HuggingFace Hub

Enable the hf feature to load from HuggingFace directly:

# In Cargo.toml:
tokie = { version = "0.0.4", features = ["hf"] }

let tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;
let tokenizer = Tokenizer::from_pretrained("meta-llama/Llama-3.2-1B")?;

§Padding & Truncation

use tokie::{Tokenizer, TruncationParams, PaddingParams, PaddingStrategy};

let mut tokenizer = Tokenizer::from_pretrained("bert-base-uncased")?;
tokenizer.enable_truncation(TruncationParams { max_length: 128, ..Default::default() });
tokenizer.enable_padding(PaddingParams {
    strategy: PaddingStrategy::Fixed(128),
    ..Default::default()
});

let results = tokenizer.encode_batch(&["Hello!", "World"], true);
// All results are exactly 128 tokens
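Fixed-length padding and truncation together guarantee that every encoding in the batch has exactly the same length. A minimal sketch of that invariant in plain Rust (the `pad_id` of 0 and right-side padding are illustrative assumptions, not tokie's API):

```rust
// Sketch: what Fixed(128)-style padding plus truncation guarantee per sequence.
// `pad_id` and right-side padding are illustrative assumptions, not tokie API.
fn pad_or_truncate(mut ids: Vec<u32>, len: usize, pad_id: u32) -> Vec<u32> {
    ids.truncate(len);       // truncation: drop tokens past `len`
    ids.resize(len, pad_id); // padding: fill up to `len` with `pad_id`
    ids
}

fn main() {
    let short = pad_or_truncate(vec![101, 7592, 102], 8, 0);
    let long = pad_or_truncate((0..20).collect(), 8, 0);
    assert_eq!(short, vec![101, 7592, 102, 0, 0, 0, 0, 0]);
    assert_eq!(long.len(), 8); // both come out exactly 8 tokens
}
```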

Re-exports§

pub use encoder::BacktrackingBytePairEncoder;
pub use encoder::BytePairEncoder;
pub use encoder::EncodeIter;
pub use encoder::Encoder;
pub use encoder::EncoderIter;
pub use encoder::EncoderType;
pub use hf::JsonLoadError;
pub use normalizer::bert_uncased_normalize;
pub use normalizer::clean_text;
pub use normalizer::fnr;
pub use normalizer::metaspace_normalize;
pub use normalizer::strip_accents;
pub use normalizer::FnrFinder;
pub use normalizer::Normalizer;
pub use padding::Encoding;
pub use padding::PaddingParams;
pub use padding::PaddingStrategy;
pub use padding::PaddingDirection;
pub use padding::TruncationParams;
pub use padding::TruncationStrategy;
pub use padding::TruncationDirection;
pub use pretok::Pretok;
pub use pretok::PretokIter;
pub use pretok::PretokType;
pub use pretok::RegexPretok;

Modules§

diff
Tokenization diff tool for comparing two encodings.
encoder
Token encoders for different tokenization algorithms.
hf
HuggingFace tokenizer.json loading support.
normalizer
Text normalization for tokenizers.
padding
Padding and truncation support for tokenizer output.
pretok
Fast pre-tokenization for BPE tokenizers.

Structs§

Decoder
High-level decoder wrapping VocabDecoder (byte lookup) and DecoderType (text post-processing).
TokenCount
Lazy token count that supports comparison with usize. Each TokenCount can only be compared once (the iterator is consumed).
TokenizeIter
Iterator over tokens from the high-level Tokenizer.
Tokenizer
High-level tokenizer combining pre-tokenization, encoding, and decoding.
VocabDecoder
VocabDecoder for converting token IDs back to bytes.
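The documented TokenCount semantics (lazy, comparable with usize, consumable exactly once) can be mimicked with a one-shot wrapper around an iterator; the sketch below is illustrative only, not tokie's implementation:

```rust
// Sketch of a one-shot lazy count, mirroring the documented TokenCount
// semantics (illustrative, not tokie's implementation).
struct LazyCount<I: Iterator>(Option<I>);

impl<I: Iterator> LazyCount<I> {
    // "count >= n" without counting everything: consume at most `n` items.
    // The wrapped iterator is spent afterwards, so each value compares once.
    fn at_least(&mut self, n: usize) -> bool {
        let it = self.0.take().expect("LazyCount already compared");
        it.take(n).count() == n
    }
}

fn main() {
    let mut ten = LazyCount(Some(0..10));
    assert!(ten.at_least(5)); // stops after 5 items, never counts all 10
    let mut three = LazyCount(Some(0..3));
    assert!(!three.at_least(5));
}
```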

Enums§

DecoderType
Text-level decoder type, mirroring EncoderType.
PostProcessor
Post-processor configuration.
SerdeError
Error type for serialization/deserialization.

Type Aliases§

EncodingPair
Backward-compatible alias for Encoding.
TokenId
Token identifier; corresponds to a position in the vocabulary.
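Since a TokenId is a position in the vocabulary, decoding a single token reduces to an index lookup. A minimal sketch under that assumption (the `u32` alias and slice-of-strings vocabulary here are illustrative, not tokie's actual representation):

```rust
// Illustrative only: assumes TokenId is an integer index into the vocabulary.
type TokenId = u32;

// Out-of-range ids yield None rather than panicking.
fn lookup<'a>(vocab: &[&'a str], id: TokenId) -> Option<&'a str> {
    vocab.get(id as usize).copied()
}

fn main() {
    let vocab = ["[PAD]", "hello", "world"];
    assert_eq!(lookup(&vocab, 1), Some("hello"));
    assert_eq!(lookup(&vocab, 9), None);
}
```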