§oxibonsai-tokenizer
Pure Rust BPE tokenizer for OxiBonsai — MeCrab-compatible, WASM-safe.
This crate is a production-ready BPE implementation that can load
HuggingFace `tokenizer.json` files (Qwen3, Llama-3, Mistral, Gemma, …)
directly without pulling in the `tokenizers` crate. Features:

- `OxiTokenizer` — high-level encode/decode API
- `Vocabulary` — bidirectional token ↔ ID mapping with special-token support
- `BpeMerges` — ordered BPE merge table
- `bpe_encode` / `pretokenize` — core BPE primitives
- `byte_fallback_id` — `<0xHH>` byte-fallback helper (see the sketch just after this list)
- `TokenizerError` / `TokenizerResult` — error types
- `hf_format::HfTokenizerJson` — HuggingFace `tokenizer.json` parser
- `streaming::StreamingDecoder` — UTF-8-safe streaming decoder
- `chat_templates::ChatTemplateKind` — canned templates for ChatML, Llama-3, Mistral, Gemma, and Qwen
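The `<0xHH>` convention is worth a one-line illustration: when a piece of text has no vocab entry, each of its raw bytes falls back to a token named after the byte's hex value. A minimal sketch of the naming scheme (not the crate's internal implementation):

```rust
// Byte fallback names each raw byte as a literal "<0xHH>" token,
// e.g. 0xE4 (the first UTF-8 byte of '你') becomes "<0xE4>".
fn fallback_token(byte: u8) -> String {
    format!("<0x{byte:02X}>")
}

assert_eq!(fallback_token(0xE4), "<0xE4>");
```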
§Quick start (character-level mode — no trained vocab required)
```rust
use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::char_level_stub(256);
let ids = tok.encode("Hello!").expect("encode should succeed");
assert!(!ids.is_empty());
```
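The stub decodes as well; a minimal sketch, assuming character-level mode round-trips ASCII input exactly:

```rust
use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::char_level_stub(256);
let ids = tok.encode("Hello!").expect("encode should succeed");
// Assumption: char-level mode maps each character to one ID and
// back, so decoding reproduces the ASCII input unchanged.
let text = tok.decode(&ids).expect("decode should succeed");
assert_eq!(text, "Hello!");
```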
§Loading from JSON vocab + merges

```rust
use oxibonsai_tokenizer::{OxiTokenizer, TokenizerConfig};

let vocab_json = r#"{"a":10,"b":11,"ab":20,"<unk>":0,"<bos>":1,"<eos>":2,"<pad>":3}"#;
let merges_json = r#"[["a","b"]]"#;
let tok = OxiTokenizer::from_json(vocab_json, merges_json, TokenizerConfig::default())
    .expect("loading should succeed");
assert_eq!(tok.vocab_size(), 7);
```
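With that merge table loaded, adjacent "a" and "b" should collapse into the single "ab" token. A sketch assuming standard BPE merge semantics and that `TokenizerConfig::default()` inserts no special tokens:

```rust
use oxibonsai_tokenizer::{OxiTokenizer, TokenizerConfig};

let vocab_json = r#"{"a":10,"b":11,"ab":20,"<unk>":0,"<bos>":1,"<eos>":2,"<pad>":3}"#;
let merges_json = r#"[["a","b"]]"#;
let tok = OxiTokenizer::from_json(vocab_json, merges_json, TokenizerConfig::default())
    .expect("loading should succeed");

// The ["a","b"] merge rule joins the two base tokens into the
// merged "ab" token, so "ab" encodes to one ID, not two.
let ids = tok.encode("ab").expect("encode should succeed");
assert_eq!(ids.len(), 1);
```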
§Loading from a HuggingFace `tokenizer.json`

```rust
use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::from_json_file("tokenizer.json")
    .expect("HF tokenizer should load");
let ids = tok.encode("Hello!").expect("encode");
let text = tok.decode(&ids).expect("decode");
assert_eq!(text, "Hello!");
```
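For token-by-token generation, `streaming::StreamingDecoder` emits text only once complete UTF-8 sequences are available, so a multi-byte character split across token boundaries never surfaces as a replacement character. A hypothetical sketch of the flow (the `new`, `feed`, and `finish` names are assumptions, not the confirmed API; see the `streaming` module for the real one):

```rust
use oxibonsai_tokenizer::{OxiTokenizer, StreamingDecoder};

let tok = OxiTokenizer::from_json_file("tokenizer.json")
    .expect("HF tokenizer should load");
// Hypothetical API: constructor and method names are assumed.
let mut dec = StreamingDecoder::new(&tok);
for id in tok.encode("héllo").expect("encode") {
    // feed() would return only the completed UTF-8 prefix, holding
    // back trailing bytes of an unfinished code point.
    print!("{}", dec.feed(id));
}
// finish() would flush whatever bytes are still buffered.
print!("{}", dec.finish());
```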
§Re-exports

```rust
pub use bpe::bpe_encode;
pub use bpe::byte_fallback_id;
pub use bpe::pretokenize;
pub use bpe::BpeMerges;
pub use chat_templates::ChatMessage;
pub use chat_templates::ChatTemplateKind;
pub use error::TokenizerError;
pub use error::TokenizerResult;
pub use hf_format::byte_to_unicode;
pub use hf_format::bytes_to_unicode_map;
pub use hf_format::unicode_to_byte;
pub use hf_format::HfModelType;
pub use hf_format::HfTokenizerJson;
pub use serialization::base64_decode;
pub use serialization::base64_encode;
pub use serialization::SerializationError;
pub use serialization::TokenizerState;
pub use serialization::FORMAT_MAGIC;
pub use streaming::StreamingDecoder;
pub use tokenizer::OxiTokenizer;
pub use tokenizer::TokenizerConfig;
pub use trainer::BpeTrainer;
pub use trainer::MergeRule;
pub use trainer::SymbolPair;
pub use trainer::TrainedTokenizer;
pub use trainer::TrainerConfig;
pub use trainer::TrainerError;
pub use trainer::TrainingStats;
pub use unigram::UnigramError;
pub use unigram::UnigramVocab;
pub use vocab::Vocabulary;
pub use wordpiece::WordPieceError;
pub use wordpiece::WordPieceVocab;
pub use wordpiece::WORDPIECE_CONTINUATION_PREFIX;
```
§Modules

- `bpe` — Byte-Pair Encoding (BPE) merge table and encoding routines.
- `chat_templates` — Canned chat-template registry covering the five major open-weight instruction-tuned families (see the sketch after this list).
- `error` — Error types for the OxiBonsai tokenizer.
- `hf_format` — HuggingFace `tokenizer.json` format parser (BPE, Unigram, and WordPiece models).
- `serialization` — Tokenizer serialization: save and load tokenizer state to/from text files.
- `streaming` — UTF-8-safe streaming decoder.
- `tests` — Integration tests for `oxibonsai-tokenizer`.
- `tokenizer` — High-level OxiBonsai tokenizer: BPE + Unigram + WordPiece + char-level fallback.
- `trainer` — BPE tokenizer trainer: learn merge rules from a text corpus.
- `unigram` — Viterbi-based Unigram tokenizer for OxiBonsai.
- `utils` — Tokenizer utilities: normalization, special token handling, chat templates.
- `vocab` — Vocabulary management for the OxiBonsai tokenizer.
- `wordpiece` — WordPiece tokenizer: greedy longest-match segmentation for BERT/RoBERTa/DeBERTa.
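To illustrate how the chat-template pieces fit together, here is a hypothetical sketch: the `ChatMessage` field names, the `ChatMl` variant, and the `render` method are assumptions, not the confirmed API, so consult the `chat_templates` module for the real one.

```rust
use oxibonsai_tokenizer::{ChatMessage, ChatTemplateKind};

// Hypothetical sketch: field names, the `ChatMl` variant, and
// `render` are assumed; check the chat_templates module docs.
let messages = vec![ChatMessage {
    role: "user".to_string(),
    content: "Hello!".to_string(),
}];
let prompt = ChatTemplateKind::ChatMl.render(&messages);
// For ChatML this would produce something like:
// "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
println!("{prompt}");
```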