
Crate oxibonsai_tokenizer


§oxibonsai-tokenizer

Pure Rust BPE tokenizer for OxiBonsai — MeCrab-compatible, WASM-safe.

This crate is a production-ready BPE implementation that loads HuggingFace tokenizer.json files (Qwen3, Llama-3, Mistral, Gemma, …) directly, without pulling in the tokenizers crate.

§Quick start (character-level mode — no trained vocab required)

use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::char_level_stub(256);
let ids = tok.encode("Hello!").expect("encode should succeed");
assert!(!ids.is_empty());

§Loading from JSON vocab + merges

use oxibonsai_tokenizer::{OxiTokenizer, TokenizerConfig};

let vocab_json = r#"{"a":10,"b":11,"ab":20,"<unk>":0,"<bos>":1,"<eos>":2,"<pad>":3}"#;
let merges_json = r#"[["a","b"]]"#;
let tok = OxiTokenizer::from_json(vocab_json, merges_json, TokenizerConfig::default())
    .expect("loading should succeed");
assert_eq!(tok.vocab_size(), 7);

§Loading from a HuggingFace tokenizer.json

use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::from_json_file("tokenizer.json")
    .expect("HF tokenizer should load");
let ids = tok.encode("Hello!").expect("encode");
let text = tok.decode(&ids).expect("decode");
assert_eq!(text, "Hello!");
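For intuition about what the re-exported `bpe_encode` does under the hood, here is a minimal, self-contained sketch of the classic BPE merge loop: repeatedly merge the adjacent symbol pair with the best (lowest) merge rank until no learned merge applies. All names here (`bpe_merge`, `ranks`) are hypothetical illustrations, not this crate's API, and the crate's real implementation may differ (e.g. byte-level pretokenization, priority queues).

```rust
use std::collections::HashMap;

/// Illustrative BPE merge loop (not the crate's actual `bpe_encode`).
/// `ranks` maps an adjacent symbol pair to its merge priority (lower = earlier).
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters (real byte-level BPE starts from bytes).
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest (best) rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(symbols[i].clone(), symbols[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge symbols[i] and symbols[i + 1] into one symbol.
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..i + 2, [merged]);
            }
            None => break, // no learned merge applies anymore
        }
    }
    symbols
}

fn main() {
    // Hypothetical merge table: "a"+"b" learned first, then "ab"+"c".
    let mut ranks = HashMap::new();
    ranks.insert(("a".to_string(), "b".to_string()), 0);
    ranks.insert(("ab".to_string(), "c".to_string()), 1);
    let out = bpe_merge("abcab", &ranks);
    println!("{:?}", out); // ["abc", "ab"]
}
```

In a real tokenizer each resulting symbol is then looked up in the vocabulary to produce token ids, with byte fallback (cf. `byte_fallback_id`) for symbols not in the vocab.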

Re-exports§

pub use bpe::bpe_encode;
pub use bpe::byte_fallback_id;
pub use bpe::pretokenize;
pub use bpe::BpeMerges;
pub use chat_templates::ChatMessage;
pub use chat_templates::ChatTemplateKind;
pub use error::TokenizerError;
pub use error::TokenizerResult;
pub use hf_format::byte_to_unicode;
pub use hf_format::bytes_to_unicode_map;
pub use hf_format::unicode_to_byte;
pub use hf_format::HfModelType;
pub use hf_format::HfTokenizerJson;
pub use serialization::base64_decode;
pub use serialization::base64_encode;
pub use serialization::SerializationError;
pub use serialization::TokenizerState;
pub use serialization::FORMAT_MAGIC;
pub use streaming::StreamingDecoder;
pub use tokenizer::OxiTokenizer;
pub use tokenizer::TokenizerConfig;
pub use trainer::BpeTrainer;
pub use trainer::MergeRule;
pub use trainer::SymbolPair;
pub use trainer::TrainedTokenizer;
pub use trainer::TrainerConfig;
pub use trainer::TrainerError;
pub use trainer::TrainingStats;
pub use unigram::UnigramError;
pub use unigram::UnigramVocab;
pub use vocab::Vocabulary;
pub use wordpiece::WordPieceError;
pub use wordpiece::WordPieceVocab;
pub use wordpiece::WORDPIECE_CONTINUATION_PREFIX;

Modules§

bpe
Byte-Pair Encoding (BPE) merge table and encoding routines.
chat_templates
Canned chat-template registry covering the five major open-weight instruction-tuned families.
error
Error types for the OxiBonsai tokenizer.
hf_format
HuggingFace tokenizer.json format parser (BPE, Unigram, and WordPiece models).
serialization
Tokenizer serialization: save and load tokenizer state to/from text files.
streaming
UTF-8-safe streaming decoder.
tests
Integration tests for oxibonsai-tokenizer.
tokenizer
High-level OxiBonsai tokenizer: BPE + Unigram + WordPiece + char-level fallback.
trainer
BPE tokenizer trainer: learn merge rules from a text corpus.
unigram
Viterbi-based Unigram tokenizer for OxiBonsai.
utils
Tokenizer utilities: normalization, special token handling, chat templates.
vocab
Vocabulary management for the OxiBonsai tokenizer.
wordpiece
WordPiece tokenizer — greedy longest-match segmentation for BERT/RoBERTa/DeBERTa.
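The streaming module's "UTF-8-safe" property matters because token boundaries can fall inside a multi-byte UTF-8 character, so a naive per-token decode would emit replacement characters. The following self-contained sketch shows the underlying technique of buffering an incomplete trailing sequence; the type and method names (`Utf8Stream`, `feed`) are hypothetical and not this crate's `StreamingDecoder` API.

```rust
/// Illustrative UTF-8-safe streaming decoder: emits only complete UTF-8
/// sequences, holding back a trailing partial character until more bytes
/// arrive. Assumes the input is well-formed UTF-8 split at arbitrary points.
struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed a chunk of bytes; returns the longest decodable prefix.
    fn feed(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_string();
                self.pending.clear();
                out
            }
            Err(e) => {
                // valid_up_to() marks the end of the last complete character;
                // keep the partial tail buffered for the next chunk.
                let valid = e.valid_up_to();
                let out = String::from_utf8(self.pending[..valid].to_vec())
                    .expect("prefix is valid by construction");
                self.pending.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let bytes = "héllo".as_bytes(); // 'é' is two bytes: 0xC3 0xA9
    let mut dec = Utf8Stream::new();
    // Split mid-character: the first chunk ends inside 'é'.
    let a = dec.feed(&bytes[..2]); // "h"; the 0xC3 byte stays buffered
    let b = dec.feed(&bytes[2..]); // "éllo"
    assert_eq!(a, "h");
    assert_eq!(b, "éllo");
}
```

A production decoder additionally has to cope with genuinely invalid byte sequences (e.g. by emitting U+FFFD after a bounded wait) rather than buffering them forever; this sketch omits that.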