
Crate oxibonsai_tokenizer


§oxibonsai-tokenizer

Pure Rust BPE tokenizer for OxiBonsai — MeCrab-compatible, WASM-safe.

This crate is a production-ready BPE implementation that loads HuggingFace tokenizer.json files (Qwen3, Llama-3, Mistral, Gemma, …) directly, without pulling in the tokenizers crate.

§Quick start (character-level mode — no trained vocab required)

use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::char_level_stub(256);
let ids = tok.encode("Hello!").expect("encode should succeed");
assert!(!ids.is_empty());

§Loading from JSON vocab + merges

use oxibonsai_tokenizer::{OxiTokenizer, TokenizerConfig};

let vocab_json = r#"{"a":10,"b":11,"ab":20,"<unk>":0,"<bos>":1,"<eos>":2,"<pad>":3}"#;
let merges_json = r#"[["a","b"]]"#;
let tok = OxiTokenizer::from_json(vocab_json, merges_json, TokenizerConfig::default())
    .expect("loading should succeed");
assert_eq!(tok.vocab_size(), 7);

§Loading from a HuggingFace tokenizer.json

use oxibonsai_tokenizer::OxiTokenizer;

let tok = OxiTokenizer::from_json_file("tokenizer.json")
    .expect("HF tokenizer should load");
let ids = tok.encode("Hello!").expect("encode");
let text = tok.decode(&ids).expect("decode");
assert_eq!(text, "Hello!");
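For intuition about what the re-exported `bpe_encode` does under the hood, here is a minimal, self-contained sketch of the classic BPE merge loop: repeatedly merge the adjacent symbol pair with the best (lowest) merge rank until no learned merge applies. All names here (`bpe_merge`, `ranks`) are hypothetical illustrations, not this crate's API, and the crate's real implementation may differ (e.g. byte-level pretokenization, priority queues).

```rust
use std::collections::HashMap;

/// Illustrative BPE merge loop (not the crate's actual `bpe_encode`).
/// `ranks` maps an adjacent symbol pair to its merge priority (lower = earlier).
fn bpe_merge(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from individual characters (real byte-level BPE starts from bytes).
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest (best) rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            if let Some(&r) = ranks.get(&(symbols[i].clone(), symbols[i + 1].clone())) {
                if best.map_or(true, |(br, _)| r < br) {
                    best = Some((r, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge symbols[i] and symbols[i + 1] into one symbol.
                let merged = format!("{}{}", symbols[i], symbols[i + 1]);
                symbols.splice(i..i + 2, [merged]);
            }
            None => break, // no learned merge applies anymore
        }
    }
    symbols
}

fn main() {
    // Hypothetical merge table: "a"+"b" learned first, then "ab"+"c".
    let mut ranks = HashMap::new();
    ranks.insert(("a".to_string(), "b".to_string()), 0);
    ranks.insert(("ab".to_string(), "c".to_string()), 1);
    let out = bpe_merge("abcab", &ranks);
    println!("{:?}", out); // ["abc", "ab"]
}
```

In a real tokenizer each resulting symbol is then looked up in the vocabulary to produce token ids, with byte fallback (cf. `byte_fallback_id`) for symbols not in the vocab.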

Re-exports§

pub use bpe::bpe_encode;
pub use bpe::byte_fallback_id;
pub use bpe::pretokenize;
pub use bpe::BpeMerges;
pub use chat_templates::ChatMessage;
pub use chat_templates::ChatTemplateKind;
pub use error::TokenizerError;
pub use error::TokenizerResult;
pub use hf_format::byte_to_unicode;
pub use hf_format::bytes_to_unicode_map;
pub use hf_format::unicode_to_byte;
pub use hf_format::HfModelType;
pub use hf_format::HfTokenizerJson;
pub use serialization::base64_decode;
pub use serialization::base64_encode;
pub use serialization::SerializationError;
pub use serialization::TokenizerState;
pub use serialization::FORMAT_MAGIC;
pub use streaming::StreamingDecoder;
pub use tokenizer::OxiTokenizer;
pub use tokenizer::TokenizerConfig;
pub use trainer::BpeTrainer;
pub use trainer::MergeRule;
pub use trainer::SymbolPair;
pub use trainer::TrainedTokenizer;
pub use trainer::TrainerConfig;
pub use trainer::TrainerError;
pub use trainer::TrainingStats;
pub use unigram::UnigramError;
pub use unigram::UnigramVocab;
pub use vocab::Vocabulary;
pub use wordpiece::WordPieceError;
pub use wordpiece::WordPieceVocab;
pub use wordpiece::WORDPIECE_CONTINUATION_PREFIX;

Modules§

bpe
Byte-Pair Encoding (BPE) merge table and encoding routines.
chat_templates
Canned chat-template registry covering the five major open-weight instruction-tuned families.
error
Error types for the OxiBonsai tokenizer.
hf_format
HuggingFace tokenizer.json format parser (BPE, Unigram, and WordPiece models).
serialization
Tokenizer serialization: save and load tokenizer state to/from text files.
streaming
UTF-8-safe streaming decoder.
tests
Integration tests for oxibonsai-tokenizer.
tokenizer
High-level OxiBonsai tokenizer: BPE + Unigram + WordPiece + char-level fallback.
trainer
BPE tokenizer trainer: learn merge rules from a text corpus.
unigram
Viterbi-based Unigram tokenizer for OxiBonsai.
utils
Tokenizer utilities: normalization, special token handling, chat templates.
vocab
Vocabulary management for the OxiBonsai tokenizer.
wordpiece
WordPiece tokenizer — greedy longest-match segmentation for BERT/RoBERTa/DeBERTa.
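The streaming module's "UTF-8-safe" property matters because token boundaries can fall inside a multi-byte UTF-8 character, so a naive per-token decode would emit replacement characters. The following self-contained sketch shows the underlying technique of buffering an incomplete trailing sequence; the type and method names (`Utf8Stream`, `feed`) are hypothetical and not this crate's `StreamingDecoder` API.

```rust
/// Illustrative UTF-8-safe streaming decoder: emits only complete UTF-8
/// sequences, holding back a trailing partial character until more bytes
/// arrive. Assumes the input is well-formed UTF-8 split at arbitrary points.
struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed a chunk of bytes; returns the longest decodable prefix.
    fn feed(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        match std::str::from_utf8(&self.pending) {
            Ok(s) => {
                let out = s.to_string();
                self.pending.clear();
                out
            }
            Err(e) => {
                // valid_up_to() marks the end of the last complete character;
                // keep the partial tail buffered for the next chunk.
                let valid = e.valid_up_to();
                let out = String::from_utf8(self.pending[..valid].to_vec())
                    .expect("prefix is valid by construction");
                self.pending.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let bytes = "héllo".as_bytes(); // 'é' is two bytes: 0xC3 0xA9
    let mut dec = Utf8Stream::new();
    // Split mid-character: the first chunk ends inside 'é'.
    let a = dec.feed(&bytes[..2]); // "h"; the 0xC3 byte stays buffered
    let b = dec.feed(&bytes[2..]); // "éllo"
    assert_eq!(a, "h");
    assert_eq!(b, "éllo");
}
```

A production decoder additionally has to cope with genuinely invalid byte sequences (e.g. by emitting U+FFFD after a bounded wait) rather than buffering them forever; this sketch omits that.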