oxibonsai-tokenizer
Pure Rust BPE tokenizer for OxiBonsai — WASM-safe, zero FFI.
Implements a byte-pair encoding (BPE) tokenizer with vocabulary management, BPE merge rules, ChatTemplate formatting (chatml), byte-fallback encoding, JSON serialization, and a BPE trainer for building new vocabularies.
Part of the OxiBonsai project.
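To make the role of the merge rules concrete, here is a minimal, self-contained sketch of a generic BPE merge loop with priority lookup. It does not use this crate's API: `MergeRanks`, `build_ranks`, and `apply_merges` are hypothetical names invented for illustration, and the crate's actual `BpeMerges` type may be organized differently. The sketch assumes that merges learned earlier get a lower rank and are applied first.

```rust
use std::collections::HashMap;

/// Hypothetical merge table: maps an adjacent symbol pair to its rank.
/// A lower rank means the merge was learned earlier and has priority.
type MergeRanks = HashMap<(String, String), usize>;

/// Build the rank table from merge rules listed in priority order.
fn build_ranks(merges: &[(&str, &str)]) -> MergeRanks {
    merges
        .iter()
        .enumerate()
        .map(|(rank, &(a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

/// Repeatedly merge the adjacent pair with the lowest rank until no
/// pair in the sequence appears in the merge table.
fn apply_merges(mut symbols: Vec<String>, ranks: &MergeRanks) -> Vec<String> {
    loop {
        // Scan for the best (lowest-rank) mergeable pair: (rank, index).
        let mut best: Option<(usize, usize)> = None;
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // Stop when nothing is left to merge.
        let Some((_, i)) = best else { break };
        // Replace the pair at positions i and i + 1 with the merged symbol.
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols[i] = merged;
        symbols.remove(i + 1);
    }
    symbols
}
```

With the rules `("l", "o")`, `("lo", "w")`, `("e", "r")` in that priority order, the symbols `l o w e r` collapse to `low er`. This version rescans the whole sequence after every merge, which keeps the sketch short at the cost of quadratic time on long inputs.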
Features
- OxiTokenizer: encode, decode, batch encode/decode
- Vocabulary: bidirectional token <-> id mapping, special token support
- BpeMerges: merge rule table with priority lookup
- ChatTemplate: chatml-style prompt formatting
- Byte-fallback encoding for out-of-vocabulary bytes
- TokenizerSerializer: save_json / load_json roundtrip
- BpeTrainer / TrainerConfig: build vocabularies from text corpora
- WASM-safe: no C/FFI dependencies
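As an illustration of the byte-fallback idea in the list above, the following sketch encodes text against a toy vocabulary and falls back to one id per UTF-8 byte for anything the vocabulary does not contain. It is not this crate's implementation: the function name `encode_with_byte_fallback` and the id layout (byte `b` maps to `byte_base + b`) are assumptions made for the example, and the crate may represent byte tokens differently.

```rust
use std::collections::HashMap;

/// Encode `text` one character at a time against a toy vocabulary.
/// Characters missing from `vocab` are emitted as one id per UTF-8 byte,
/// using a hypothetical layout where byte `b` has id `byte_base + b`.
fn encode_with_byte_fallback(
    text: &str,
    vocab: &HashMap<String, u32>,
    byte_base: u32,
) -> Vec<u32> {
    let mut ids = Vec::new();
    let mut buf = [0u8; 4]; // large enough for any UTF-8 encoded char
    for ch in text.chars() {
        let s: &str = ch.encode_utf8(&mut buf);
        match vocab.get(s) {
            // Known token: emit its vocabulary id.
            Some(&id) => ids.push(id),
            // Unknown token: emit one fallback id per raw byte.
            None => ids.extend(s.bytes().map(|b| byte_base + u32::from(b))),
        }
    }
    ids
}
```

Because every possible byte value has an id, no input is ever unencodable; the trade-off is that out-of-vocabulary characters cost several tokens each (two for `é`, whose UTF-8 encoding is the bytes 0xC3 0xA9).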
Usage
[dependencies]
oxibonsai-tokenizer = "0.1.0"
use oxibonsai_tokenizer::OxiTokenizer;
let tokenizer = load?;
let ids = tokenizer.encode?;
let text = tokenizer.decode?;
License
Apache-2.0 — COOLJAPAN OU