# oxibonsai-tokenizer
Version: 0.1.1 · Status: Alpha · Tests: 85 passing
Pure Rust BPE tokenizer for OxiBonsai — WASM-safe, zero FFI.
Implements a byte-pair encoding (BPE) tokenizer with vocabulary management, BPE merge rules, ChatTemplate formatting (chatml), byte-fallback encoding, JSON serialization, and a BPE trainer for building new vocabularies.
Functional and in active use within OxiBonsai, but still maturing relative to the HuggingFace `tokenizers` crate (hence the Alpha status).
Part of the OxiBonsai project.
## Features
- `OxiTokenizer` — encode, decode, batch encode/decode
- `Vocabulary` — bidirectional token <-> id mapping, special token support
- `BpeMerges` — merge rule table with priority lookup
- `ChatTemplate` — chatml-style prompt formatting
- Byte-fallback encoding for out-of-vocabulary bytes
- `TokenizerSerializer` — `save_json`/`load_json` roundtrip
- `BpeTrainer`/`TrainerConfig` — build vocabularies from text corpora
- WASM-safe: no C/FFI dependencies
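To illustrate what "merge rule table with priority lookup" means, here is a minimal, self-contained sketch of the core BPE merge loop: at each step the adjacent token pair with the best (lowest) rank in the merge table is merged, until no mergeable pair remains. This is the general algorithm, not the crate's actual `BpeMerges` internals; all names below are illustrative.

```rust
use std::collections::HashMap;

/// Repeatedly applies the highest-priority (lowest-rank) merge rule
/// until no adjacent pair in `tokens` appears in the merge table.
fn apply_merges(mut tokens: Vec<String>, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the best (lowest) merge rank.
        let mut best: Option<(usize, usize)> = None; // (position, rank)
        for i in 0..tokens.len().saturating_sub(1) {
            if let Some(&rank) = ranks.get(&(tokens[i].clone(), tokens[i + 1].clone())) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            Some((i, _)) => {
                // Merge the winning pair in place and drop its right half.
                let merged = format!("{}{}", tokens[i], tokens[i + 1]);
                tokens[i] = merged;
                tokens.remove(i + 1);
            }
            None => return tokens,
        }
    }
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("l".to_string(), "o".to_string()), 0); // rank 0 = highest priority
    ranks.insert(("lo".to_string(), "w".to_string()), 1);
    let tokens = vec!["l", "o", "w"].into_iter().map(String::from).collect();
    let out = apply_merges(tokens, &ranks);
    println!("{:?}", out); // ["low"]
}
```

The priority lookup is what makes BPE deterministic: when several pairs are mergeable, the rule learned earliest during training always wins.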
## Usage
```toml
[dependencies]
oxibonsai-tokenizer = "0.1.1"
```
```rust
use oxibonsai_tokenizer::OxiTokenizer;

// Paths and input strings below are illustrative.
let tokenizer = OxiTokenizer::load("tokenizer.json")?;
let ids = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&ids)?;
```
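For chat models, `ChatTemplate` produces chatml-formatted prompts. The sketch below shows the chatml wire format itself (`<|im_start|>role ... <|im_end|>` turns, ending with an open assistant turn); the function name and message representation are illustrative, and the crate's actual `ChatTemplate` API may differ.

```rust
/// Formats (role, content) pairs in chatml style, ending with an open
/// assistant turn so the model continues from there. Illustrative sketch.
fn format_chatml(messages: &[(&str, &str)]) -> String {
    let mut out = String::new();
    for (role, content) in messages {
        // Each turn: <|im_start|>role\ncontent<|im_end|>\n
        out.push_str(&format!("<|im_start|>{}\n{}<|im_end|>\n", role, content));
    }
    out.push_str("<|im_start|>assistant\n"); // prompt the model to respond
    out
}

fn main() {
    let prompt = format_chatml(&[
        ("system", "You are a helpful assistant."),
        ("user", "Hello!"),
    ]);
    print!("{prompt}");
}
```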
## License
Apache-2.0 — COOLJAPAN OU