# oxibonsai-tokenizer
Version: 0.1.2 · Status: Stable · Tests: 268 passing
Pure Rust BPE tokenizer for OxiBonsai — WASM-safe, zero FFI.
Implements a byte-pair encoding (BPE) tokenizer with vocabulary management, BPE merge rules, ChatTemplate formatting (chatml), byte-fallback encoding, JSON serialization, a BPE trainer for building new vocabularies, and a streaming decoder for incremental token output.
Part of the OxiBonsai project.
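For readers new to BPE, the core of encoding is a merge loop: repeatedly replace the adjacent token pair with the highest-priority merge rule until no rule applies. The sketch below illustrates that idea in plain Rust; it is not this crate's actual `BpeMerges` implementation, and the function name and data layout are assumptions for the example.

```rust
use std::collections::HashMap;

// Illustrative BPE merge loop (lower rank = higher priority), not the
// crate's real code: scan for the best-ranked adjacent pair, merge it,
// and repeat until no merge rule matches.
fn bpe_merge(mut tokens: Vec<String>, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the lowest rank, if any.
        let mut best: Option<(usize, usize)> = None; // (position, rank)
        for i in 0..tokens.len().saturating_sub(1) {
            if let Some(&rank) = merges.get(&(tokens[i].clone(), tokens[i + 1].clone())) {
                if best.map_or(true, |(_, r)| rank < r) {
                    best = Some((i, rank));
                }
            }
        }
        match best {
            // Replace the pair at position i with the merged token.
            Some((i, _)) => {
                let merged = format!("{}{}", tokens[i], tokens[i + 1]);
                tokens.splice(i..i + 2, [merged]);
            }
            None => return tokens,
        }
    }
}

fn main() {
    let mut merges = HashMap::new();
    merges.insert(("l".to_string(), "o".to_string()), 0);
    merges.insert(("lo".to_string(), "w".to_string()), 1);
    let tokens = vec!["l".to_string(), "o".to_string(), "w".to_string()];
    println!("{:?}", bpe_merge(tokens, &merges)); // ["low"]
}
```

A linear rescan per merge is quadratic in the worst case; production tokenizers typically use a priority queue over candidate pairs, but the observable result is the same.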
## Features
- `OxiTokenizer` — encode, decode, batch encode/decode
- `Vocabulary` — bidirectional token <-> id mapping, special token support
- `BpeMerges` — merge rule table with priority lookup
- `ChatTemplate` — chatml-style prompt formatting
- `ChatTemplateRegistry` — named registry of chat prompt templates
- `HfTokenizerJson` — HuggingFace tokenizer format parser
- `StreamingDecoder` — incremental token-by-token decoding
- Byte-fallback encoding for out-of-vocabulary bytes
- `TokenizerSerializer` — `save_json`/`load_json` round trip
- `BpeTrainer` / `TrainerConfig` — build vocabularies from text corpora
- Benchmark suite and extended Unicode edge-case tests
- WASM-safe: no C/FFI dependencies
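Byte-fallback guarantees that any input can be tokenized: bytes with no vocabulary entry are emitted as synthetic byte tokens. The sketch below shows the common `<0xNN>` convention; the function name and token format are assumptions for illustration, not this crate's actual encoding path.

```rust
// Illustrative byte-fallback (not the crate's real code): every byte
// not covered by the vocabulary becomes a "<0xNN>" token, so nothing
// is ever unencodable.
fn byte_fallback(bytes: &[u8]) -> Vec<String> {
    bytes.iter().map(|b| format!("<0x{:02X}>", b)).collect()
}

fn main() {
    // "é" is two UTF-8 bytes, 0xC3 0xA9.
    println!("{:?}", byte_fallback("é".as_bytes())); // ["<0xC3>", "<0xA9>"]
}
```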
## Usage
```toml
[dependencies]
oxibonsai-tokenizer = "0.1.2"
```

The original snippet lost its arguments in extraction; the file path and input strings below are illustrative placeholders:

```rust
use oxibonsai_tokenizer::OxiTokenizer;

let tokenizer = OxiTokenizer::load("tokenizer.json")?;
let ids = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&ids)?;
```
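For streaming output, decoding token by token is subtle because one UTF-8 character can be split across token boundaries. The sketch below shows the buffering idea a streaming decoder needs; it is an assumed illustration of the concept, not `StreamingDecoder`'s real API.

```rust
// Illustrative incremental decoder (not the crate's real StreamingDecoder):
// buffer raw bytes and emit only the longest valid UTF-8 prefix, holding
// back a trailing partial sequence until more bytes arrive.
struct IncrementalUtf8 {
    buf: Vec<u8>,
}

impl IncrementalUtf8 {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Feed more decoded-token bytes; return whatever is now displayable.
    fn push(&mut self, bytes: &[u8]) -> String {
        self.buf.extend_from_slice(bytes);
        match std::str::from_utf8(&self.buf) {
            // Everything buffered is valid: emit it all.
            Ok(s) => {
                let out = s.to_string();
                self.buf.clear();
                out
            }
            // Emit the valid prefix, keep the incomplete tail buffered.
            Err(e) => {
                let valid = e.valid_up_to();
                let out = String::from_utf8_lossy(&self.buf[..valid]).into_owned();
                self.buf.drain(..valid);
                out
            }
        }
    }
}

fn main() {
    let mut dec = IncrementalUtf8::new();
    // "é" (0xC3 0xA9) split across two token payloads:
    assert_eq!(dec.push(&[0xC3]), ""); // incomplete, held back
    assert_eq!(dec.push(&[0xA9]), "é"); // completed by the next chunk
}
```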
## License
Apache-2.0 — COOLJAPAN OU