oxibonsai-tokenizer 0.1.3

Pure Rust BPE tokenizer for OxiBonsai (MeCrab-compatible)

Version: 0.1.3 · Status: Stable · Tests: 268 passing

Pure Rust BPE tokenizer for OxiBonsai — WASM-safe, zero FFI.

Implements a byte-pair encoding (BPE) tokenizer with vocabulary management, BPE merge rules, ChatTemplate formatting (chatml), byte-fallback encoding, JSON serialization, a BPE trainer for building new vocabularies, and a streaming decoder for incremental token output.
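To make the merge-rule idea concrete, here is a minimal, self-contained sketch of how a BPE merge table with priority lookup can be applied to a single word. It is not taken from this crate: `MergeRanks`, `build_ranks`, and `bpe_merge` are hypothetical names, and the real `BpeMerges` type may store and apply its rules differently.

```rust
use std::collections::HashMap;

/// Hypothetical merge table: maps an adjacent symbol pair to its rank.
/// A lower rank means the merge was learned earlier and is applied first.
type MergeRanks = HashMap<(String, String), usize>;

/// Build a rank table from merges listed in priority order.
fn build_ranks(merges: &[(&str, &str)]) -> MergeRanks {
    merges
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

/// Apply BPE merges to one word, starting from single characters.
fn bpe_merge(word: &str, ranks: &MergeRanks) -> Vec<String> {
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the leftmost adjacent pair with the lowest rank.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // Stop once no adjacent pair appears in the merge table.
        let Some((_, i)) = best else { break };
        // Merge the pair at index i into a single symbol.
        let right = symbols.remove(i + 1);
        symbols[i].push_str(&right);
    }
    symbols
}
```

With the two merges `("l", "l")` and `("h", "e")`, in that priority order, the word `hello` ends up as the three symbols `he`, `ll`, `o`. A production tokenizer would then map each symbol to an id through its vocabulary.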

Part of the OxiBonsai project.

Features

  • OxiTokenizer — encode, decode, batch encode/decode
  • Vocabulary — bidirectional token <-> id mapping, special token support
  • BpeMerges — merge rule table with priority lookup
  • ChatTemplate — chatml-style prompt formatting
  • ChatTemplateRegistry — named registry of chat prompt templates
  • HfTokenizerJson — HuggingFace tokenizer format parser
  • StreamingDecoder — incremental token-by-token decoding
  • Byte-fallback encoding for out-of-vocabulary bytes
  • TokenizerSerializer — save_json / load_json roundtrip
  • BpeTrainer / TrainerConfig — build vocabularies from text corpora
  • Benchmark suite and extended Unicode edge-case tests
  • WASM-safe: no C/FFI dependencies
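As an illustration of the byte-fallback item above, the following sketch encodes text against a toy vocabulary and falls back to one token per UTF-8 byte for anything the vocabulary does not contain. Everything here is an assumption made for the example: the function name, the per-character lookup, and the id layout (`BYTE_TOKEN_BASE`, with ids 0..=255 reserved for raw bytes) are hypothetical and do not describe how OxiTokenizer assigns byte tokens.

```rust
use std::collections::HashMap;

/// Hypothetical layout: ids 0..=255 are reserved for the raw-byte tokens.
const BYTE_TOKEN_BASE: u32 = 0;

/// Encode `text` one character at a time against a toy vocabulary.
/// Characters missing from `vocab` fall back to one token per UTF-8 byte.
fn encode_with_byte_fallback(text: &str, vocab: &HashMap<String, u32>) -> Vec<u32> {
    let mut ids = Vec::new();
    let mut buf = [0u8; 4]; // large enough for any single UTF-8 character
    for ch in text.chars() {
        let piece: &str = ch.encode_utf8(&mut buf);
        match vocab.get(piece) {
            // Known token: emit its id directly.
            Some(&id) => ids.push(id),
            // Unknown token: emit one byte token per UTF-8 byte.
            None => ids.extend(piece.bytes().map(|b| BYTE_TOKEN_BASE + u32::from(b))),
        }
    }
    ids
}
```

With a vocabulary that only knows `h` (id 256) and `i` (id 257), the input `hé` encodes to `[256, 195, 169]`, since `é` is the two UTF-8 bytes 0xC3 0xA9. Because every byte has a token, no input is ever unencodable.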

Usage

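Add the crate to Cargo.toml:

[dependencies]
oxibonsai-tokenizer = "0.1.3"

Then load a tokenizer and round-trip a string. The ? operator means this snippet belongs inside a function that returns a Result:

use oxibonsai_tokenizer::OxiTokenizer;

// Load the tokenizer definition from a tokenizer.json file.
let tokenizer = OxiTokenizer::load("tokenizer.json")?;
// Encode text to token ids, then decode the ids back to text.
let ids = tokenizer.encode("Hello, world!")?;
let text = tokenizer.decode(&ids)?;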
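Decoding a full id sequence in one call is the simple case. Incremental decoding is harder, because a single token may end in the middle of a multi-byte UTF-8 character. The sketch below shows one common way to handle that, by holding back incomplete bytes until the next token arrives. `Utf8Stream` and its `push` method are hypothetical names for illustration, and the sketch assumes tokens are available as raw byte slices; it is not the API of this crate's `StreamingDecoder`.

```rust
/// Hypothetical incremental decoder: tokens arrive as raw byte slices, and
/// text is released only once it forms complete UTF-8.
struct Utf8Stream {
    pending: Vec<u8>,
}

impl Utf8Stream {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Feed the bytes of one token; returns whatever text is now complete.
    fn push(&mut self, token_bytes: &[u8]) -> String {
        self.pending.extend_from_slice(token_bytes);
        // Measure the valid prefix, and whether the failure is a truly
        // invalid sequence rather than a merely incomplete one.
        let (valid_len, invalid) = match std::str::from_utf8(&self.pending) {
            Ok(_) => (self.pending.len(), false),
            Err(e) => (e.valid_up_to(), e.error_len().is_some()),
        };
        if invalid {
            // These bytes can never become valid UTF-8: flush them lossily.
            let bytes = std::mem::take(&mut self.pending);
            return String::from_utf8_lossy(&bytes).into_owned();
        }
        // Keep the incomplete tail buffered and release the valid prefix.
        let tail = self.pending.split_off(valid_len);
        let ready = std::mem::replace(&mut self.pending, tail);
        String::from_utf8(ready).expect("prefix was validated by from_utf8")
    }
}
```

Feeding the byte 0xE6 alone yields an empty string, and feeding 0x97 0xA5 next yields `日`, because the three bytes together form that character. Flushing invalid input with replacement characters is one possible policy; returning an error instead would be equally reasonable.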

License

Apache-2.0 — COOLJAPAN OU