//! # oxibonsai-tokenizer
//!
//! Pure Rust BPE tokenizer for OxiBonsai — MeCrab-compatible, WASM-safe.
//!
//! This crate is a **stub implementation** that will eventually replace the
//! HuggingFace `tokenizers` dependency in [`oxibonsai-runtime`]. It provides:
//!
//! - [`OxiTokenizer`] — high-level encode/decode API
//! - [`Vocabulary`] — bidirectional token ↔ ID mapping with special-token support
//! - [`BpeMerges`] — ordered BPE merge table
//! - [`bpe_encode`] / [`pretokenize`] — core BPE primitives
//! - [`byte_fallback_id`] — `<0xHH>` byte-fallback helper
//! - [`TokenizerError`] / [`TokenizerResult`] — error types
//!
//! ## Quick start (char-level stub — no vocab file needed)
//!
//! ```rust
//! use oxibonsai_tokenizer::OxiTokenizer;
//!
//! let tok = OxiTokenizer::char_level_stub(256);
//! let ids = tok.encode("Hello!").expect("encode should succeed");
//! assert!(!ids.is_empty());
//! ```
//!
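//! ## Byte fallback (`<0xHH>`)
//!
//! Out-of-vocabulary input is represented one token per raw UTF-8 byte,
//! written `<0xHH>`. The convention itself can be sketched standalone; this
//! sketch does not call [`byte_fallback_id`], whose exact signature belongs
//! to the stub:
//!
//! ```rust
//! // Each input byte maps to a literal token such as "<0x48>" for b'H'.
//! let tokens: Vec<String> = "Hi".bytes().map(|b| format!("<0x{b:02X}>")).collect();
//! assert_eq!(tokens, ["<0x48>", "<0x69>"]);
//! ```
//!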
//! ## Loading from JSON vocab + merges
//!
//! ```rust
//! use oxibonsai_tokenizer::{OxiTokenizer, TokenizerConfig};
//!
//! let vocab_json = r#"{"a":10,"b":11,"ab":20,"<unk>":0,"<bos>":1,"<eos>":2,"<pad>":3}"#;
//! let merges_json = r#"[["a","b"]]"#;
//! let tok = OxiTokenizer::from_json(vocab_json, merges_json, TokenizerConfig::default())
//!     .expect("loading should succeed");
//! assert_eq!(tok.vocab_size(), 7);
//! ```
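//!
//! ## How merges apply (illustrative sketch)
//!
//! A BPE merge pair such as `["a", "b"]` joins adjacent occurrences of the
//! two symbols into one. The core idea, independent of [`BpeMerges`] and
//! [`bpe_encode`] (whose exact signatures are part of the stub), looks like:
//!
//! ```rust
//! // One greedy left-to-right pass applying the merge ("a", "b") -> "ab".
//! let (left, right) = ("a", "b");
//! let mut syms: Vec<String> = ["a", "b", "a"].iter().map(|s| s.to_string()).collect();
//! let mut i = 0;
//! while i + 1 < syms.len() {
//!     if syms[i] == left && syms[i + 1] == right {
//!         let joined = format!("{}{}", syms[i], syms[i + 1]);
//!         syms.splice(i..i + 2, [joined]);
//!     } else {
//!         i += 1;
//!     }
//! }
//! assert_eq!(syms, ["ab", "a"]);
//! ```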
// Re-export the most commonly used types at the crate root.
// NOTE: the module paths below assume a conventional submodule layout
// (`bpe`, `error`, `tokenizer`, `vocab`); adjust them to the crate's
// actual module names.
pub use bpe::{bpe_encode, pretokenize, BpeMerges};
pub use error::{TokenizerError, TokenizerResult};
pub use tokenizer::{OxiTokenizer, TokenizerConfig};
pub use vocab::{byte_fallback_id, Vocabulary};