//! Gemma 3 tokenizer for the local in-browser model backend.
//!
//! Thin wrapper over HuggingFace's `tokenizers` crate, loaded from raw
//! `tokenizer.json` bytes (`include_bytes!`, an OPFS read, or a CDN fetch), so
//! this module has no filesystem or network dependency. It compiles on both
//! native targets and `wasm32-unknown-unknown`: the crate is pulled in with
//! `default-features = false, features = ["unstable_wasm"]`, which swaps the C
//! `onig` regex for pure-Rust `fancy-regex` and points `getrandom` at the
//! browser WebCrypto backend (the candle-wasm-examples recipe).
//!
//! ## BOS handling
//!
//! Gemma special tokens: `pad = 0`, `eos = 1`, `bos = 2`, `unk = 3`; the
//! vocabulary size is 262144. The model expects a single leading `<bos>`
//! (id 2). Gemma's `tokenizer.json` ships a `TemplateProcessing`
//! post-processor that *also* prepends `<bos>` when
//! `encode(text, add_special_tokens = true)` is used, so taking that path AND
//! prepending manually would yield a doubled BOS and corrupt the first-token
//! statistics.
//!
//! To make the contract (`encode` prepends BOS=2) unambiguous regardless of
//! whether the loaded json carries that post-processor, this wrapper encodes
//! with `add_special_tokens = false` (no auto-specials) and prepends exactly
//! one BOS by hand. Result: precisely one `<bos>` at the front, always.
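//!
//! A minimal sketch of that contract, assuming the raw ids were already
//! produced with `add_special_tokens = false`. `with_single_bos` and `BOS` are
//! hypothetical names used only for illustration; they are not part of this
//! module's API.
//!
/*!
```rust
// Gemma `<bos>` id, as listed above.
const BOS: i64 = 2;

// Hypothetical helper: widen `u32` ids to `i64` and prepend exactly one BOS.
// It assumes `raw_ids` contains no tokenizer-added specials.
fn with_single_bos(raw_ids: &[u32]) -> Vec<i64> {
    let mut ids = Vec::with_capacity(raw_ids.len() + 1);
    ids.push(BOS);
    ids.extend(raw_ids.iter().map(|&id| i64::from(id)));
    ids
}
```
*/
//!
//! Because the BOS is added in one place only, the output starts with a single
//! `2` even for empty input.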
//!
//! ## Type bridge
//!
//! The model's `forward` takes `Tensor<B, 2, Int>` (token ids as `i64`); the
//! `tokenizers` crate speaks `u32`. `encode` returns `Vec<i64>` and `decode`
//! takes `&[i64]`, converting at the boundary. Negative ids (which should never
//! come out of the model's argmax over a 262144-entry vocab) are dropped on
//! decode.
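//!
//! The decode-side conversion might look like the sketch below. `ids_to_u32`
//! is a hypothetical helper, and it makes one extra assumption: ids too large
//! for `u32` are dropped along with negative ones.
//!
/*!
```rust
use std::convert::TryFrom;

// Hypothetical helper: narrow model-side `i64` ids to the `u32` ids the
// `tokenizers` crate expects. `u32::try_from` fails for negative values (and
// for values above `u32::MAX`), so those ids are skipped.
fn ids_to_u32(ids: &[i64]) -> Vec<u32> {
    ids.iter()
        .filter_map(|&id| u32::try_from(id).ok())
        .collect()
}
```
*/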

use tokenizers::Tokenizer;

/// Gemma `<pad>` token id.
pub const GEMMA_PAD: i64 = 0;
/// Gemma `<eos>` token id (greedy generation stops here).
pub const GEMMA_EOS: i64 = 1;
/// Gemma `<bos>` token id (prepended by [`GemmaTokenizer::encode`]).
pub const GEMMA_BOS: i64 = 2;
/// Gemma `<unk>` token id.
pub const GEMMA_UNK: i64 = 3;
/// A loaded Gemma 3 tokenizer. Construct via [`load`].
/// Load a [`GemmaTokenizer`] from raw `tokenizer.json` bytes.
///
/// `bytes` is the full HuggingFace fast-tokenizer JSON (Gemma's is ~33 MB).
/// No filesystem or network is touched: feed it `include_bytes!` output, an
/// `OpfsFilesystem::read` result, or the body of a CDN `fetch`. Builds cleanly
/// for wasm.