//! LLaMA/TinyLlama SentencePiece-style BPE tokenizer (GH-145).
//!
//! Provides tokenization for LLaMA-family models:
//! - Load vocabulary from GGUF metadata
//! - SentencePiece-style BPE encoding
//! - Decode token IDs to text
//! - Special token handling (BOS, EOS, UNK, PAD)
//!
//! # Architecture
//!
//! ```text
//! Text → Normalize → BPE Encode → Token IDs
//!                        ↓
//!        GGUF metadata (tokens, scores, merges)
//! ```
//!
//! # Example
//!
//! ```rust,ignore
//! use aprender::text::llama_tokenizer::LlamaTokenizer;
//!
//! let tokenizer = LlamaTokenizer::from_gguf_bytes(&gguf_data)?;
//! let tokens = tokenizer.encode("Hello, world!");
//! let decoded = tokenizer.decode(&tokens);
//! assert_eq!(decoded, "Hello, world!");
//! ```
//!
//! # PMAT Compliance
//!
//! - Zero `unwrap()` calls
//! - All public APIs return `Result<T, E>` where fallible
//! - Toyota P5 Jidoka: fail-fast on invalid input
use std::collections::HashMap;
use crate;
// ============================================================================
// Constants
// ============================================================================
/// `LLaMA` vocabulary size (standard)
pub const LLAMA_VOCAB_SIZE: usize = 32000;
/// Special token: Beginning of sequence
pub const BOS_TOKEN: &str = "<s>";
/// Special token: End of sequence
pub const EOS_TOKEN: &str = "</s>";
/// Special token: Unknown
pub const UNK_TOKEN: &str = "<unk>";
/// Byte fallback prefix (`SentencePiece` style)
const BYTE_FALLBACK_PREFIX: &str = "<0x";
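A SentencePiece byte-fallback token spells a raw byte as `<0xNN>`. A minimal sketch of parsing such a token back into its byte (an illustrative helper; `parse_byte_fallback` is not this module's actual API):

```rust
/// Parse a SentencePiece byte-fallback token such as "<0x41>" into its
/// raw byte (0x41). Returns None for any token that is not in this form.
fn parse_byte_fallback(token: &str) -> Option<u8> {
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    u8::from_str_radix(hex, 16).ok()
}
```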
// ============================================================================
// GPT-2 BPE Byte Decoding
// ============================================================================
/// Build the GPT-2 byte-to-unicode mapping.
///
/// GPT-2 BPE encodes bytes as unicode characters to ensure all byte sequences
/// are valid unicode. This maps the special unicode chars back to bytes.
///
/// The mapping is:
/// - Printable ASCII without space (33-126): Maps to same codepoint
/// - All other bytes (0-32, 127-255): Maps to U+0100..U+01FF range
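The mapping described above can be sketched as follows (an illustrative helper following the doc comment's description, not the module's actual implementation):

```rust
use std::collections::HashMap;

/// Build the byte-to-unicode table: printable ASCII except space keeps
/// its own codepoint; every other byte b is shifted to U+0100 + b, so
/// the result is always a valid unicode character.
fn byte_to_unicode() -> HashMap<u8, char> {
    (0u32..=255)
        .map(|b| {
            let cp = if (33..=126).contains(&b) { b } else { 0x100 + b };
            // Both ranges are valid scalar values, so the fallback is unreachable.
            (b as u8, char::from_u32(cp).unwrap_or('\u{FFFD}'))
        })
        .collect()
}
```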
/// Decode a GPT-2 BPE token string to actual bytes.
///
/// GPT-2 BPE uses unicode characters to represent bytes. This function
/// converts the unicode representation back to the original bytes.
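The inverse direction can be sketched by undoing the shift described above (an illustrative helper with an assumed name, not the module's actual implementation):

```rust
/// Map each char of a GPT-2 BPE token string back to the byte it
/// encodes: chars in U+0100..U+01FF are shifted-down bytes, everything
/// else is printable ASCII kept at its own codepoint.
fn gpt2_token_to_bytes(token: &str) -> Vec<u8> {
    token
        .chars()
        .map(|c| {
            let cp = c as u32;
            if (0x100..=0x1FF).contains(&cp) {
                (cp - 0x100) as u8 // byte shifted into U+0100..U+01FF
            } else {
                cp as u8 // printable ASCII kept as-is
            }
        })
        .collect()
}
```

Under this mapping, for example, `Ġ` (U+0120) decodes back to the space byte 0x20.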
// ============================================================================
// LlamaTokenizer
// ============================================================================
/// Tokenizer model type (from GGUF metadata)
/// LLaMA/TinyLlama tokenizer using SentencePiece-style BPE.
///
/// Also supports GPT-2 BPE for models like Qwen2.5.
///
/// Loads vocabulary and merges from GGUF metadata.
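The SentencePiece-style BPE encoding loop can be sketched as follows: repeatedly merge the adjacent pair of pieces whose concatenation exists in the vocabulary with the highest score, until no mergeable pair remains. This is a minimal illustration under assumed names (`bpe_merge`, a score map keyed by merged piece), not this tokenizer's actual encoder:

```rust
use std::collections::HashMap;

/// Greedily merge adjacent pieces, always picking the in-vocab merge
/// with the highest score, until no adjacent pair can be merged.
fn bpe_merge(pieces: &mut Vec<String>, scores: &HashMap<String, f32>) {
    loop {
        // Find the best-scoring adjacent pair whose merge is in the vocab.
        let mut best: Option<(usize, f32)> = None;
        for i in 0..pieces.len().saturating_sub(1) {
            let merged = format!("{}{}", pieces[i], pieces[i + 1]);
            if let Some(&s) = scores.get(&merged) {
                if best.map_or(true, |(_, bs)| s > bs) {
                    best = Some((i, s));
                }
            }
        }
        match best {
            Some((i, _)) => {
                // Replace the pair with its merged piece.
                pieces[i] = format!("{}{}", pieces[i], pieces[i + 1]);
                pieces.remove(i + 1);
            }
            None => break, // Jidoka: stop as soon as no valid merge exists.
        }
    }
}
```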
include!;
include!;