pub struct TokenizerMap {Show 14 fields
pub id: String,
pub version: String,
pub vocab_size: i64,
pub vocab: Option<HashMap<String, u32>>,
pub tokens: Option<HashMap<String, String>>,
pub encoder: Option<String>,
pub merges: Option<Vec<String>>,
pub pre_tokenizer_pattern: Option<String>,
pub pre_tokenizer_program: Option<PreTokProgram>,
pub byte_fallback_start: Option<i64>,
pub byte_fallback_end: Option<i64>,
pub special_tokens: Option<HashMap<String, u32>>,
pub tool_calling: Option<ToolCallingBlock>,
pub published_at: Option<String>,
}Expand description
A per-model tokenizer dialect — the data needed to encode text into token IDs and decode IDs back to text.
Maps are immutable once published; a new model version publishes a new map at a new URL with a new sha256 hash.
Schema v2: TokenizerMap::vocab is the raw HuggingFace
tokenizer.json form (byte-level GPT-2-encoded chars or ▁-prefixed
metaspace strings). TokenizerMap::tokens is the legacy v1 field,
kept for backwards compatibility — the crate::Detokenizer reads
from whichever is present.
Fields§
§id: StringStable, globally unique tokenizer identifier (e.g. "qwen/qwen2").
version: StringSchema version. "2" for v2 maps; "1" for legacy v1.
vocab_size: i64Total number of token IDs in the vocabulary.
vocab: Option<HashMap<String, u32>>Vocabulary as { raw_token_text → id }. v2 schema field.
tokens: Option<HashMap<String, String>>Legacy v1 vocabulary as { id_string → decoded_text }.
encoder: Option<String>Encoder family: "byte_level", "metaspace", or omitted (identity).
merges: Option<Vec<String>>BPE merges in priority order (lower index = higher priority).
pre_tokenizer_pattern: Option<String>Pre-tokenizer regex pattern. Required for byte_level BPE when
pre_tokenizer_program is absent.
pre_tokenizer_program: Option<PreTokProgram>Compiled pre-tokenizer program. Preferred over pre_tokenizer_pattern
when present — the runtime executes the ops directly with no regex
engine, which unblocks the GPT-2-family maps whose (?i:...) and
(?!\S) syntax the regex crate doesn’t support. See
crate::pretok_program::PreTokProgram and
spec/PRETOKENIZER_PROGRAM.md.
byte_fallback_start: Option<i64>First ID in the byte-fallback range (inclusive). SentencePiece only.
byte_fallback_end: Option<i64>Last ID in the byte-fallback range (inclusive). SentencePiece only.
special_tokens: Option<HashMap<String, u32>>Named special tokens. Skipped during text rendering by default.
tool_calling: Option<ToolCallingBlock>Per-model tool-calling convention. Optional; populated by
@codecai/maps-cli when it detects a known chat-template signature.
Absent on maps generated before this block existed; readers MUST treat
absence as “convention not declared” rather than as an error. See
spec/PROTOCOL.md § “Tool-call calling conventions in the map”.
published_at: Option<String>ISO 8601 publish timestamp. Informational.
Implementations§
Source§impl TokenizerMap
impl TokenizerMap
Sourcepub fn from_json(json: &[u8]) -> Result<Self, TokenizerMapError>
pub fn from_json(json: &[u8]) -> Result<Self, TokenizerMapError>
Parse a TokenizerMap from JSON bytes and validate it.
Sourcepub fn from_json_str(json: &str) -> Result<Self, TokenizerMapError>
pub fn from_json_str(json: &str) -> Result<Self, TokenizerMapError>
Parse from a JSON string and validate.
Sourcepub fn verify_sha256(
bytes: &[u8],
expected: &str,
) -> Result<String, (String, String)>
pub fn verify_sha256( bytes: &[u8], expected: &str, ) -> Result<String, (String, String)>
Verify that bytes hashes to expected (a hex string, optionally
prefixed with sha256:). Returns the actual hex digest.
Sourcepub fn validate(map: &Self) -> Result<(), TokenizerMapError>
pub fn validate(map: &Self) -> Result<(), TokenizerMapError>
Throws on schema violations.
Trait Implementations§
Source§impl Clone for TokenizerMap
impl Clone for TokenizerMap
Source§fn clone(&self) -> TokenizerMap
fn clone(&self) -> TokenizerMap
1.0.0 (const: unstable) · Source§fn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
source. Read more