pub struct TokenizerInfo { /* private fields */ }
TokenizerInfo contains the vocabulary, its type, and metadata used for grammar-guided generation.
Notes:
- Tokens may be encoded differently depending on VocabType (e.g. ByteFallback uses "<0x1B>", ByteLevel uses Unicode mappings). This wrapper exposes the decoded vocabulary in the same form as the original text via decoded_vocab_as_bytes.
- Some models pad their vocab size to a multiple of 32 or similar. If your model's vocab size differs from encoded_vocab.len(), use new_with_vocab_size to pass the model's vocab size so bitmask sizes are computed correctly.
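To see why the model's true vocab size matters, here is a minimal sketch, assuming the token bitmask packs one bit per vocabulary entry into 32-bit words (a common layout for grammar-guided decoding; the word width is an assumption, not stated by this API):

```rust
/// Number of 32-bit words needed to hold one bit per vocabulary entry.
/// Assumes the bitmask packs bits into 32-bit words, which is why the
/// model's padded vocab size (not `encoded_vocab.len()`) must be used.
fn bitmask_word_count(vocab_size: usize) -> usize {
    vocab_size.div_ceil(32)
}

fn main() {
    // A tokenizer vocab of 32000 tokens padded by the model to 32064
    // needs a larger bitmask:
    assert_eq!(bitmask_word_count(32000), 1000);
    assert_eq!(bitmask_word_count(32064), 1002);
    println!("ok");
}
```

If the mask were sized from encoded_vocab.len() alone, bits for the padded tail indices would be missing.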
Implementations
impl TokenizerInfo
pub fn new<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self
Construct a TokenizerInfo with vocab size derived from encoded_vocab.
If the model’s vocab size differs from encoded_vocab.len(), prefer
new_with_vocab_size.
pub fn new_with_vocab_size<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    vocab_size: Option<usize>,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self
Construct a TokenizerInfo with an explicit model vocab_size.
Use this when the model’s vocab size (e.g., padded to a multiple of 32)
differs from the tokenizer’s encoded_vocab.len(). Indices in the range
[encoded_vocab.len(), vocab_size) are treated as special/reserved.
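For illustration only (the padding rule is model-specific, not part of this API; the numbers below are hypothetical): a model that pads its vocab to a multiple of 32 ends up with a vocab_size larger than the tokenizer's, and the tail indices become special/reserved:

```rust
/// Round `n` up to the next multiple of `multiple` (hypothetical helper,
/// mirroring the padding some models apply to their vocab size).
fn pad_to_multiple(n: usize, multiple: usize) -> usize {
    n.div_ceil(multiple) * multiple
}

fn main() {
    let encoded_vocab_len = 151_643; // tokenizer's encoded_vocab.len()
    let model_vocab_size = pad_to_multiple(encoded_vocab_len, 32);
    assert_eq!(model_vocab_size, 151_648);
    // Indices 151_643..151_648 are treated as special/reserved, so
    // `new_with_vocab_size` should receive 151_648, not 151_643.
    println!("{model_vocab_size}");
}
```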
pub fn from_vocab_and_metadata_bytes<I, B>(
    encoded_vocab: I,
    metadata: &str,
) -> Self
Construct TokenizerInfo from encoded vocab (bytes) and a metadata JSON
string produced by dump_metadata.
pub fn vocab_type(&self) -> VocabType
The type of the vocabulary.
pub fn vocab_size(&self) -> usize
The size of the vocabulary.
pub fn add_prefix_space(&self) -> bool
Whether the tokenizer prepends a space to the text during tokenization.
pub fn decoded_vocab(&self) -> Box<[Box<[u8]>]>
The decoded vocabulary of the tokenizer. This converts tokens in the LLM's vocabulary back to their original text form (e.g., ByteFallback "<0x1B>" -> "\u001b").
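As a local sketch of the ByteFallback mapping mentioned above (not this crate's actual decoder): tokens of the form "<0xHH>" decode to the single raw byte HH:

```rust
/// Decode a ByteFallback-style token such as "<0x1B>" into its raw byte.
/// Returns None if the token is not of that form. A standalone sketch of
/// the mapping described above, not the crate's implementation.
fn decode_byte_fallback(token: &str) -> Option<u8> {
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}

fn main() {
    assert_eq!(decode_byte_fallback("<0x1B>"), Some(0x1B)); // ESC, i.e. "\u{1b}"
    assert_eq!(decode_byte_fallback("hello"), None);        // ordinary token
    println!("ok");
}
```

decoded_vocab applies this kind of conversion across the whole vocabulary, yielding raw bytes rather than the escaped surface form.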
pub fn stop_token_ids(&self) -> Box<[i32]>
Stop token ids.
pub fn special_token_ids(&self) -> Box<[i32]>
The special token ids. Special tokens include control tokens, reserved tokens, padded tokens, etc. They are detected automatically from the vocabulary.
pub fn dump_metadata(&self) -> String
Dump the metadata of the tokenizer to a JSON string. Together with the encoded vocabulary, the resulting string can be used to reconstruct the tokenizer info (see from_vocab_and_metadata_bytes).
pub fn serialize_json(&self) -> String
Serialize the tokenizer info to a JSON string.
pub fn deserialize_json(json: &str) -> Result<Self, String>
Deserialize a TokenizerInfo from a JSON string.
Returns

- Ok(TokenizerInfo) on success
- Err(String) when deserialization fails due to any of the following:
  - invalid JSON syntax
  - schema/format mismatch with the TokenizerInfo serialization
  - serialization version mismatch (via the __VERSION__ field)

The error string mirrors the C++ exception message.