pub struct TokenizerInfo { /* private fields */ }
TokenizerInfo contains the vocabulary, its type, and metadata used for grammar-guided generation.
Notes:
- Tokens may be encoded differently depending on VocabType (e.g. ByteFallback uses "<0x1B>", ByteLevel uses Unicode mappings). This wrapper exposes the decoded vocabulary in the same form as the original text via decoded_vocab_as_bytes.
- Some models pad their vocab size to a multiple of 32 or similar. If your model's vocab size differs from encoded_vocab.len(), use new_with_vocab_size to pass the model's vocab size so bitmask sizes are computed correctly.
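To see why the model's true vocab size matters, here is a minimal sketch, assuming the token bitmask packs one bit per vocabulary entry into 32-bit words (a common layout for grammar-guided decoding; the word width is an assumption, not stated by this API):

```rust
/// Number of 32-bit words needed to hold one bit per vocabulary entry.
/// Assumes the bitmask packs bits into 32-bit words, which is why the
/// model's padded vocab size (not `encoded_vocab.len()`) must be used.
fn bitmask_word_count(vocab_size: usize) -> usize {
    vocab_size.div_ceil(32)
}

fn main() {
    // A tokenizer vocab of 32000 tokens padded by the model to 32064
    // needs a larger bitmask:
    assert_eq!(bitmask_word_count(32000), 1000);
    assert_eq!(bitmask_word_count(32064), 1002);
    println!("ok");
}
```

If the mask were sized from encoded_vocab.len() alone, bits for the padded tail indices would be missing.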
Implementations
impl TokenizerInfo
pub fn new<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self
Construct a TokenizerInfo with vocab size derived from encoded_vocab.
If the model’s vocab size differs from encoded_vocab.len(), prefer
new_with_vocab_size.
pub fn new_with_vocab_size<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    vocab_size: Option<usize>,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self
Construct a TokenizerInfo with an explicit model vocab_size.
Use this when the model’s vocab size (e.g., padded to a multiple of 32)
differs from the tokenizer’s encoded_vocab.len(). Indices in the range
[encoded_vocab.len(), vocab_size) are treated as special/reserved.
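For illustration only (the padding rule is model-specific, not part of this API; the numbers below are hypothetical): a model that pads its vocab to a multiple of 32 ends up with a vocab_size larger than the tokenizer's, and the tail indices become special/reserved:

```rust
/// Round `n` up to the next multiple of `multiple` (hypothetical helper,
/// mirroring the padding some models apply to their vocab size).
fn pad_to_multiple(n: usize, multiple: usize) -> usize {
    n.div_ceil(multiple) * multiple
}

fn main() {
    let encoded_vocab_len = 151_643; // tokenizer's encoded_vocab.len()
    let model_vocab_size = pad_to_multiple(encoded_vocab_len, 32);
    assert_eq!(model_vocab_size, 151_648);
    // Indices 151_643..151_648 are treated as special/reserved, so
    // `new_with_vocab_size` should receive 151_648, not 151_643.
    println!("{model_vocab_size}");
}
```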
pub fn from_vocab_and_metadata_bytes<I, B>(
    encoded_vocab: I,
    metadata: &str,
) -> Self
Construct TokenizerInfo from encoded vocab (bytes) and a metadata JSON
string produced by dump_metadata.
pub fn vocab_type(&self) -> VocabType
The type of the vocabulary.
pub fn vocab_size(&self) -> usize
The size of the vocabulary.
pub fn add_prefix_space(&self) -> bool
Whether the tokenizer prepends a space to the text during tokenization.
pub fn decoded_vocab(&self) -> Box<[Box<[u8]>]>
The decoded vocabulary of the tokenizer. This converts tokens in the LLM's vocabulary back to their original text form (e.g., ByteFallback "<0x1B>" -> "\u001b").
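As a local sketch of the ByteFallback mapping mentioned above (not this crate's actual decoder): tokens of the form "<0xHH>" decode to the single raw byte HH:

```rust
/// Decode a ByteFallback-style token such as "<0x1B>" into its raw byte.
/// Returns None if the token is not of that form. A standalone sketch of
/// the mapping described above, not the crate's implementation.
fn decode_byte_fallback(token: &str) -> Option<u8> {
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}

fn main() {
    assert_eq!(decode_byte_fallback("<0x1B>"), Some(0x1B)); // ESC, i.e. "\u{1b}"
    assert_eq!(decode_byte_fallback("hello"), None);        // ordinary token
    println!("ok");
}
```

decoded_vocab applies this kind of conversion across the whole vocabulary, yielding raw bytes rather than the escaped surface form.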
pub fn stop_token_ids(&self) -> Box<[i32]>
Stop token ids.
pub fn special_token_ids(&self) -> Box<[i32]>
The special token ids. Special tokens include control tokens, reserved tokens, padded tokens, etc. They are detected automatically from the vocabulary.
pub fn dump_metadata(&self) -> String
Dump the metadata of the tokenizer to a JSON string. Together with the encoded vocabulary, the resulting string can be used to reconstruct the tokenizer info (see from_vocab_and_metadata_bytes).
pub fn serialize_json(&self) -> String
Serialize the tokenizer info to a JSON string.
pub fn deserialize_json(json: &str) -> Result<Self, String>
Deserialize a TokenizerInfo from a JSON string.
Returns

- Ok(TokenizerInfo) on success
- Err(String) when deserialization fails due to any of the following:
  - invalid JSON syntax
  - schema/format mismatch with the TokenizerInfo serialization
  - serialization version mismatch (via the __VERSION__ field)

The error string mirrors the C++ exception message.