TokenizerInfo

Struct TokenizerInfo 

pub struct TokenizerInfo { /* private fields */ }

TokenizerInfo contains the vocabulary, its type, and the metadata used by grammar-guided generation.

Notes:

  • Tokens may be encoded differently depending on VocabType (e.g. ByteFallback uses “<0x1B>”, ByteLevel uses unicode mappings). This wrapper exposes the decoded vocabulary, as raw bytes in the same form as the original text, via decoded_vocab.
  • Some models pad their vocab size to a multiple of 32 or similar. If your model’s vocab size differs from encoded_vocab.len(), use new_with_vocab_size to pass the model’s vocab size so bitmask sizes are computed correctly.
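To see why the padded size matters, here is a minimal sketch of the arithmetic. It assumes the token bitmask packs one bit per token into 32-bit words, which this page does not state; the helper names and the vocabulary sizes are hypothetical and chosen only for illustration.

```rust
/// Rounds `n` up to the next multiple of `multiple` (assumed to be non-zero).
fn pad_to_multiple(n: usize, multiple: usize) -> usize {
    (n + multiple - 1) / multiple * multiple
}

/// Number of 32-bit words needed to hold one bit per token.
/// Assumption: the bitmask uses 32 token bits per word.
fn bitmask_words(vocab_size: usize) -> usize {
    (vocab_size + 31) / 32
}
```

With a tokenizer of 50_257 entries and a model padded to a multiple of 64 (50_304 logits), the two sizes give 1571 and 1572 words respectively, so a mask sized from encoded_vocab.len() would be one word short of the model's logits.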

Implementations

impl TokenizerInfo

pub fn new<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self

Construct a TokenizerInfo with vocab size derived from encoded_vocab.

If the model’s vocab size differs from encoded_vocab.len(), prefer new_with_vocab_size.

pub fn new_with_vocab_size<T: AsRef<str>>(
    encoded_vocab: &[T],
    vocab_type: VocabType,
    vocab_size: Option<usize>,
    stop_token_ids: &Option<Box<[i32]>>,
    add_prefix_space: bool,
) -> Self

Construct a TokenizerInfo with an explicit model vocab_size.

Use this when the model’s vocab size (e.g., padded to a multiple of 32) differs from the tokenizer’s encoded_vocab.len(). Indices in the range [encoded_vocab.len(), vocab_size) are treated as special/reserved.
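The following sketch shows one way to prepare the vocab_size argument and to enumerate the reserved index range described above. Both helpers are hypothetical and not part of this crate. Returning None when the sizes match assumes that None makes the constructor fall back to encoded_vocab.len(), as new does; that behavior is not confirmed on this page.

```rust
use std::ops::Range;

/// Builds the `vocab_size` argument: `Some(model size)` only when it differs
/// from the tokenizer's vocabulary length.
/// Assumption: `None` means "derive the size from `encoded_vocab.len()`".
fn vocab_size_arg(encoded_len: usize, model_vocab_size: usize) -> Option<usize> {
    if model_vocab_size == encoded_len {
        None
    } else {
        Some(model_vocab_size)
    }
}

/// Token ids that have no tokenizer entry: [encoded_len, vocab_size).
/// The range is empty when the model size does not exceed the tokenizer size.
fn reserved_ids(encoded_len: usize, vocab_size: usize) -> Range<usize> {
    encoded_len..vocab_size.max(encoded_len)
}
```

For a tokenizer with 50_257 entries and a model with 50_304 logits, vocab_size_arg yields Some(50_304) and reserved_ids covers the 47 trailing ids.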

pub fn from_vocab_and_metadata_bytes<I, B>(
    encoded_vocab: I,
    metadata: &str,
) -> Self
where
    I: IntoIterator<Item = B>,
    B: AsRef<[u8]>,

Construct TokenizerInfo from encoded vocab (bytes) and a metadata JSON string produced by dump_metadata.

pub fn vocab_type(&self) -> VocabType

The type of the vocabulary.

pub fn vocab_size(&self) -> usize

The size of the vocabulary.

pub fn add_prefix_space(&self) -> bool

Whether the tokenizer prepends a space to the text during tokenization.

pub fn decoded_vocab(&self) -> Box<[Box<[u8]>]>

The decoded vocabulary of the tokenizer. This converts tokens in the LLM’s vocabulary back to the original text form (e.g., ByteFallback “<0x1B>” -> “\u001b”).
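As an illustration of what decoding a ByteFallback token involves, the sketch below turns a token of the form “<0xHH>” into the single byte it names. It is a standalone approximation, not this crate's implementation; the function name is hypothetical, and the handling of “▁” (U+2581) as a space is an assumption borrowed from SentencePiece-style vocabularies.

```rust
/// Decodes one ByteFallback-style token into raw bytes (illustrative only).
fn decode_byte_fallback(token: &str) -> Vec<u8> {
    // A token of the exact form "<0xHH>" stands for one raw byte.
    if token.len() == 6 && token.starts_with("<0x") && token.ends_with('>') {
        let hex = &token[3..5];
        if hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            if let Ok(byte) = u8::from_str_radix(hex, 16) {
                return vec![byte];
            }
        }
    }
    // Assumption: U+2581 marks a space, as in SentencePiece vocabularies.
    token.replace('\u{2581}', " ").into_bytes()
}
```

Returning bytes rather than a String matters here: a single “<0xHH>” token may be only part of a multi-byte UTF-8 sequence, which is consistent with decoded_vocab returning Box<[u8]> entries.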

pub fn stop_token_ids(&self) -> Box<[i32]>

Stop token ids.

pub fn special_token_ids(&self) -> Box<[i32]>

The special token ids. Special tokens include control tokens, reserved tokens, padded tokens, etc. They are currently detected automatically from the vocabulary.

pub fn dump_metadata(&self) -> String

Dump the metadata of the tokenizer to a JSON string. Together with the encoded vocabulary, this string can be passed to from_vocab_and_metadata_bytes to reconstruct the TokenizerInfo.

pub fn serialize_json(&self) -> String

Serialize the tokenizer info to a JSON string.

pub fn deserialize_json(json: &str) -> Result<Self, String>

Deserialize a TokenizerInfo from a JSON string.

Returns

  • Ok(TokenizerInfo) on success
  • Err(String) when deserialization fails due to any of the following:
    • invalid JSON syntax
    • schema/format mismatch with TokenizerInfo serialization
    • serialization version mismatch (via the __VERSION__ field)

The error string mirrors the C++ exception message.
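Because a cached JSON string can fail to load for any of the reasons listed, callers may want a fallback path. The sketch below is one possible pattern, written generically so that it stands alone: load_or_rebuild is a hypothetical helper, and its deserialize parameter is meant to be filled with a function shaped like deserialize_json (&str in, Result<T, String> out).

```rust
/// Tries to load a value from cached JSON and rebuilds it when the cache is
/// missing or rejected. Returns the value plus the error message, if any,
/// so the caller can log why the cache was discarded.
fn load_or_rebuild<T, D, R>(
    cached_json: Option<&str>,
    deserialize: D,
    rebuild: R,
) -> (T, Option<String>)
where
    D: Fn(&str) -> Result<T, String>,
    R: FnOnce() -> T,
{
    match cached_json.map(|json| deserialize(json)) {
        Some(Ok(value)) => (value, None),
        // Invalid JSON, schema mismatch, or version mismatch: rebuild.
        Some(Err(message)) => (rebuild(), Some(message)),
        // Nothing cached yet.
        None => (rebuild(), None),
    }
}
```

In practice one would pass TokenizerInfo::deserialize_json as deserialize and a closure calling one of the constructors as rebuild, then write serialize_json output back to the cache after a rebuild.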

Trait Implementations

impl Drop for TokenizerInfo

fn drop(&mut self)

Executes the destructor for this type.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.