#[non_exhaustive]
pub enum MITokenizer {
    HuggingFace(Box<Tokenizer>),
    Rwkv(RwkvTokenizer),
}
Unified tokenizer supporting multiple backends.
Most models use the HuggingFace tokenizers crate. RWKV-6 models
ship their own vocabulary format and require a custom trie-based
tokenizer, which is available behind the rwkv-tokenizer feature.
§Example
use candle_mi::MITokenizer;
let tok = MITokenizer::from_hf_path("tokenizer.json")?;
let ids = tok.encode("fn main()")?;
let text = tok.decode(&ids)?;
assert!(!ids.is_empty());
Variants (Non-exhaustive)§
This enum is marked as non-exhaustive
HuggingFace(Box<Tokenizer>)
HuggingFace tokenizers backend.
Rwkv(RwkvTokenizer)
RWKV World tokenizer (trie-based greedy longest-match).
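To illustrate what "trie-based greedy longest-match" means, here is a minimal, self-contained sketch. It is not the crate's RwkvTokenizer: the ToyTrie type, its methods, and the byte-level layout are assumptions made purely for illustration.

```rust
use std::collections::HashMap;

/// One node of a byte-level trie. `token_id` is set when the path from the
/// root to this node spells a complete vocabulary entry.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    token_id: Option<u32>,
}

/// Toy trie tokenizer (hypothetical; not the crate's `RwkvTokenizer`).
#[derive(Default)]
struct ToyTrie {
    root: TrieNode,
}

impl ToyTrie {
    /// Insert a vocabulary entry together with its token ID.
    fn insert(&mut self, token: &[u8], id: u32) {
        let mut node = &mut self.root;
        for &b in token {
            node = node.children.entry(b).or_default();
        }
        node.token_id = Some(id);
    }

    /// Greedy longest-match: at each position, take the longest vocabulary
    /// entry that prefixes the remaining input. Returns `None` if some
    /// position matches no entry at all.
    fn encode(&self, text: &str) -> Option<Vec<u32>> {
        let bytes = text.as_bytes();
        let mut ids = Vec::new();
        let mut pos = 0;
        while pos < bytes.len() {
            let mut node = &self.root;
            // (end index, token id) of the longest match seen so far.
            let mut best: Option<(usize, u32)> = None;
            for (i, b) in bytes[pos..].iter().enumerate() {
                match node.children.get(b) {
                    Some(next) => {
                        node = next;
                        if let Some(id) = node.token_id {
                            best = Some((pos + i + 1, id));
                        }
                    }
                    None => break,
                }
            }
            let (end, id) = best?;
            ids.push(id);
            pos = end;
        }
        Some(ids)
    }
}
```

With entries "a", "ab", "abc", and "c", the input "abcab" is split as "abc" then "ab": the walk always commits to the longest entry reachable from the current position and never backtracks.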
Implementations§
impl MITokenizer
pub fn from_hf_path(path: impl AsRef<Path>) -> Result<Self>
Load a HuggingFace tokenizer from a tokenizer.json file.
§Errors
Returns MIError::Tokenizer if the file cannot be loaded or parsed.
pub fn from_rwkv_path(path: impl AsRef<Path>) -> Result<Self>
Load an RWKV World tokenizer from a vocabulary file.
§Errors
Returns MIError::Tokenizer if the file cannot be loaded or parsed.
pub fn encode(&self, text: &str) -> Result<Vec<u32>>
Encode text into token IDs, adding special tokens (e.g. BOS for Gemma).
Special tokens are added according to the tokenizer’s configured
post-processor, matching the HuggingFace convention for inference.
§Errors
Returns MIError::Tokenizer if encoding fails.
pub fn encode_raw(&self, text: &str) -> Result<Vec<u32>>
Encode text into token IDs without adding special tokens.
Useful for MI analyses that need raw tokenization without BOS/EOS.
§Errors
Returns MIError::Tokenizer if encoding fails.
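When aligning positions between the two encodings, it can help to know where the raw tokens start inside the special-token version. The helper below is a hypothetical sketch, not part of the crate. It assumes the post-processor only wraps the raw IDs (for example, by prepending BOS) and leaves them as a contiguous run, which may not hold for every tokenizer.

```rust
/// Returns the index at which `raw` starts inside `with_special`, if `raw`
/// appears there as a contiguous run. Hypothetical helper for illustration.
fn raw_start(with_special: &[u32], raw: &[u32]) -> Option<usize> {
    // `windows(0)` would panic, so an empty `raw` is treated as "not found".
    if raw.is_empty() {
        return None;
    }
    with_special.windows(raw.len()).position(|w| w == raw)
}
```

Given the output of encode as with_special and the output of encode_raw as raw, a result of Some(1) would indicate one leading special token.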
pub fn encode_with_offsets(&self, text: &str) -> Result<EncodingWithOffsets>
Encode text into token IDs with character offset mapping.
Returns an EncodingWithOffsets containing token IDs, token strings,
and a byte-offset range for each token. Special tokens are added
(e.g., BOS for Gemma) and receive a (0, 0) offset.
§Errors
Returns MIError::Tokenizer if encoding fails or if the backend
does not support offset mapping (RWKV).
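As a rough sketch of how byte-offset ranges can be mapped back to the source text, the function below slices the input per token and skips empty ranges such as the (0, 0) assigned to special tokens. The field names of EncodingWithOffsets are not shown on this page, so the sketch takes a plain slice of (start, end) pairs instead; token_spans is a hypothetical helper.

```rust
/// Map each byte-offset range back to the substring of `text` it covers.
/// Hypothetical helper: takes plain `(start, end)` pairs rather than the
/// crate's `EncodingWithOffsets`, whose fields are not shown here.
fn token_spans<'a>(text: &'a str, offsets: &[(usize, usize)]) -> Vec<Option<&'a str>> {
    offsets
        .iter()
        .map(|&(start, end)| {
            if start == end {
                // Empty range, e.g. the (0, 0) given to special tokens.
                None
            } else {
                // `get` returns None instead of panicking when the range is
                // out of bounds or does not fall on a UTF-8 char boundary.
                text.get(start..end)
            }
        })
        .collect()
}
```

Using str::get rather than direct indexing matters because byte offsets can fall in the middle of a multi-byte character, in which case indexing would panic.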
pub fn encode_raw_with_offsets(&self, text: &str) -> Result<EncodingWithOffsets>
Encode text into token IDs with character offset mapping, without adding special tokens.
§Errors
Returns MIError::Tokenizer if encoding fails or if the backend
does not support offset mapping (RWKV).
pub fn vocab_size(&self) -> usize
Get vocabulary size.
pub fn find_token_id(&self, word: &str) -> Result<u32>
Find the token ID for a word, trying " word" (with leading space) first,
then bare "word".
This handles BPE tokenizers that represent word-initial tokens with a
leading space (e.g., " cat" → single token).
§Errors
Returns MIError::Tokenizer if the word cannot be resolved to a
single token in either form.
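The lookup order described above can be sketched with a plain map standing in for the vocabulary. This is a simplification under stated assumptions: toy_find_token_id and the HashMap vocabulary are hypothetical, and the vocabulary here stores a literal leading space, whereas real BPE vocabularies often encode it with a marker character.

```rust
use std::collections::HashMap;

/// Try " word" (leading space) first, then fall back to bare "word".
/// Hypothetical stand-in for illustration; not the crate's implementation.
fn toy_find_token_id(vocab: &HashMap<String, u32>, word: &str) -> Option<u32> {
    let spaced = format!(" {word}");
    vocab
        .get(&spaced)
        .or_else(|| vocab.get(word))
        .copied()
}
```

Trying the spaced form first means that when both " cat" and "cat" exist, the word-initial token is the one returned, which is usually the form that appears mid-sentence.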
pub fn decode_token(&self, token_id: u32) -> Result<String>
Decode a single token ID to its string representation.
§Errors
Returns MIError::Tokenizer if decoding fails.
Auto Trait Implementations§
impl Freeze for MITokenizer
impl RefUnwindSafe for MITokenizer
impl Send for MITokenizer
impl Sync for MITokenizer
impl Unpin for MITokenizer
impl UnsafeUnpin for MITokenizer
impl UnwindSafe for MITokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise.