
Trait Tokenize
pub trait Tokenize: Send + Sync {
    // Required methods
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError>;
    fn vocab_size(&self) -> usize;
}

Common interface for all tokenizer backends.

Implemented by Tokenizer (BPE), SentencePieceTokenizer (unigram), and WordPieceTokenizer (WordPiece).
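The trait's shape can be illustrated with a minimal byte-level backend. `ByteTokenizer` below is a hypothetical example, not one of the crate's implementors, and the trait and `TokenizeError` are re-declared locally (as a simple wrapper type, an assumption) so the sketch compiles on its own; in real code you would import them from the crate.

```rust
// Re-declared locally for a self-contained sketch; import from the crate in real code.
#[derive(Debug)]
pub struct TokenizeError(pub String);

pub trait Tokenize: Send + Sync {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError>;
    fn vocab_size(&self) -> usize;
}

/// Hypothetical toy backend: one token per byte, so the vocabulary has 256 entries.
struct ByteTokenizer;

impl Tokenize for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }

    fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError> {
        // Any id above 255 is invalid for this backend and surfaces as an error.
        let bytes: Result<Vec<u8>, TokenizeError> = ids
            .iter()
            .map(|&id| {
                u8::try_from(id).map_err(|_| TokenizeError(format!("invalid token id {id}")))
            })
            .collect();
        String::from_utf8(bytes?).map_err(|e| TokenizeError(e.to_string()))
    }

    fn vocab_size(&self) -> usize {
        256
    }
}

fn main() {
    // Because Tokenize is object-safe, backends can be swapped behind a trait object.
    let tok: Box<dyn Tokenize> = Box::new(ByteTokenizer);
    let ids = tok.encode("hi");
    println!("{ids:?}");
    println!("{}", tok.decode(&ids).unwrap());
    println!("{}", tok.vocab_size());
}
```

The `Send + Sync` supertraits mean any implementor (and hence any `Box<dyn Tokenize>`) can be shared across threads, e.g. behind an `Arc`.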

Required Methods


fn encode(&self, text: &str) -> Vec<u32>

Encode text into token IDs.


fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError>

Decode token IDs back to text.

Returns an error if any token ID is invalid.
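A caller is expected to handle that error rather than assume decoding succeeds. The sketch below uses a hypothetical word-level stub (not a real backend; the trait and error type are re-declared locally so it stands alone) to show an out-of-range id surfacing as `Err`:

```rust
// Re-declared locally for a self-contained sketch; import from the crate in real code.
#[derive(Debug)]
pub struct TokenizeError(pub String);

pub trait Tokenize: Send + Sync {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError>;
    fn vocab_size(&self) -> usize;
}

/// Hypothetical stub: one token per whole word; unknown words are skipped on encode.
struct StubTokenizer {
    vocab: Vec<String>,
}

impl Tokenize for StubTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .filter_map(|w| self.vocab.iter().position(|v| v == w))
            .map(|i| i as u32)
            .collect()
    }

    fn decode(&self, ids: &[u32]) -> Result<String, TokenizeError> {
        // An id with no vocabulary entry is invalid: fail the whole call.
        ids.iter()
            .map(|&id| {
                self.vocab
                    .get(id as usize)
                    .map(String::as_str)
                    .ok_or_else(|| TokenizeError(format!("token id {id} out of range")))
            })
            .collect::<Result<Vec<_>, _>>()
            .map(|words| words.join(" "))
    }

    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

fn main() {
    let tok = StubTokenizer {
        vocab: vec!["hello".into(), "world".into()],
    };
    match tok.decode(&[0, 1]) {
        Ok(text) => println!("{text}"),
        Err(TokenizeError(msg)) => eprintln!("decode failed: {msg}"),
    }
    // An out-of-range id returns Err instead of panicking.
    assert!(tok.decode(&[42]).is_err());
}
```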


fn vocab_size(&self) -> usize

Return the vocabulary size (number of distinct tokens).

Implementors