Trait Tokenizer 

pub trait Tokenizer {
    // Required methods
    fn encode(&self, input: &str) -> Result<Vec<String>>;
    fn decode(&self, tokens: Vec<String>) -> Result<String>;
}

Defines the methods a tokenizer must implement.

This trait provides the core operations for converting a string into a sequence of tokens and back. Text-processing tasks such as natural language processing rely on these operations to break text into tokens and to rebuild text from tokenized output.
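
As a minimal sketch of what an implementor might look like, the block below declares a local stand-in for the trait with the same two method signatures and implements it for a hypothetical WhitespaceTokenizer. The Result alias (a boxed std::error::Error) is an assumption made to keep the example self-contained, and the crate's own Result type may differ. WhitespaceTokenizer is illustrative only and is not part of ct2rs.

```rust
use std::error::Error;

// Assumption: stand-in for the crate's `Result` alias.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

// Local copy of the trait, mirroring the signatures documented here.
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>>;
    fn decode(&self, tokens: Vec<String>) -> Result<String>;
}

// Hypothetical implementor: splits on whitespace, joins with single spaces.
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>> {
        // Runs of whitespace are collapsed, so no empty tokens are produced.
        Ok(input.split_whitespace().map(str::to_owned).collect())
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String> {
        // The original spacing is not recoverable; a single space is used.
        Ok(tokens.join(" "))
    }
}
```

With this toy implementor, decoding does not necessarily reproduce the input byte for byte, because encode discards the exact whitespace.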

Required Methods§

fn encode(&self, input: &str) -> Result<Vec<String>>

Encodes a given string into a sequence of tokens.

This function takes a string slice and returns the tokens produced by tokenizing it, as a vector of strings.

§Arguments
  • input - The string slice to tokenize.
§Returns

A Result containing the vector of tokens on success, or an error if tokenization fails.

fn decode(&self, tokens: Vec<String>) -> Result<String>

Decodes a given sequence of tokens back into a single string.

This function takes ownership of a vector of token strings and reassembles them into a single string.

§Arguments
  • tokens - A vector of strings representing the tokens to be decoded.
§Returns

A Result containing the reconstructed string on success, or an error if decoding fails.
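
The following sketch illustrates how the two methods might be chained and how their errors can be propagated with the ? operator. Because decode takes its Vec<String> by value, the token vector is moved into the call, so anything needed afterwards (here, the token count) is read first. The trait is redeclared locally with an assumed Result alias, and round_trip and CharTokenizer are hypothetical names introduced only for this example.

```rust
use std::error::Error;

// Assumption: stand-in for the crate's `Result` alias.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

// Local copy of the trait, mirroring the signatures documented here.
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>>;
    fn decode(&self, tokens: Vec<String>) -> Result<String>;
}

// Hypothetical helper: encodes, then decodes, returning the token count
// together with the decoded text.
pub fn round_trip<T: Tokenizer>(tokenizer: &T, input: &str) -> Result<(usize, String)> {
    let tokens = tokenizer.encode(input)?;
    let count = tokens.len();
    // `decode` consumes the vector; clone it first if the tokens are
    // still needed after this call.
    let text = tokenizer.decode(tokens)?;
    Ok((count, text))
}

// Hypothetical toy implementor: one token per character.
pub struct CharTokenizer;

impl Tokenizer for CharTokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>> {
        Ok(input.chars().map(String::from).collect())
    }

    fn decode(&self, tokens: Vec<String>) -> Result<String> {
        Ok(tokens.concat())
    }
}
```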

Dyn Compatibility§

This trait is dyn compatible.

In older versions of Rust, dyn compatibility was called "object safety".
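
Dyn compatibility means the trait can be used behind a trait object, so the concrete tokenizer can be chosen at runtime. The sketch below shows one way this might look, using a locally redeclared copy of the trait with an assumed Result alias. Whitespace, Chars, pick, and count_tokens are hypothetical names for illustration and do not come from the crate.

```rust
use std::error::Error;

// Assumption: stand-in for the crate's `Result` alias.
type Result<T> = std::result::Result<T, Box<dyn Error>>;

// Local copy of the trait, mirroring the signatures documented here.
pub trait Tokenizer {
    fn encode(&self, input: &str) -> Result<Vec<String>>;
    fn decode(&self, tokens: Vec<String>) -> Result<String>;
}

// Two hypothetical implementors with different tokenization rules.
pub struct Whitespace;
pub struct Chars;

impl Tokenizer for Whitespace {
    fn encode(&self, input: &str) -> Result<Vec<String>> {
        Ok(input.split_whitespace().map(str::to_owned).collect())
    }
    fn decode(&self, tokens: Vec<String>) -> Result<String> {
        Ok(tokens.join(" "))
    }
}

impl Tokenizer for Chars {
    fn encode(&self, input: &str) -> Result<Vec<String>> {
        Ok(input.chars().map(String::from).collect())
    }
    fn decode(&self, tokens: Vec<String>) -> Result<String> {
        Ok(tokens.concat())
    }
}

// Chooses an implementation at runtime and erases its concrete type.
pub fn pick(by_char: bool) -> Box<dyn Tokenizer> {
    if by_char {
        Box::new(Chars)
    } else {
        Box::new(Whitespace)
    }
}

// Works with any tokenizer through dynamic dispatch.
pub fn count_tokens(tokenizer: &dyn Tokenizer, input: &str) -> Result<usize> {
    Ok(tokenizer.encode(input)?.len())
}
```

Generic code (T: Tokenizer) avoids the dynamic dispatch cost, while a trait object keeps call sites independent of which tokenizer was selected.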

Implementors§

impl Tokenizer for ct2rs::tokenizers::auto::Tokenizer

impl Tokenizer for ct2rs::tokenizers::hf::Tokenizer

Available on crate feature tokenizers only.

impl Tokenizer for ct2rs::tokenizers::sentencepiece::Tokenizer

Available on crate feature sentencepiece only.