Trait Tokenizer 

pub trait Tokenizer {
    // Required method
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;

    // Provided method
    fn token_count(&self, text: &str) -> usize { ... }
}
Tokenizer over text slices.

Implementations are expected to be cheap to construct (ideally zero-sized) and stateless. Methods still take &self so that future implementations can carry configuration (e.g. vocabulary, normalisation flags).
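As a rough illustration of both cases described above, the sketch below shows one zero-sized implementor and one that carries configuration through &self. The trait is redeclared locally so the snippet stands alone. WhitespaceTokenizer, DelimiterTokenizer and the delimiter field are hypothetical names invented for this example, not types known to exist in this crate.

```rust
// Local copy of the required part of the trait, so the sketch compiles alone.
trait Tokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Hypothetical zero-sized tokenizer: a unit struct has no fields,
/// so constructing it costs nothing and it holds no state.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        // split_whitespace yields subslices of `text` and skips empty pieces.
        text.split_whitespace().collect()
    }
}

/// Hypothetical configured tokenizer: the delimiter is read through `&self`.
struct DelimiterTokenizer {
    delimiter: char,
}

impl Tokenizer for DelimiterTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        // Drop the empty pieces produced by adjacent delimiters.
        text.split(self.delimiter)
            .filter(|token| !token.is_empty())
            .collect()
    }
}
```

Code written against the trait, for example a function generic over T: Tokenizer, can accept either type without caring whether it holds configuration.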

Required Methods

fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>

Split text into tokens, returning slices into the original string.
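The signature ties the returned slices to the lifetime 'a of text rather than to &self, so tokens borrow from the input and may outlive the tokenizer that produced them. The following sketch tries to make that concrete. CommaTokenizer, tokens_outlive_tokenizer and is_subslice are hypothetical helpers written for this example only.

```rust
// Local copy of the required method, so the sketch compiles alone.
trait Tokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Hypothetical tokenizer that splits on commas and trims each piece.
struct CommaTokenizer;

impl Tokenizer for CommaTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split(',')
            .map(str::trim) // trim returns a subslice, not a new String
            .filter(|token| !token.is_empty())
            .collect()
    }
}

/// The tokenizer is dropped when this function returns, yet the tokens
/// remain valid because they borrow from `text`, not from the tokenizer.
fn tokens_outlive_tokenizer(text: &str) -> Vec<&str> {
    let tokenizer = CommaTokenizer;
    tokenizer.tokenize(text)
}

/// Returns true when `inner` lies inside the memory range of `outer`,
/// i.e. it is a view into the same buffer rather than a copy.
fn is_subslice(outer: &str, inner: &str) -> bool {
    let outer_start = outer.as_ptr() as usize;
    let inner_start = inner.as_ptr() as usize;
    inner_start >= outer_start
        && inner_start + inner.len() <= outer_start + outer.len()
}
```

Because no token is copied, the input string must stay alive for as long as the tokens are in use.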

Provided Methods

fn token_count(&self, text: &str) -> usize

Count the number of tokens in text.

Implementations should override this when a direct count is cheaper than collecting tokens into a Vec.
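A minimal sketch of such an override follows. The body of the provided method is elided in this page, so the default shown here (collect the tokens, then take the length) is an assumption, as is the hypothetical WhitespaceTokenizer type.

```rust
trait Tokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str>;

    // ASSUMED default: the real body is not shown in the documentation.
    // This version allocates a Vec only to read its length.
    fn token_count(&self, text: &str) -> usize {
        self.tokenize(text).len()
    }
}

/// Hypothetical implementor used for illustration.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().collect()
    }

    // Override: count straight from the iterator, with no allocation.
    fn token_count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}
```

An override should return the same value as tokenize(text).len(), otherwise callers that mix the two methods will see inconsistent results.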

Implementors