Struct Tokenizer

pub struct Tokenizer {
    pub bpe: BytePairEncoding,
    pub pre: Option<Pretokenizer>,
}

A byte-pair encoding tokenizer that supports a pre-tokenization regex. The direct methods on this type pre-tokenize the input text and should produce the same output as the tiktoken tokenizers. The type gives access to the regex and underlying byte-pair encoding if needed. Note that using the byte-pair encoding directly does not take the regex into account and may result in output that differs from tiktoken.
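
To illustrate why encoding with and without pre-tokenization can disagree, here is a minimal sketch using only the standard library. Everything in it is hypothetical: `VOCAB`, `encode_piece`, `pretokenize`, and `encode_pretokenized` are illustration names, the greedy longest-match encoder is a stand-in for real byte-pair merging, and the split-before-space rule is a stand-in for the regex. None of it is this crate's implementation.

```rust
/// Hypothetical toy vocabulary; the index is the token id.
static VOCAB: [&str; 4] = ["a", "b", " ", "a b"];

/// Stand-in for the underlying encoder: greedy longest match.
/// Real byte-pair encoding merges pairs by rank instead.
fn encode_piece(piece: &str) -> Vec<u32> {
    let mut tokens = Vec::new();
    let mut rest = piece;
    while !rest.is_empty() {
        // Pick the longest vocabulary entry that prefixes the remaining text.
        let (id, entry) = VOCAB
            .iter()
            .enumerate()
            .filter(|(_, entry)| rest.starts_with(**entry))
            .max_by_key(|(_, entry)| entry.len())
            .expect("the toy vocabulary only covers 'a', 'b' and ' '");
        tokens.push(id as u32);
        rest = &rest[entry.len()..];
    }
    tokens
}

/// Stand-in for the pre-tokenization regex: start a new piece before each space.
fn pretokenize(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if c == ' ' && i > start {
            pieces.push(&text[start..i]);
            start = i;
        }
    }
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}

/// Encode each piece on its own, so no token can span a piece boundary.
fn encode_pretokenized(text: &str) -> Vec<u32> {
    pretokenize(text).into_iter().flat_map(encode_piece).collect()
}
```

In this toy setup, `encode_piece("a b")` can use the single entry `"a b"`, while `encode_pretokenized("a b")` first splits the text into `"a"` and `" b"` and therefore cannot. The same effect is why calling the byte-pair encoding directly may differ from the tokenizer's own methods.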

Fields§

§bpe: BytePairEncoding

The byte-pair encoding for this tokenizer.

§pre: Option<Pretokenizer>

The regex-based pre-tokenizer used to split the input, or None if the input is not split before encoding.

Implementations§

impl Tokenizer

pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> Result<Self, BuildError>

Build a tokenizer with an optional pretokenization regex pattern.

pub fn new_lookahead( bpe: BytePairEncoding, patterns: &[(&str, bool)], ) -> Result<Self, BuildError>

Build a tokenizer from a list of pretokenization regex patterns. Each pattern is paired with a boolean. If the boolean is true, the pattern is assumed to be a look-ahead pattern with exactly one look-ahead character.
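
One plausible reading of "exactly one look-ahead character" is that a flagged pattern matches one character more than the piece it should produce, and that the final character is dropped from the match afterwards. The sketch below shows only that trimming step. It is an assumption about the mechanism, not a description of the crate's code, and `trim_lookahead` is a hypothetical helper name.

```rust
/// Given a match `text[start..end]` whose final character was matched only as
/// look-ahead, return the piece without that character.
/// Hypothetical helper; assumes `start` and `end` lie on char boundaries.
fn trim_lookahead(text: &str, start: usize, end: usize) -> &str {
    let matched = &text[start..end];
    // Step back by the UTF-8 width of the last character, not by one byte.
    let last_width = matched.chars().next_back().map_or(0, char::len_utf8);
    &matched[..matched.len() - last_width]
}
```

For example, if a whitespace pattern matched the three spaces in `"a   b"` (bytes 1 to 4), trimming leaves two spaces as the piece and returns the third space to the remaining input, where a later pattern can attach it to `b`.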

pub fn count(&self, text: &str) -> usize

Count the number of tokens produced when encoding the text. Applies pre-tokenization before counting.

pub fn count_till_limit(&self, text: &str, token_limit: usize) -> Option<usize>

Returns the token count if and only if the total token count stays below the specified token_limit. Otherwise, it returns None. This function can be faster than Self::count when the token limit is much smaller than the token count of the provided text, because counting can stop as soon as the limit is exceeded. Applies pre-tokenization before counting.
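
The early-exit idea can be sketched with the standard library alone. This is a toy model under stated assumptions: every whitespace-separated word counts as one token, and a count equal to `token_limit` is treated as still within the limit. The exact boundary behavior of the real method should be checked against the crate. The free functions `count` and `count_till_limit` below are hypothetical stand-ins for the methods of the same name.

```rust
/// Toy stand-in: one token per whitespace-separated word.
fn count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Returns `Some(count)` while the running count stays within `token_limit`
/// (inclusive in this sketch), and `None` as soon as it is exceeded.
fn count_till_limit(text: &str, token_limit: usize) -> Option<usize> {
    let mut count = 0;
    for _piece in text.split_whitespace() {
        count += 1;
        if count > token_limit {
            // Stop early: the rest of the text is never inspected.
            return None;
        }
    }
    Some(count)
}
```

The returned `Option` makes the over-limit case explicit at the call site, for example `if let Some(n) = count_till_limit(text, 8192) { ... }`.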

pub fn encode(&self, text: &str) -> Vec<u32>

Returns the tokens for the encoding of the given text. Applies pre-tokenization before encoding.

pub fn decode(&self, tokens: &[u32]) -> Option<String>

Returns the text corresponding to the given encoding if it is valid UTF-8. Otherwise, returns None.
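
Decoding can fail because tokens stand for byte sequences, and a token boundary may fall inside a multi-byte character. The following sketch shows that failure mode with a hypothetical three-entry vocabulary (`TOKEN_BYTES`) and a free function `decode`. Both are illustration names, not the crate's data or code.

```rust
/// Hypothetical toy vocabulary: the index is the token id, the value its bytes.
/// Tokens 1 and 2 are the two halves of the UTF-8 encoding of 'é'.
static TOKEN_BYTES: [&[u8]; 3] = [b"h", &[0xC3], &[0xA9]];

/// Concatenate the bytes of all tokens and validate the result as UTF-8.
/// In this sketch, ids outside the toy vocabulary also yield `None`.
fn decode(tokens: &[u32]) -> Option<String> {
    let mut bytes = Vec::new();
    for &token in tokens {
        bytes.extend_from_slice(TOKEN_BYTES.get(token as usize)?);
    }
    // `from_utf8` rejects sequences that end in the middle of a character.
    String::from_utf8(bytes).ok()
}
```

A practical consequence is that an arbitrary prefix of a valid token sequence, such as a truncated output, is not guaranteed to decode.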

pub fn split<'a>(&'a self, text: &'a str) -> impl Iterator<Item = &'a str> + 'a

Returns an iterator with the text pieces resulting from pre-tokenization. If this tokenizer does not have pre-tokenization, the iterator returns the full text.
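
The fallback behavior can be sketched as follows. Assumptions: an optional delimiter character stands in for the optional pre-tokenizer, and a boxed iterator is used to unify the two branches, whereas the method above returns `impl Iterator`. The free function `split` here is a hypothetical stand-in.

```rust
/// Yield pieces of `text`. With no pre-tokenizer (`None`), yield the full text once.
fn split<'a>(pre: Option<char>, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
    match pre {
        // Each piece keeps its trailing delimiter, so the pieces concatenate
        // back to the original text.
        Some(delim) => Box::new(text.split_inclusive(delim)),
        None => Box::new(std::iter::once(text)),
    }
}
```

Because the pieces borrow from `text`, no string data is copied while iterating.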

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.