pub struct Tokenizer {
pub bpe: BytePairEncoding,
pub pre: Option<Pretokenizer>,
/* private fields */
}
A byte-pair encoding tokenizer that supports a pre-tokenization regex. The direct methods on this type pre-tokenize the input text and should produce the same output as the tiktoken tokenizers. The type gives access to the regex and underlying byte-pair encoding if needed. Note that using the byte-pair encoding directly does not take the regex into account and may result in output that differs from tiktoken.
Fields§
§bpe: BytePairEncoding
The byte-pair encoding for this tokenizer.
§pre: Option<Pretokenizer>
The pattern regex used to split the input.
Implementations§
impl Tokenizer
pub fn new( bpe: BytePairEncoding, pat: Option<&str>, nfc: bool, ) -> Result<Self, BuildError>
Build a tokenizer with an optional pretokenization regex pattern.
pub fn new_lookahead( bpe: BytePairEncoding, patterns: &[(&str, bool)], nfc: bool, ) -> Result<Self, BuildError>
Build a tokenizer from a list of pretokenization regex patterns. If the boolean paired with a pattern is true, that pattern is assumed to be a look-ahead pattern with exactly one look-ahead character.
pub fn count<'a, I: Normalizable<'a>>(&self, text: I) -> usize
Count the number of tokens produced when encoding the text. Applies pre-tokenization before counting.
pub fn count_till_limit( &self, text: &NormalizedString<'_>, token_limit: usize, ) -> Option<usize>
Returns the token count only if the total token count stays below the specified token_limit. Otherwise, it returns None. This function can be faster than Self::count when the token limit is much smaller than the provided text. Applies pre-tokenization before counting.
Note: This function assumes that the text is already normalized, so that it can run in roughly O(token_limit) time.
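The early exit described above can be sketched with a toy tokenizer. This is not the crate's implementation: `toy_tokens` is a hypothetical stand-in that treats every whitespace-separated word as one token, and the sketch assumes the limit is inclusive, which the description does not spell out.

```rust
/// Hypothetical stand-in for a real tokenizer: one token per
/// whitespace-separated word. Tokens are produced lazily.
fn toy_tokens(text: &str) -> impl Iterator<Item = &str> {
    text.split_whitespace()
}

/// Counts tokens, but stops as soon as the running count would exceed
/// `token_limit`. Assumption of this sketch: the limit is inclusive.
fn count_till_limit(text: &str, token_limit: usize) -> Option<usize> {
    toy_tokens(text).try_fold(0usize, |count, _token| {
        if count < token_limit {
            Some(count + 1)
        } else {
            // Returning None short-circuits try_fold, so the rest of
            // the text is never tokenized.
            None
        }
    })
}

fn main() {
    let text = "the quick brown fox jumps over the lazy dog";
    println!("{:?}", count_till_limit(text, 20));
    println!("{:?}", count_till_limit(text, 3));
}
```

Because the iterator is lazy, the work done is bounded by the limit rather than by the length of the input, which is the property the note above relies on.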
pub fn encode<'a, I: Normalizable<'a>>(&self, text: I) -> Vec<u32>
Returns the tokens for the encoding of the given text. Applies pre-tokenization before encoding.
pub fn decode(&self, tokens: &[u32]) -> Option<String>
Returns the text corresponding to the given encoding if it is valid UTF-8. Otherwise, returns None.
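A minimal sketch of why decoding can fail: byte-level tokens may split a multi-byte UTF-8 character, so an arbitrary token sequence need not concatenate to valid text. The four-entry `VOCAB` table below is hypothetical and exists only for illustration, and returning None for an unknown token id is a choice of this sketch, not documented behavior of the crate.

```rust
/// Hypothetical toy vocabulary: the token id is the index into this table.
/// Tokens 2 and 3 are the two bytes of the UTF-8 encoding of 'é'.
const VOCAB: [&[u8]; 4] = [b"h", b"i", &[0xC3u8], &[0xA9u8]];

/// Concatenates the bytes of each token and validates the result as UTF-8.
fn decode(tokens: &[u32]) -> Option<String> {
    let mut bytes = Vec::new();
    for &token in tokens {
        // Sketch choice: unknown token ids yield None.
        bytes.extend_from_slice(VOCAB.get(token as usize)?);
    }
    // from_utf8 fails if the bytes do not form valid UTF-8.
    String::from_utf8(bytes).ok()
}

fn main() {
    println!("{:?}", decode(&[0, 1])); // both tokens are ASCII
    println!("{:?}", decode(&[2, 3])); // the two halves of 'é' together
    println!("{:?}", decode(&[2])); // a lone lead byte is not valid UTF-8
}
```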
pub fn split<'a>(&'a self, text: &'a str) -> impl Iterator<Item = &'a str>
Returns an iterator with the text pieces resulting from pre-tokenization. If this tokenizer does not have pre-tokenization, the iterator returns the full text.
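The fallback behavior can be illustrated with a small sketch. Here a single separator character stands in for the pre-tokenization regex; this is a simplifying assumption, and the free function `split` below is hypothetical rather than the method documented above.

```rust
/// Splits `text` after each occurrence of `sep` when a separator is
/// configured; otherwise yields the full text as a single piece.
/// The `Option<char>` is a hypothetical stand-in for `Option<Pretokenizer>`.
fn split<'a>(pre: Option<char>, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
    match pre {
        // split_inclusive keeps the separator attached to the preceding
        // piece, so the pieces concatenate back to the original text.
        Some(sep) => Box::new(text.split_inclusive(sep)),
        None => Box::new(std::iter::once(text)),
    }
}

fn main() {
    let text = "a b c";
    let with_pre: Vec<&str> = split(Some(' '), text).collect();
    let without_pre: Vec<&str> = split(None, text).collect();
    println!("{:?}", with_pre);
    println!("{:?}", without_pre);
}
```

Both branches return borrowed slices of the input, so no text is copied in either case.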
pub fn normalize<'a, I: Normalizable<'a>>( &self, text: I, ) -> NormalizedString<'a>
Returns the normalized text if the tokenizer requires normalization. If the input was already normalized, this function is a no-op.
Auto Trait Implementations§
impl Freeze for Tokenizer
impl RefUnwindSafe for Tokenizer
impl Send for Tokenizer
impl Sync for Tokenizer
impl Unpin for Tokenizer
impl UnwindSafe for Tokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.