pub struct Tokenizer {
    pub bpe: BytePairEncoding,
    pub pre: Option<Pretokenizer>,
}
A byte-pair encoding tokenizer that supports a pre-tokenization regex. The direct methods on this type pre-tokenize the input text and should produce the same output as the tiktoken tokenizers. The type gives access to the regex and underlying byte-pair encoding if needed. Note that using the byte-pair encoding directly does not take the regex into account and may result in output that differs from tiktoken.
Fields§
bpe: BytePairEncoding
The byte-pair encoding for this tokenizer.
pre: Option<Pretokenizer>
The optional pre-tokenizer whose regex pattern is used to split the input.
Implementations§
impl Tokenizer
pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> Result<Self, BuildError>
Build a tokenizer with an optional pre-tokenization regex pattern.
pub fn new_lookahead(bpe: BytePairEncoding, patterns: &[(&str, bool)]) -> Result<Self, BuildError>
Build a tokenizer with pre-tokenization regex patterns. If the boolean for a pattern is true, the pattern is assumed to be a look-ahead pattern with exactly one look-ahead character.
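One plausible reading of the look-ahead flag is that a flagged pattern matches one character more than the piece it should emit, and that extra character is handed back to the input before the next match. The sketch below illustrates only that idea. `emit_piece` is a hypothetical helper, not part of this crate, and the crate's actual handling may differ.

```rust
/// Hypothetical helper (not part of the crate): given the text a pattern
/// matched, return the piece that would be emitted as a pre-token.
fn emit_piece(matched: &str, is_lookahead: bool) -> &str {
    if !is_lookahead {
        return matched;
    }
    // Assumption: a look-ahead pattern matched exactly one extra character.
    // Drop it, taking care that it may be multi-byte in UTF-8.
    let last = matched
        .chars()
        .next_back()
        .expect("a look-ahead pattern must match at least one character");
    &matched[..matched.len() - last.len_utf8()]
}

fn main() {
    // A pattern such as `\s+(?!\S)` could be rewritten as `\s+\s` with the
    // flag set: the trailing whitespace character is matched, then given back.
    assert_eq!(emit_piece("  \n", true), "  ");
    // Patterns without the flag emit their full match.
    assert_eq!(emit_piece("abc", false), "abc");
    // The dropped look-ahead character may be wider than one byte.
    assert_eq!(emit_piece("a\u{e9}", true), "a");
    println!("ok");
}
```

Under this reading, a flagged pattern has to match at least one character besides the look-ahead character, otherwise the emitted piece would be empty.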
pub fn count(&self, text: &str) -> usize
Count the number of tokens produced when encoding the text. Applies pre-tokenization before counting.
pub fn count_till_limit(&self, text: &str, token_limit: usize) -> Option<usize>
Returns the token count if the total token count stays below the specified token_limit. Otherwise, it returns None. This function can be faster than Self::count when the token limit is much smaller than the provided text. Applies pre-tokenization before counting.
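The speed-up comes from stopping early: once the running total passes the limit, the rest of the text does not need to be examined. The following is a minimal sketch of that control flow using toy stand-ins. The free function `count_till_limit`, the helper `one_per_piece`, and the whitespace pre-tokenization are all assumptions for illustration, and treating a count exactly equal to the limit as acceptable is also an assumption.

```rust
/// Toy stand-in (not the crate's implementation): add up per-piece token
/// counts and stop as soon as the running total exceeds `token_limit`.
fn count_till_limit<'a, I, F>(pieces: I, count_piece: F, token_limit: usize) -> Option<usize>
where
    I: IntoIterator<Item = &'a str>,
    F: Fn(&str) -> usize,
{
    let mut total = 0usize;
    for piece in pieces {
        total += count_piece(piece);
        if total > token_limit {
            // Early exit: the remaining pieces are never visited.
            return None;
        }
    }
    Some(total)
}

/// Toy counting rule: every pre-tokenized piece costs exactly one token.
fn one_per_piece(_piece: &str) -> usize {
    1
}

fn main() {
    let text = "one two three four five";
    // Toy pre-tokenization: split on whitespace.
    assert_eq!(count_till_limit(text.split_whitespace(), one_per_piece, 10), Some(5));
    // The limit is exceeded at the fourth piece, so counting stops there.
    assert_eq!(count_till_limit(text.split_whitespace(), one_per_piece, 3), None);
    println!("ok");
}
```

A typical use of this shape is a guard such as "does this input fit in the context window", where the exact count is irrelevant once the limit is exceeded.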
pub fn encode(&self, text: &str) -> Vec<u32>
Returns the tokens for the encoding of the given text. Applies pre-tokenization before encoding.
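To see why pre-tokenization can change the result compared with encoding the whole text directly, consider the toy model below. It is not byte-pair encoding: `VOCAB`, `encode_piece` (a greedy longest-prefix matcher), and `pretokenize` (which attaches each space to the following word) are all hypothetical stand-ins chosen so the example stays self-contained.

```rust
/// Toy vocabulary; a token id is the index into this slice.
/// Hypothetical data, not a real BPE vocabulary.
const VOCAB: &[&str] = &[
    "hell", "o w", "orld", " world", "h", "e", "l", "o", " ", "w", "r", "d",
];

/// Toy encoder: greedy longest-prefix match, standing in for byte-pair encoding.
fn encode_piece(piece: &str) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = piece;
    while !rest.is_empty() {
        let (id, tok) = VOCAB
            .iter()
            .enumerate()
            .filter(|(_, tok)| rest.starts_with(**tok))
            .max_by_key(|(_, tok)| tok.len())
            .expect("toy vocabulary covers every character used in the example");
        out.push(id as u32);
        rest = &rest[tok.len()..];
    }
    out
}

/// Toy pre-tokenizer: start a new piece at every space, so each space
/// stays attached to the word that follows it.
fn pretokenize(text: &str) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if c == ' ' && i > start {
            pieces.push(&text[start..i]);
            start = i;
        }
    }
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}

/// Encode each pre-tokenized piece on its own and concatenate the results.
fn encode(text: &str) -> Vec<u32> {
    pretokenize(text).into_iter().flat_map(encode_piece).collect()
}

fn main() {
    // Without pre-tokenization, the token "o w" spans the word boundary.
    assert_eq!(encode_piece("hello world"), vec![0, 1, 2]);
    // With pre-tokenization, no token can cross a piece boundary.
    assert_eq!(encode("hello world"), vec![0, 7, 3]);
    println!("ok");
}
```

In this toy both paths happen to produce three tokens, but the token ids differ, which mirrors the note that using the byte-pair encoding directly may not match tiktoken output.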
Auto Trait Implementations§
impl Freeze for Tokenizer
impl RefUnwindSafe for Tokenizer
impl Send for Tokenizer
impl Sync for Tokenizer
impl Unpin for Tokenizer
impl UnwindSafe for Tokenizer
Blanket Implementations§
impl<T> BorrowMut<T> for T where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.