pub struct TokenDict { /* private fields */ }
A simple dictionary-based tokenizer where single bytes implicitly form the lower 256 ids. It is reference-counted and cheap to clone.
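The byte-fallback scheme described above can be sketched without the crate itself: when the lower 256 ids are the single bytes, encoding text that matches no learned multi-byte token is just a byte-to-id cast, and decoding inverts it. This is a minimal stand-alone illustration, not `TokenDict`'s implementation.

```rust
fn main() {
    // Minimal sketch of the byte-fallback idea: ids 0..256 are the
    // single bytes, so with no learned merges, tokenizing a string
    // is a byte-to-id cast.
    let input = "hi";
    let ids: Vec<u32> = input.bytes().map(|b| b as u32).collect();
    assert_eq!(ids, vec![104, 105]); // 'h' = 104, 'i' = 105

    // Ids below 256 decode straight back to their byte values.
    let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
    assert_eq!(String::from_utf8(bytes).unwrap(), "hi");
    println!("{:?}", ids);
}
```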
Implementations

impl TokenDict
pub fn detokenize<I: IntoIterator>(&self, tokens: I) -> Detokenization<I::IntoIter>

Decodes the tokens into bytes.
pub fn detoken_iter<I: IntoIterator>(&self, tokens: I) -> impl Iterator<Item = Token>

Creates an iterator over tokens.
pub fn detokenize_str<I: IntoIterator>(&self, tokens: I) -> impl Iterator<Item = char>

Decodes the tokens into chars, replacing invalid Unicode with the replacement character.
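The replacement-character behavior can be demonstrated with the standard library's lossy UTF-8 decoding; it is an assumption that the method works this way internally, but the documented outcome is the same.

```rust
fn main() {
    // Decoding token bytes to text: 0xFF is never valid UTF-8,
    // so it becomes U+FFFD, the Unicode replacement character.
    let bytes = [104u8, 105, 0xFF];
    let text = String::from_utf8_lossy(&bytes);
    assert_eq!(text, "hi\u{FFFD}");
    println!("{}", text);
}
```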
pub fn detokenize_string<I: IntoIterator>(&self, tokens: I) -> String

Decodes the tokens into a String, replacing invalid Unicode with the replacement character.
pub fn frequencies<I: IntoIterator, O: Into<Option<Vec<usize>>>>(&self, data: I, freq: O) -> Vec<usize>

Accumulates the frequency of each token in the text, as if the text were tokenized by this tokenizer.
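The accumulation pattern can be sketched stand-alone: a frequency vector indexed by token id, incremented per occurrence, where passing in the previous vector (as the optional `freq` parameter above presumably allows) chains counts across multiple inputs. The tokenization step is omitted; plain ids stand in for it.

```rust
fn main() {
    // Accumulate counts into a frequency vector indexed by token id.
    // Reusing the same vector across calls would chain the counts.
    let tokens = [0u32, 1, 0, 2, 1, 0];
    let mut freq = vec![0usize; 256];
    for &t in &tokens {
        freq[t as usize] += 1;
    }
    assert_eq!((freq[0], freq[1], freq[2]), (3, 2, 1));
    println!("{:?}", &freq[..4]);
}
```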
pub fn get_id(&self, token: &[u8]) -> Option<u32>

Gets the id for the token if it is in the dictionary.
pub fn iter(&self) -> DictIter<'_>

Returns an iterator over the possible tokens generated by this tokenizer.
pub fn len(&self) -> usize

Returns the number of possible token ids generated by this tokenizer.
pub fn pairs<I: IntoIterator, O: Into<Option<HashMap<Token, usize>>>>(&self, data: I, freq: O) -> HashMap<Token, usize>

Finds token pairs and returns the new tokens formed from them, mapped to their frequencies, with ids assigned as if they had been added to this dictionary. Tokens already in the dictionary keep their current ids.
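Counting adjacent token pairs is the core step of BPE-style vocabulary growth. A minimal stand-alone version of just the counting (not the crate's id assignment) might look like:

```rust
use std::collections::HashMap;

fn main() {
    // Count adjacent pairs in a token stream, as a BPE merge pass would.
    let tokens = [1u32, 2, 1, 2, 3];
    let mut pairs: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *pairs.entry((w[0], w[1])).or_insert(0) += 1;
    }
    assert_eq!(pairs[&(1, 2)], 2); // the pair (1, 2) occurs twice
    assert_eq!(pairs[&(2, 1)], 1);
    println!("{:?}", pairs);
}
```

The most frequent pair found this way is the natural candidate for the next token added to the dictionary.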
pub fn string_to_tokens<S: ?Sized + AsRef<str>>(&self, input: &S) -> Vec<u32>

Converts the string to a vec of token ids.
pub fn token_iter<I: IntoIterator>(&self, bytes: I) -> impl Iterator<Item = Token>

Creates an iterator over tokens.
pub fn tokenize<I: IntoIterator>(&self, bytes: I) -> Tokenization<I::IntoIter>

Converts the bytes to tokens.
pub fn tokenize_str<'a, S: ?Sized + AsRef<str>>(&self, input: &'a S) -> Tokenization<SliceIter<'a, u8>>

Converts the string to tokens.
pub fn tokenize_string(&self, input: String) -> Tokenization<VecIntoIter<u8>>

Converts the string to tokens.
Trait Implementations

impl<A: AsRef<[u8]>> Extend<A> for TokenDict

fn extend<I: IntoIterator<Item = A>>(&mut self, iter: I)

fn extend_one(&mut self, item: A)

fn extend_reserve(&mut self, additional: usize)