pub struct Tokenizer { /* private fields */ }
A text tokenizer supporting BPE and WordPiece sub-word algorithms.
Implementations
impl Tokenizer
pub fn new(config: TokenizerConfig) -> Self
Create a new tokenizer with the given configuration.
Special tokens ([CLS], [SEP], [PAD], [UNK], [MASK]) are
registered automatically.
pub fn add_token(&mut self, token: &str) -> u32
Add a token to the vocabulary. Returns its ID.
If the token already exists, the existing ID is returned.
pub fn remove_token(&mut self, token: &str) -> bool
Remove a token from the vocabulary. Returns true if it existed.
Special tokens cannot be removed; in that case false is returned.
pub fn vocab_size(&self) -> usize
Current vocabulary size (including special tokens).
pub fn contains_token(&self, token: &str) -> bool
Whether a token is in the vocabulary.
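The vocabulary rules described above (automatic special tokens, idempotent add_token, protected special tokens in remove_token) can be illustrated with a small stand-in. This is a minimal sketch, not this crate's implementation: the Vocab type, its fields, and the sequential ID assignment are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Special tokens named in the documentation for `Tokenizer::new`.
const SPECIAL_TOKENS: [&str; 5] = ["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"];

/// Hypothetical stand-in for the tokenizer's vocabulary (not the real type).
struct Vocab {
    ids: HashMap<String, u32>,
    next_id: u32, // assumption: IDs are handed out sequentially
}

impl Vocab {
    /// Registers the special tokens up front, mirroring `Tokenizer::new`.
    fn new() -> Self {
        let mut v = Vocab { ids: HashMap::new(), next_id: 0 };
        for t in SPECIAL_TOKENS.iter() {
            v.add_token(t);
        }
        v
    }

    /// Returns the existing ID if the token is already present.
    fn add_token(&mut self, token: &str) -> u32 {
        if let Some(&id) = self.ids.get(token) {
            return id;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(token.to_string(), id);
        id
    }

    /// Returns false for special tokens and for tokens that are absent.
    fn remove_token(&mut self, token: &str) -> bool {
        if SPECIAL_TOKENS.contains(&token) {
            return false;
        }
        self.ids.remove(token).is_some()
    }

    /// Counts every entry, special tokens included.
    fn vocab_size(&self) -> usize {
        self.ids.len()
    }

    fn contains_token(&self, token: &str) -> bool {
        self.ids.contains_key(token)
    }
}
```

In this sketch a fresh Vocab reports a size of 5, adding the same token twice returns the same ID, and removing "[CLS]" returns false while leaving the size unchanged.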
pub fn add_merge_rule(&mut self, left: &str, right: &str)
Add a BPE merge rule. The merged token is also added to the vocab.
pub fn merge_rule_count(&self) -> usize
Number of registered merge rules.
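To make the effect of a merge rule concrete, the sketch below registers rules, adds each merged token to a vocabulary, and applies the rules to a word. It is a simplified illustration under stated assumptions: BpeSketch, its fields, and the apply method are hypothetical, and applying rules in registration order with one left-to-right pass per rule is an assumed simplification, not necessarily what this crate does.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the BPE state (not the crate's real type).
struct BpeSketch {
    vocab: HashMap<String, u32>,
    merges: Vec<(String, String)>,
}

impl BpeSketch {
    fn new() -> Self {
        BpeSketch { vocab: HashMap::new(), merges: Vec::new() }
    }

    /// Idempotent insert; assumption: the next free ID is the current size.
    fn add_token(&mut self, token: &str) -> u32 {
        let next = self.vocab.len() as u32;
        *self.vocab.entry(token.to_string()).or_insert(next)
    }

    fn contains_token(&self, token: &str) -> bool {
        self.vocab.contains_key(token)
    }

    /// Stores the rule and adds the concatenated token to the vocabulary.
    fn add_merge_rule(&mut self, left: &str, right: &str) {
        self.merges.push((left.to_string(), right.to_string()));
        let merged = format!("{}{}", left, right);
        self.add_token(&merged);
    }

    fn merge_rule_count(&self) -> usize {
        self.merges.len()
    }

    /// Splits a word into characters, then applies each rule in
    /// registration order with a single left-to-right pass.
    fn apply(&self, word: &str) -> Vec<String> {
        let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
        for (left, right) in &self.merges {
            let mut out = Vec::with_capacity(symbols.len());
            let mut i = 0;
            while i < symbols.len() {
                if i + 1 < symbols.len() && &symbols[i] == left && &symbols[i + 1] == right {
                    out.push(format!("{}{}", left, right));
                    i += 2;
                } else {
                    out.push(symbols[i].clone());
                    i += 1;
                }
            }
            symbols = out;
        }
        symbols
    }
}
```

With the rules ("l", "o") and ("lo", "w") registered in that order, this sketch splits "lower" into "low", "e", "r", and both "lo" and "low" end up in the vocabulary.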
pub fn token_to_id(&self, token: &str) -> Option<u32>
Look up the ID of a token. Returns None if the token is not in the vocabulary.
pub fn id_to_token(&self, id: u32) -> Option<&str>
Look up the token string for an ID. Returns None if no token has that ID.
pub fn encode(&self, text: &str) -> EncodeResult
Encode a text string into token IDs.
The output is truncated to config.max_length.
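The truncation step can be sketched in isolation. The function below is a hypothetical stand-in, not the crate's encode: it returns a plain Vec<u32> rather than EncodeResult (whose fields are not shown on this page), splits on whitespace instead of running BPE or WordPiece, and maps unknown words to a caller-supplied unk_id. All parameter names are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical whole-word encoder used only to illustrate lowercasing,
/// the unknown-token fallback, and truncation to `max_length`.
fn encode_sketch(
    vocab: &HashMap<String, u32>,
    unk_id: u32,
    text: &str,
    max_length: usize,
    lowercase: bool,
) -> Vec<u32> {
    // Optional normalization, analogous to what `is_lowercase` reports.
    let normalized = if lowercase {
        text.to_lowercase()
    } else {
        text.to_string()
    };

    // Assumption: whitespace splitting stands in for sub-word tokenization.
    let mut ids: Vec<u32> = normalized
        .split_whitespace()
        .map(|word| vocab.get(word).copied().unwrap_or(unk_id))
        .collect();

    // Drop everything past the configured maximum length.
    ids.truncate(max_length);
    ids
}
```

Truncating after the ID lookup keeps the sketch simple; whether the real implementation counts special tokens toward max_length is not stated here.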
pub fn decode(&self, ids: &[u32]) -> String
Decode a sequence of token IDs back into a string.
WordPiece continuation tokens (##…) are merged back without spaces.
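The continuation-token rule can be illustrated as follows. This is a minimal sketch under assumptions: decode_sketch and its id_to_token map are hypothetical, unknown IDs are silently skipped, and special tokens are left in the output, none of which is confirmed for the real decode.

```rust
use std::collections::HashMap;

/// Hypothetical decoder: joins tokens with single spaces, except that
/// tokens starting with "##" are glued onto the previous token.
fn decode_sketch(id_to_token: &HashMap<u32, String>, ids: &[u32]) -> String {
    let mut out = String::new();
    for id in ids {
        // Assumption: IDs missing from the map are skipped.
        let token = match id_to_token.get(id) {
            Some(t) => t,
            None => continue,
        };
        if let Some(rest) = token.strip_prefix("##") {
            // Continuation piece: append without a separating space.
            out.push_str(rest);
        } else {
            if !out.is_empty() {
                out.push(' ');
            }
            out.push_str(token);
        }
    }
    out
}
```

For example, with 10 mapped to "token", 11 to "##izer", and 12 to "works", decoding [10, 11, 12] in this sketch yields "tokenizer works".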
pub fn encode_batch(&self, texts: &[&str]) -> Vec<EncodeResult>
Encode a batch of texts.
pub fn max_length(&self) -> usize
Maximum sequence length.
pub fn mode(&self) -> &TokenizerMode
Active tokenization mode.
pub fn is_lowercase(&self) -> bool
Whether input is lowercased before tokenization.
Auto Trait Implementations
impl Freeze for Tokenizer
impl RefUnwindSafe for Tokenizer
impl Send for Tokenizer
impl Sync for Tokenizer
impl Unpin for Tokenizer
impl UnsafeUnpin for Tokenizer
impl UnwindSafe for Tokenizer
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> Pointable for T
impl<T> PolicyExt for T
where
    T: ?Sized,
impl<SS, SP> SupersetOf<SS> for SP
where
    SS: SubsetOf<SP>,
fn to_subset(&self) -> Option<SS>
The inverse inclusion map: attempts to construct self from the equivalent element of its superset.
fn is_in_subset(&self) -> bool
Checks if self is actually part of its subset T (and can be converted to it).
fn to_subset_unchecked(&self) -> SS
Use with care! Same as self.to_subset but without any property checks. Always succeeds.
fn from_subset(element: &SS) -> SP
The inclusion map: converts self to the equivalent element of its superset.