Struct tokenizers::tokenizer::AddedVocabulary
pub struct AddedVocabulary { /* private fields */ }
A vocabulary built on top of the Model
This provides a way to add new vocabulary to a Tokenizer that has already been trained, possibly in a separate process or by someone else. This is especially useful for fine-tuning, where we want to fine-tune a model while adding new functionality through new special tokens, or add tokens to cover text that would otherwise map to unknown tokens.
One reason these tokens must be handled outside of the model is that many models cannot accept new tokens after training. For example, with BPE, training generates merge pairs alongside the vocabulary, and any token in the vocabulary can be decomposed into other tokens, down to the original alphabet. If we added new tokens after training, we could not guarantee that the required merge pairs exist.
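The BPE constraint can be illustrated with a toy check. This is a minimal sketch, assuming a simplified BPE in which a token is valid only if it is a single alphabet symbol or the concatenation of a recorded merge pair whose halves are themselves valid. The names `is_derivable` and `sample` and the sample data are hypothetical and not part of the tokenizers crate.

```rust
use std::collections::HashSet;

/// Returns true if `token` can be built from `alphabet` symbols using only
/// the recorded `merges` (a toy model of a trained BPE vocabulary).
fn is_derivable(
    token: &str,
    alphabet: &HashSet<char>,
    merges: &HashSet<(String, String)>,
) -> bool {
    let mut chars = token.chars();
    match (chars.next(), chars.next()) {
        // Empty strings are never tokens.
        (None, _) => return false,
        // A single symbol is valid only if it belongs to the alphabet.
        (Some(c), None) => return alphabet.contains(&c),
        _ => {}
    }
    // Try every split point: the pair must be a recorded merge,
    // and both halves must be derivable themselves.
    token.char_indices().skip(1).any(|(i, _)| {
        let (left, right) = token.split_at(i);
        merges.contains(&(left.to_string(), right.to_string()))
            && is_derivable(left, alphabet, merges)
            && is_derivable(right, alphabet, merges)
    })
}

/// Hypothetical training result: alphabet {l, o, w}, merges l+o and lo+w.
fn sample() -> (HashSet<char>, HashSet<(String, String)>) {
    let alphabet: HashSet<char> = "low".chars().collect();
    let merges: HashSet<(String, String)> = [("l", "o"), ("lo", "w")]
        .iter()
        .map(|(a, b)| (a.to_string(), b.to_string()))
        .collect();
    (alphabet, merges)
}
```

With the sample data, `is_derivable("low", ...)` holds because both merges exist, while a token such as `"ow"` added after training has no supporting merge pair and is rejected. This is the situation the added vocabulary avoids by handling such tokens outside the model.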
Implementations
impl AddedVocabulary
pub fn new() -> Self
pub fn get_added_tokens_decoder(&self) -> &HashMap<u32, AddedToken>
Get the additional vocabulary with the AddedTokens
pub fn token_to_id(&self, token: &str, model: &impl Model) -> Option<u32>
Get the id matching one of our tokens if it exists
pub fn id_to_token(&self, id: u32, model: &impl Model) -> Option<String>
Get the token matching the given id if it exists
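Both lookups take the model as an argument, which suggests a layered lookup. The following is a minimal sketch, assuming the added tokens are consulted first and the model is used as a fallback; that ordering is an assumption, and `ToyModel`, `MapModel`, and `ToyAddedVocab` are hypothetical stand-ins rather than the crate's types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `Model` trait, reduced to lookups.
trait ToyModel {
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}

/// Hypothetical base vocabulary backed by a plain map.
struct MapModel {
    vocab: HashMap<String, u32>,
}

impl ToyModel for MapModel {
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }

    fn id_to_token(&self, id: u32) -> Option<String> {
        // Linear scan is fine for a sketch; a real model keeps a reverse map.
        self.vocab
            .iter()
            .find(|(_, v)| **v == id)
            .map(|(k, _)| k.clone())
    }
}

/// Hypothetical added vocabulary with forward and reverse maps.
#[derive(Default)]
struct ToyAddedVocab {
    to_id: HashMap<String, u32>,
    to_token: HashMap<u32, String>,
}

impl ToyAddedVocab {
    fn insert(&mut self, token: &str, id: u32) {
        self.to_id.insert(token.to_string(), id);
        self.to_token.insert(id, token.to_string());
    }

    /// Assumed order: added tokens first, then the model.
    fn token_to_id(&self, token: &str, model: &impl ToyModel) -> Option<u32> {
        self.to_id
            .get(token)
            .copied()
            .or_else(|| model.token_to_id(token))
    }

    /// Assumed order: added tokens first, then the model.
    fn id_to_token(&self, id: u32, model: &impl ToyModel) -> Option<String> {
        self.to_token
            .get(&id)
            .cloned()
            .or_else(|| model.id_to_token(id))
    }
}
```

In this sketch a caller needs only one entry point for both kinds of token: after `added.insert("<sep>", 2)`, the call `added.token_to_id("<sep>", &model)` resolves from the added map, while a token known only to the model falls through to `model.token_to_id`.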
pub fn set_encode_special_tokens(&mut self, value: bool)
pub fn get_encode_special_tokens(&self) -> bool
pub fn is_special_token(&self, token: &str) -> bool
Check if a token is a special token
pub fn add_special_tokens<N: Normalizer>( &mut self, tokens: &[AddedToken], model: &impl Model, normalizer: Option<&N> ) -> usize
Add some special tokens to the vocabulary
pub fn add_tokens<N: Normalizer>( &mut self, tokens: &[AddedToken], model: &impl Model, normalizer: Option<&N> ) -> usize
Add some tokens to the vocabulary
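The signature returns a `usize`, but the description does not say what it counts. The sketch below assumes it is the number of tokens that were actually new, and that new ids are assigned after the base vocabulary; both points are assumptions. `CountingVocab` is a hypothetical type, and the `model` and `normalizer` parameters of the real method are left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical added vocabulary that assigns ids after a base vocabulary.
struct CountingVocab {
    /// Size of the (hypothetical) model vocabulary; added ids start here.
    base_size: u32,
    to_id: HashMap<String, u32>,
}

impl CountingVocab {
    fn new(base_size: u32) -> Self {
        CountingVocab {
            base_size,
            to_id: HashMap::new(),
        }
    }

    /// Adds each unseen, non-empty token with the next free id and
    /// returns how many tokens were newly added (assumed semantics).
    fn add_tokens(&mut self, tokens: &[&str]) -> usize {
        let mut added = 0;
        for &token in tokens {
            if token.is_empty() || self.to_id.contains_key(token) {
                continue; // skip empty strings and duplicates
            }
            let id = self.base_size + self.to_id.len() as u32;
            self.to_id.insert(token.to_string(), id);
            added += 1;
        }
        added
    }
}
```

Under these assumptions, starting from `CountingVocab::new(100)` and calling `add_tokens(&["<a>", "<b>", "<a>"])` returns 2, because the repeated token is skipped. A caller could use such a count to decide how many embedding rows to append to a model being fine-tuned.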
pub fn extract_and_normalize<N: Normalizer>( &self, normalizer: Option<&N>, sequence: &str ) -> PreTokenizedString
Extract the additional vocabulary from the given sentence, normalizing it along the way.
Some tokens should match against their normalized representation, as well as the non-normalized one. For example, when we expect to extract the token yesterday in the input sentence I read a book Yesterday, if the normalizer is supposed to lowercase everything, we expect a match.
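The lowercase example can be reproduced with a small standalone sketch. It assumes a normalizer that only lowercases ASCII letters and a single added token; `Piece` and `extract_lowercased` are hypothetical names, and the real method returns a PreTokenizedString rather than this simplified list.

```rust
/// A piece of the input: either a matched added token or ordinary text.
#[derive(Debug, PartialEq)]
enum Piece {
    Added(String),
    Text(String),
}

/// Splits `sequence` around occurrences of `token`, matching against the
/// lowercased text. ASCII lowercasing keeps byte offsets unchanged.
fn extract_lowercased(sequence: &str, token: &str) -> Vec<Piece> {
    let normalized = sequence.to_ascii_lowercase();
    let mut pieces = Vec::new();
    if token.is_empty() {
        // Nothing to match: return the normalized text as a single piece.
        if !normalized.is_empty() {
            pieces.push(Piece::Text(normalized));
        }
        return pieces;
    }
    let mut start = 0;
    while let Some(pos) = normalized[start..].find(token) {
        let begin = start + pos;
        if begin > start {
            // Text between the previous match and this one.
            pieces.push(Piece::Text(normalized[start..begin].to_string()));
        }
        pieces.push(Piece::Added(token.to_string()));
        start = begin + token.len();
    }
    if start < normalized.len() {
        // Trailing text after the last match.
        pieces.push(Piece::Text(normalized[start..].to_string()));
    }
    pieces
}
```

For the sentence from the description, `extract_lowercased("I read a book Yesterday", "yesterday")` yields a text piece followed by the added token, whereas a plain `find` on the original sentence finds nothing because of the capital Y.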
Trait Implementations
impl Clone for AddedVocabulary
fn clone(&self) -> AddedVocabulary
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for AddedVocabulary
impl Default for AddedVocabulary
Auto Trait Implementations
impl Freeze for AddedVocabulary
impl RefUnwindSafe for AddedVocabulary
impl Send for AddedVocabulary
impl Sync for AddedVocabulary
impl Unpin for AddedVocabulary
impl UnwindSafe for AddedVocabulary
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.