Struct tokenizers::tokenizer::AddedVocabulary

pub struct AddedVocabulary { /* private fields */ }

A vocabulary built on top of the Model

This provides a way to add new vocabulary to a Tokenizer that has already been trained in a previous process, possibly by someone else. This is especially useful for fine-tuning, where we want to fine-tune a model while adding new functionality through new special tokens, or add tokens for text that would otherwise be treated as unknown.

One of the reasons we need to handle these tokens outside of the model is simply that, for many models, it is not possible to add new tokens after the training process. For example, with BPE, the training process generates merge pairs along with the vocabulary, and any token in the vocabulary can be decomposed into other tokens, down to the original alphabet. If we were to add new tokens after this training process, we couldn’t make sure that the required merge pairs exist.
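The BPE limitation described above can be illustrated with a toy encoder. This is a minimal sketch using only the standard library, not the crate's BPE implementation: `bpe_encode` and `bpe_example` are hypothetical names, and the merge table is made up for the illustration. A word only collapses into a single token when a chain of merge pairs leads to it, so a string that was merely appended to the vocabulary stays split into characters.

```rust
use std::collections::HashMap;

/// Toy BPE encoder (illustration only): start from single characters and
/// repeatedly apply the adjacent merge pair with the lowest rank.
fn bpe_encode(word: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the best (lowest) merge rank, if any.
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| {
                merges
                    .get(&(w[0].clone(), w[1].clone()))
                    .map(|&rank| (rank, i))
            })
            .min();
        match best {
            Some((_, i)) => {
                // Replace the pair at positions i, i + 1 with the merged string.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => return parts,
        }
    }
}

/// Hypothetical usage: "low" has merges leading to it, "<x>" does not.
fn bpe_example() -> (Vec<String>, Vec<String>) {
    let mut merges = HashMap::new();
    merges.insert(("l".to_string(), "o".to_string()), 0);
    merges.insert(("lo".to_string(), "w".to_string()), 1);
    let known = bpe_encode("low", &merges);
    // Even if "<x>" were appended to the vocabulary, no merge produces it.
    let added = bpe_encode("<x>", &merges);
    (known, added)
}
```

Here `bpe_example` yields a single piece for "low" and three separate characters for "<x>", which is why such tokens are matched outside of the model.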

Implementations

impl AddedVocabulary

pub fn new() -> Self

pub fn len(&self) -> usize

Size of the additional vocabulary

pub fn is_empty(&self) -> bool

Whether or not this vocabulary is empty

pub fn get_vocab(&self) -> &HashMap<String, u32>

Get the additional vocabulary

pub fn get_added_tokens_decoder(&self) -> &HashMap<u32, AddedToken>

Get the additional vocabulary as a map from id to AddedToken

pub fn token_to_id(&self, token: &str, model: &impl Model) -> Option<u32>

Get the id matching one of our tokens, if it exists

pub fn id_to_token(&self, id: u32, model: &impl Model) -> Option<String>

Get the token matching the given id, if it exists
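Both lookups take the model as an argument, which suggests that the added vocabulary is layered on top of the model's own vocabulary. The following is a minimal sketch of that layering under stated assumptions: `VocabModel`, `ToyModel`, `ToyAddedVocab` and `lookup_example` are hypothetical stand-ins written with the standard library only, and the lookup order (added tokens first, then the model) is an assumption rather than a statement about the crate's internals.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the crate's `Model` trait, reduced to two lookups.
trait VocabModel {
    fn token_to_id(&self, token: &str) -> Option<u32>;
    fn id_to_token(&self, id: u32) -> Option<String>;
}

/// Toy model backed by a plain map (illustration only).
struct ToyModel {
    vocab: HashMap<String, u32>,
}

impl VocabModel for ToyModel {
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }
    fn id_to_token(&self, id: u32) -> Option<String> {
        // Linear scan is fine for a toy vocabulary.
        self.vocab
            .iter()
            .find(|(_, v)| **v == id)
            .map(|(k, _)| k.clone())
    }
}

/// Toy added vocabulary: forward and reverse maps for the extra tokens.
struct ToyAddedVocab {
    added: HashMap<String, u32>,
    added_rev: HashMap<u32, String>,
}

impl ToyAddedVocab {
    /// Assumed order: check the added tokens first, then fall back to the model.
    fn token_to_id(&self, token: &str, model: &impl VocabModel) -> Option<u32> {
        self.added
            .get(token)
            .copied()
            .or_else(|| model.token_to_id(token))
    }
    fn id_to_token(&self, id: u32, model: &impl VocabModel) -> Option<String> {
        self.added_rev
            .get(&id)
            .cloned()
            .or_else(|| model.id_to_token(id))
    }
}

/// Hypothetical usage: one model token ("hello") and one added token ("<mask>").
fn lookup_example() -> (Option<u32>, Option<u32>, Option<String>, Option<u32>) {
    let model = ToyModel {
        vocab: HashMap::from([("hello".to_string(), 0)]),
    };
    let added = ToyAddedVocab {
        added: HashMap::from([("<mask>".to_string(), 1)]),
        added_rev: HashMap::from([(1, "<mask>".to_string())]),
    };
    (
        added.token_to_id("<mask>", &model),
        added.token_to_id("hello", &model),
        added.id_to_token(1, &model),
        added.token_to_id("missing", &model),
    )
}
```

In this sketch, a caller gets one lookup surface for both kinds of tokens, and `None` only when neither the added vocabulary nor the model knows the token.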

pub fn set_encode_special_tokens(&mut self, value: bool)

pub fn get_encode_special_tokens(&self) -> bool

pub fn is_special_token(&self, token: &str) -> bool

Check if a token is a special token

pub fn add_special_tokens<N: Normalizer>( &mut self, tokens: &[AddedToken], model: &impl Model, normalizer: Option<&N> ) -> usize

Add some special tokens to the vocabulary

pub fn add_tokens<N: Normalizer>( &mut self, tokens: &[AddedToken], model: &impl Model, normalizer: Option<&N> ) -> usize

Add some tokens to the vocabulary
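The `usize` return value can be read as the number of tokens that were actually added. The sketch below shows one plausible bookkeeping scheme using only the standard library. It is an assumption, not the crate's implementation: `add_tokens_sketch` is a hypothetical function, tokens are plain strings instead of `AddedToken`, new ids are assumed to continue after the model's vocabulary, and empty or already-present tokens are assumed to be skipped.

```rust
use std::collections::HashMap;

/// Adds `tokens` to `added`, assigning ids after the model's vocabulary.
/// Returns how many tokens were newly inserted (assumed semantics).
fn add_tokens_sketch(
    added: &mut HashMap<String, u32>,
    tokens: &[&str],
    model_vocab_size: u32,
) -> usize {
    let mut count = 0;
    for &token in tokens {
        // Skip empty tokens and tokens that were already added.
        if token.is_empty() || added.contains_key(token) {
            continue;
        }
        // Next free id: model vocabulary size plus tokens added so far.
        let id = model_vocab_size + added.len() as u32;
        added.insert(token.to_string(), id);
        count += 1;
    }
    count
}
```

With a model vocabulary of size 100, adding `["<a>", "<b>", "<a>"]` to an empty map in this sketch inserts two tokens, with ids 100 and 101, and ignores the duplicate.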

pub fn extract_and_normalize<N: Normalizer>( &self, normalizer: Option<&N>, sequence: &str ) -> PreTokenizedString

Extract the additional vocabulary from the given sentence, normalizing it along the way.

Some tokens should match against their normalized representation as well as the non-normalized one. For example, if the normalizer lowercases everything and we want to extract the token yesterday from the input sentence I read a book Yesterday, we expect a match.
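The lowercase example can be reproduced with a small standalone sketch. It only illustrates the idea of matching an added token against normalized text: `extract_sketch` is a hypothetical function, ASCII lowercasing stands in for the normalizer, a single token is searched, and the result is a list of `(text, Option<id>)` pieces rather than a `PreTokenizedString`.

```rust
/// Splits `sequence` into pieces, tagging occurrences of `token` with `id`.
/// Matching is done on the ASCII-lowercased text (stand-in normalizer).
fn extract_sketch(sequence: &str, token: &str, id: u32) -> Vec<(String, Option<u32>)> {
    let normalized = sequence.to_ascii_lowercase();
    // An empty token would match everywhere, so return the text unsplit.
    if token.is_empty() {
        return vec![(normalized, None)];
    }
    let mut pieces = Vec::new();
    let mut start = 0;
    while let Some(pos) = normalized[start..].find(token) {
        let begin = start + pos;
        // Text before the match is left for the model to tokenize.
        if begin > start {
            pieces.push((normalized[start..begin].to_string(), None));
        }
        pieces.push((token.to_string(), Some(id)));
        start = begin + token.len();
    }
    // Trailing text after the last match, if any.
    if start < normalized.len() {
        pieces.push((normalized[start..].to_string(), None));
    }
    pieces
}
```

Calling `extract_sketch("I read a book Yesterday", "yesterday", 42)` splits the sentence into the leading text and the matched token, even though the input spells the word with a capital letter.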

Trait Implementations

impl Clone for AddedVocabulary

fn clone(&self) -> AddedVocabulary

Returns a copy of the value.
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for AddedVocabulary

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for AddedVocabulary

fn default() -> Self

Returns the “default value” for a type.

impl Serialize for AddedVocabulary

fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
where S: Serializer,

Serialize this value into the given Serde serializer.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V