Struct BytePairEncoding

pub struct BytePairEncoding { /* private fields */ }

Representation of the byte pair dictionary. This struct provides various conversions between token ids, token bytes, and encoded text. They are collected in a single struct so that different implementations can reuse them.

Implementations

impl BytePairEncoding

pub fn from_dictionary( tokens: impl IntoIterator<Item = Vec<u8>>, hash_factor: Option<u64>, ) -> BytePairEncoding

Constructs a BytePairEncoding instance from an iterator that enumerates all tokens. A suitable hash factor may be necessary to prevent hash collisions; one can be found using [find_hash_factor_for_dictionary].

The recommended approach is to store the serialized value and reuse it, which avoids repeating the cost of computing the hash factor and the encoding.

pub fn num_tokens(&self) -> usize

Return the number of tokens in this BPE dictionary.

pub fn token_bytes(&self, token_id: u32) -> &[u8]

Converts a token id into its corresponding token bytes. Panics if token_id is outside the valid range 0..num_tokens().

pub fn token_len(&self, token_id: u32) -> usize

Returns the length of the decoded byte slice of a token.

pub fn decode_tokens(&self, tokens: &[u32]) -> Vec<u8>

Decodes a sequence of tokens back into its original byte sequence. Note: this returns bytes rather than a str, since not every token sequence corresponds to a valid UTF-8 sequence.
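
The reason for returning bytes can be illustrated with a minimal, self-contained sketch. It does not use this crate: the `dict` slice and the free function `decode_tokens` below are hypothetical stand-ins for the dictionary and the method, assuming decoding amounts to concatenating the bytes of each token.

```rust
// Hypothetical stand-in for the dictionary: index = token id, value = token bytes.
fn decode_tokens(dict: &[Vec<u8>], tokens: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &id in tokens {
        // Slice indexing panics on an out-of-range id.
        out.extend_from_slice(&dict[id as usize]);
    }
    out
}

fn main() {
    // "é" is the two bytes 0xC3 0xA9 in UTF-8; in this toy dictionary
    // each of those bytes is its own token.
    let dict: Vec<Vec<u8>> = vec![b"caf".to_vec(), vec![0xC3], vec![0xA9]];

    // The full token sequence decodes to valid UTF-8.
    let full = decode_tokens(&dict, &[0, 1, 2]);
    assert_eq!(String::from_utf8(full).unwrap(), "café");

    // Dropping the last token leaves a dangling lead byte, which is not valid UTF-8.
    let partial = decode_tokens(&dict, &[0, 1]);
    assert!(String::from_utf8(partial).is_err());
}
```

Callers that need text can attempt `String::from_utf8` on the decoded bytes and handle the error case, or use `String::from_utf8_lossy` when replacement characters are acceptable.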

pub fn count(&self, text: &[u8]) -> usize

Counts the number of tokens produced when encoding the text.

pub fn count_till_limit(&self, text: &[u8], token_limit: usize) -> Option<usize>

Returns the token count if and only if the total token count stays below the specified token_limit; otherwise returns None. This function can be faster than count when token_limit is much smaller than the number of tokens in the provided text.
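
The speed advantage comes from being able to stop early. The following is only a sketch of that idea, not this crate's implementation: the generic function below is hypothetical, a whitespace split stands in for a lazy tokenizer, and treating a count exactly equal to the limit as within the limit is an assumption.

```rust
/// Hypothetical sketch: counts the items of a lazy token iterator and gives up
/// as soon as the count exceeds `token_limit`.
/// Assumption: a count exactly equal to `token_limit` is still reported.
fn count_till_limit<I: Iterator>(tokens: I, token_limit: usize) -> Option<usize> {
    let mut count = 0;
    for _ in tokens {
        count += 1;
        if count > token_limit {
            // Stop early; the remaining input is never tokenized.
            return None;
        }
    }
    Some(count)
}

fn main() {
    let text = b"a b c d e f";

    // Stand-in tokenizer: split on spaces. `produced` records how many
    // tokens were actually generated before counting stopped.
    let mut produced = 0;
    let tokens = text.split(|b| *b == b' ').inspect(|_| produced += 1);
    assert_eq!(count_till_limit(tokens, 2), None);
    assert_eq!(produced, 3); // only three of the six tokens were produced

    // With a large enough limit the full count is returned.
    assert_eq!(count_till_limit(text.split(|b| *b == b' '), 6), Some(6));
}
```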

pub fn encode_via_table(&self, text: &[u8]) -> Vec<u32>

pub fn encode_via_backtracking(&self, text: &[u8]) -> Vec<u32>

pub fn encode_via_bitfield(&self, text: &[u8]) -> Vec<u32>

pub fn encode_greedy(&self, text: &[u8]) -> Vec<u32>

Using this function is not recommended, since it does not output the correct BPE-encoded sequence.

pub fn encode_minimal(&self, text: &[u8]) -> Vec<u32>

This function computes the shortest possible encoding sequence, which will usually differ from the tokenization produced by the original BPE algorithm.
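
To see how a shortest encoding can differ from other strategies, here is a self-contained toy sketch. It is not this crate's code: the free functions and the six-token dictionary are hypothetical, "greedy" is assumed to mean taking the longest matching token at each position, and the minimal encoding is found with a simple dynamic program over prefixes.

```rust
/// Hypothetical greedy encoder: at each position take the longest matching token.
/// Returns None if some position cannot be matched.
fn encode_greedy(dict: &[&[u8]], text: &[u8]) -> Option<Vec<u32>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // (token id, token length) of the longest token matching at `pos`.
        let mut best: Option<(usize, usize)> = None;
        for (id, &tok) in dict.iter().enumerate() {
            let longer = best.map_or(true, |(_, len)| tok.len() > len);
            if !tok.is_empty() && longer && text[pos..].starts_with(tok) {
                best = Some((id, tok.len()));
            }
        }
        let (id, len) = best?;
        out.push(id as u32);
        pos += len;
    }
    Some(out)
}

/// Hypothetical minimal encoder: dynamic programming over prefixes of `text`.
fn encode_minimal(dict: &[&[u8]], text: &[u8]) -> Option<Vec<u32>> {
    let n = text.len();
    // best[i] = (fewest tokens covering text[..i], id of the last token used).
    let mut best: Vec<Option<(usize, usize)>> = vec![None; n + 1];
    best[0] = Some((0, usize::MAX)); // empty prefix; the id is a sentinel
    for end in 1..=n {
        for (id, &tok) in dict.iter().enumerate() {
            if tok.is_empty() || tok.len() > end || !text[..end].ends_with(tok) {
                continue;
            }
            if let Some((count, _)) = best[end - tok.len()] {
                if best[end].map_or(true, |(c, _)| count + 1 < c) {
                    best[end] = Some((count + 1, id));
                }
            }
        }
    }
    // Walk the back-pointers from the end of the text to recover the tokens.
    let mut out = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (_, id) = best[pos]?;
        out.push(id as u32);
        pos -= dict[id].len();
    }
    out.reverse();
    Some(out)
}

fn main() {
    // Toy dictionary; token ids are the indices 0..=5.
    let dict: [&[u8]; 6] = [b"a", b"b", b"c", b"d", b"ab", b"bcd"];

    // Greedy commits to "ab" first and then needs "c" and "d": three tokens.
    assert_eq!(encode_greedy(&dict, b"abcd"), Some(vec![4, 2, 3]));
    // The shortest encoding is "a" + "bcd": two tokens.
    assert_eq!(encode_minimal(&dict, b"abcd"), Some(vec![0, 5]));
}
```

Neither toy function applies BPE merge rules, so neither is expected to reproduce the tokenization of the original BPE algorithm; the sketch only shows that different segmentation strategies over the same dictionary can disagree.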

Trait Implementations

impl<'de> Deserialize<'de> for BytePairEncoding

fn deserialize<__D>( __deserializer: __D, ) -> Result<BytePairEncoding, <__D as Deserializer<'de>>::Error>
where __D: Deserializer<'de>,

Deserialize this value from the given Serde deserializer.

impl Serialize for BytePairEncoding

fn serialize<__S>( &self, __serializer: __S, ) -> Result<<__S as Serializer>::Ok, <__S as Serializer>::Error>
where __S: Serializer,

Serialize this value into the given Serde serializer.

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> DeserializeOwned for T
where T: for<'de> Deserialize<'de>,