pub struct BytePairEncoding { /* private fields */ }
Representation of the byte pair dictionary. This struct provides various conversions. We put all of them into a single struct so that they can be reused by different implementations.
Implementations
impl BytePairEncoding
pub fn from_dictionary(
    tokens: impl IntoIterator<Item = Vec<u8>>,
    hash_factor: Option<u64>,
) -> BytePairEncoding
Construct a BytePairEncoding instance from an iterator that enumerates all tokens.
A suitable hash factor may be necessary to prevent hash collisions; one can be found using find_hash_factor_for_dictionary.
The recommended approach is to store the serialized value and reuse it, which avoids repeating the cost of computing the hash factor and the encoding.
pub fn num_tokens(&self) -> usize
Return the number of tokens in this BPE dictionary.
pub fn token_bytes(&self, token_id: u32) -> &[u8]
Converts a token id into its corresponding token bytes. Panics if token_id is outside the valid range 0..num_tokens().
pub fn token_len(&self, token_id: u32) -> usize
Returns the length of the decoded byte slice of a token.
pub fn decode_tokens(&self, tokens: &[u32]) -> Vec<u8>
Decodes a sequence of tokens back into its original byte sequence. Note: this does not return a str, since not every token sequence corresponds to a valid UTF-8 sequence.
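The note about UTF-8 can be illustrated with a minimal sketch. `ToyDictionary` and `sample_dictionary` below are hypothetical stand-ins written for this example, not the crate's types or internal layout; they only mirror the documented behavior of `token_bytes` and `decode_tokens`.

```rust
/// Hypothetical stand-in for a token dictionary; not the crate's internal layout.
pub struct ToyDictionary {
    tokens: Vec<Vec<u8>>,
}

impl ToyDictionary {
    pub fn new(tokens: Vec<Vec<u8>>) -> Self {
        Self { tokens }
    }

    /// Number of tokens in the toy dictionary.
    pub fn num_tokens(&self) -> usize {
        self.tokens.len()
    }

    /// Bytes of one token. Indexing panics for ids outside 0..num_tokens().
    pub fn token_bytes(&self, token_id: u32) -> &[u8] {
        &self.tokens[token_id as usize]
    }

    /// Concatenates the bytes of every token, in order.
    pub fn decode_tokens(&self, tokens: &[u32]) -> Vec<u8> {
        let mut out = Vec::new();
        for &id in tokens {
            out.extend_from_slice(self.token_bytes(id));
        }
        out
    }
}

/// Toy dictionary in which the two UTF-8 bytes of "é" (0xC3 0xA9) are separate tokens.
pub fn sample_dictionary() -> ToyDictionary {
    ToyDictionary::new(vec![vec![0xC3], vec![0xA9], b"caf".to_vec()])
}
```

With this dictionary, decoding `[2, 0, 1]` yields the bytes of "café", while decoding `[2, 0]` ends in the middle of a multi-byte character, so `std::str::from_utf8` rejects it. Returning `Vec<u8>` leaves that validation to the caller.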
pub fn count(&self, text: &[u8]) -> usize
Counts the number of tokens produced when encoding the text.
pub fn count_till_limit(&self, text: &[u8], token_limit: usize) -> Option<usize>
Returns the token count iff the total token count stays below the specified token_limit. Otherwise, it returns None.
This function can be faster than count when the token_limit is much smaller than the provided text.
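The speed-up comes from being able to stop early. The sketch below shows the general idea on a lazy token stream; it is not the crate's implementation. `count_with_early_exit` and `byte_tokens` are hypothetical names, the one-token-per-byte tokenizer is only a placeholder, and the sketch assumes that a count exactly equal to the limit is still accepted, which the description above does not settle.

```rust
/// Counts the items of `tokens`, giving up as soon as the count exceeds `token_limit`.
/// Assumption: a count equal to `token_limit` is still reported as `Some`.
pub fn count_with_early_exit<I>(tokens: I, token_limit: usize) -> Option<usize>
where
    I: IntoIterator<Item = u32>,
{
    let mut count = 0usize;
    for _ in tokens {
        count += 1;
        if count > token_limit {
            // Stop here: the rest of the input is never tokenized.
            return None;
        }
    }
    Some(count)
}

/// Placeholder tokenizer for the sketch: lazily yields one token per input byte.
pub fn byte_tokens(text: &[u8]) -> impl Iterator<Item = u32> + '_ {
    text.iter().map(|&b| u32::from(b))
}
```

For a 1000-byte input and a limit of 10, the loop inspects only 11 tokens before returning `None`, whereas a full count would have to process all 1000.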
pub fn encode_via_table(&self, text: &[u8]) -> Vec<u32>
pub fn encode_via_backtracking(&self, text: &[u8]) -> Vec<u32>
pub fn encode_via_bitfield(&self, text: &[u8]) -> Vec<u32>
pub fn encode_greedy(&self, text: &[u8]) -> Vec<u32>
It is not recommended to use this function, since it doesn’t output the correct BPE encoded sequence.
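A toy comparison shows how a greedy encoding can diverge from BPE. Everything below is a hypothetical sketch, not the crate's code: it assumes that "greedy" means longest-prefix matching, that a token's id doubles as its merge priority (lower id merges first), and it uses a naive quadratic merge loop for clarity.

```rust
/// Toy dictionary; the index of each entry is its token id.
pub const SAMPLE_DICTIONARY: [&[u8]; 5] = [b"a", b"b", b"c", b"bc", b"ab"];

/// Greedy sketch: repeatedly take the longest dictionary entry that prefixes the rest.
pub fn encode_greedy_toy(dict: &[&[u8]], text: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let mut best: Option<(u32, usize)> = None; // (token id, token length)
        for (id, tok) in dict.iter().enumerate() {
            let longer = best.map_or(true, |(_, len)| tok.len() > len);
            if !tok.is_empty() && longer && text[pos..].starts_with(tok) {
                best = Some((id as u32, tok.len()));
            }
        }
        let (id, len) = best.expect("every single byte of the text must be a token");
        out.push(id);
        pos += len;
    }
    out
}

/// Naive BPE sketch: start from single bytes and repeatedly merge the adjacent
/// pair whose concatenation has the lowest token id, until no pair can merge.
pub fn encode_bpe_toy(dict: &[&[u8]], text: &[u8]) -> Vec<u32> {
    let mut parts: Vec<Vec<u8>> = text.iter().map(|&b| vec![b]).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (token id, left position)
        for i in 0..parts.len().saturating_sub(1) {
            let merged = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
            if let Some(id) = dict.iter().position(|tok| *tok == merged.as_slice()) {
                if best.map_or(true, |(best_id, _)| id < best_id) {
                    best = Some((id, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = parts.remove(i + 1);
                parts[i].extend_from_slice(&right);
            }
            None => break,
        }
    }
    parts
        .iter()
        .map(|p| {
            dict.iter()
                .position(|tok| *tok == p.as_slice())
                .expect("every single byte of the text must be a token") as u32
        })
        .collect()
}
```

On the input "abc", the greedy sketch picks "ab" then "c" (ids `[4, 2]`), while the merge loop first joins "bc" because it has the lower id and ends with "a", "bc" (ids `[0, 3]`). Both decode to the same bytes, but only the second matches the merge order.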
pub fn encode_minimal(&self, text: &[u8]) -> Vec<u32>
This function computes the shortest possible encoding sequence, which will usually differ from the tokenization produced by the original BPE algorithm.
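One common way to find a shortest tokenization is dynamic programming over prefix lengths. The sketch below illustrates that technique on a toy dictionary; it is an assumption about how such a result could be computed, not a description of the crate's implementation. `encode_minimal_toy` and `MINIMAL_SAMPLE` are hypothetical names.

```rust
/// Toy dictionary; the index of each entry is its token id.
pub const MINIMAL_SAMPLE: [&[u8]; 7] = [b"a", b"b", b"c", b"d", b"ab", b"bc", b"bcd"];

/// Shortest tokenization via dynamic programming.
/// Returns `None` if the text cannot be covered by dictionary entries.
pub fn encode_minimal_toy(dict: &[&[u8]], text: &[u8]) -> Option<Vec<u32>> {
    let n = text.len();
    // best[i] = (token count, id of last token) for the shortest cover of text[..i].
    let mut best: Vec<Option<(usize, u32)>> = vec![None; n + 1];
    best[0] = Some((0, 0)); // empty prefix: zero tokens, the id is a placeholder
    for end in 1..=n {
        for (id, tok) in dict.iter().enumerate() {
            if tok.is_empty() || !text[..end].ends_with(tok) {
                continue;
            }
            if let Some((count, _)) = best[end - tok.len()] {
                if best[end].map_or(true, |(c, _)| count + 1 < c) {
                    best[end] = Some((count + 1, id as u32));
                }
            }
        }
    }
    // Walk back from the end, following the recorded last tokens.
    let mut out = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let (_, id) = best[pos]?;
        out.push(id);
        pos -= dict[id as usize].len();
    }
    out.reverse();
    Some(out)
}
```

For the input "abcd" this returns "a", "bcd" (ids `[0, 6]`), two tokens. A merge loop that always joins the lowest-id pair first would merge "ab" on this toy dictionary and then be unable to form "bcd", ending with three tokens, which is the kind of divergence the description refers to.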
Trait Implementations
impl<'de> Deserialize<'de> for BytePairEncoding
fn deserialize<__D>(
    __deserializer: __D,
) -> Result<BytePairEncoding, <__D as Deserializer<'de>>::Error>
where
    __D: Deserializer<'de>,
impl Serialize for BytePairEncoding
fn serialize<__S>(
    &self,
    __serializer: __S,
) -> Result<<__S as Serializer>::Ok, <__S as Serializer>::Error>
where
    __S: Serializer,
Auto Trait Implementations
impl Freeze for BytePairEncoding
impl RefUnwindSafe for BytePairEncoding
impl Send for BytePairEncoding
impl Sync for BytePairEncoding
impl Unpin for BytePairEncoding
impl UnwindSafe for BytePairEncoding
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.