CoreBPE

Struct CoreBPE

pub struct CoreBPE { /* private fields */ }

Implementations§

impl CoreBPE

Rust API

pub fn new( encoder: HashMap<Vec<u8>, Rank>, special_tokens_encoder: HashMap<String, Rank>, pattern: &str, ) -> Result<Self>

pub fn decode(&self, tokens: Vec<Rank>) -> Result<String>

Decodes a vector of tokens into a valid UTF-8 String.

If Unicode validation is not wanted, see _decode_native.
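The validation step can be pictured with a small standard-library sketch. It is not the crate's implementation: the lookup table, the `decode_validated` helper, and the `Rank = u32` alias are assumptions made for illustration only. The idea shown is that token bytes are concatenated first and checked as UTF-8 afterwards, so a sequence that stops in the middle of a multi-byte character fails.

```rust
use std::collections::HashMap;
use std::string::FromUtf8Error;

/// Hypothetical stand-in for the crate's token id type (assumed to be `u32`).
type Rank = u32;

/// Toy decoder: concatenate each token's bytes, then validate the whole
/// buffer as UTF-8. Unknown ids are skipped only to keep the sketch short.
fn decode_validated(
    table: &HashMap<Rank, Vec<u8>>,
    tokens: &[Rank],
) -> Result<String, FromUtf8Error> {
    let mut bytes = Vec::new();
    for token in tokens {
        if let Some(piece) = table.get(token) {
            bytes.extend_from_slice(piece);
        }
    }
    String::from_utf8(bytes)
}

/// Hypothetical two-entry table that splits "é" (0xC3 0xA9) across two tokens.
fn toy_table() -> HashMap<Rank, Vec<u8>> {
    HashMap::from([(0, vec![0xC3]), (1, vec![0xA9])])
}
```

With this toy table, decoding tokens `[0, 1]` succeeds, while decoding `[0]` alone returns an error because the byte sequence is incomplete.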

pub fn _decode_native_and_split( &self, tokens: Vec<Rank>, ) -> impl Iterator<Item = Vec<u8>> + '_

pub fn split_by_token<'a>( &'a self, text: &'a str, use_special_tokens: bool, ) -> Result<Vec<String>>

Tokenize a string and return each token decoded back to text.

This method encodes the input with the BPE model, then decodes every resulting token separately, so the returned vector holds one string per token.

§Examples

    use tiktoken_rs::cl100k_base;
    let bpe = cl100k_base().unwrap();
    let tokenized: Result<Vec<_>, _> = bpe
        .split_by_token("This is a test         with a lot of spaces", true);
    let tokenized = tokenized.unwrap();
    assert_eq!(
        tokenized,
        vec!["This", " is", " a", " test", "        ", " with", " a", " lot", " of", " spaces"]
    );

§Arguments

text: A string slice containing the text to be tokenized.
use_special_tokens: A boolean indicating whether to use the special tokens in the BPE model.

§Returns

Result<Vec<String>>: A Result containing the decoded tokens as a vector of strings, or an error if a decoded token cannot be converted into a valid UTF-8 string.

§Errors

This function returns an error if the bytes of a decoded token cannot be converted into a valid UTF-8 string.
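One way such an error can arise is sketched below with the standard library only. It assumes that each token's bytes are validated on their own, which is an assumption about the mechanism rather than something this page states; `pieces_to_strings` is a hypothetical helper, not part of the crate. A multi-byte character whose bytes land in two different pieces is valid when joined but invalid piece by piece.

```rust
use std::string::FromUtf8Error;

/// Validate every piece independently and return one `String` per piece,
/// or the first UTF-8 error encountered.
fn pieces_to_strings(pieces: &[Vec<u8>]) -> Result<Vec<String>, FromUtf8Error> {
    pieces
        .iter()
        .map(|bytes| String::from_utf8(bytes.clone()))
        .collect()
}
```

For example, the pieces `[0xC3]` and `[0xA9]` fail individually, while the single piece `[0xC3, 0xA9]` decodes to "é".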

pub fn split_by_token_iter<'a>( &'a self, text: &'a str, use_special_tokens: bool, ) -> impl Iterator<Item = Result<String>> + 'a

Returns an iterator that encodes the text and yields each decoded token as a Result<String>. See split_by_token for more details.
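Because the iterator yields one Result per token, the caller decides how failures are handled. The sketch below uses only the standard library and a generic error type `E` in place of the crate's error type; `collect_pieces` and `count_ok` are hypothetical helpers written for illustration. It shows two common choices: collecting into a single Result, which stops at the first error, and counting the successful pieces without building a vector.

```rust
/// Collect fallible pieces into one `Result`, stopping at the first error.
fn collect_pieces<I, E>(pieces: I) -> Result<Vec<String>, E>
where
    I: Iterator<Item = Result<String, E>>,
{
    pieces.collect()
}

/// Count the pieces that decoded successfully, ignoring failures,
/// without allocating a `Vec` for the results.
fn count_ok<I, E>(pieces: I) -> usize
where
    I: Iterator<Item = Result<String, E>>,
{
    pieces.filter(|piece| piece.is_ok()).count()
}
```

The first helper mirrors the shape of a collected `Result<Vec<String>>`; the second is the kind of streaming use an iterator allows when only a summary is needed.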

pub fn split_by_token_ordinary<'a>( &'a self, text: &'a str, ) -> Result<Vec<String>>

Tokenize a string without special-token handling and return the decoded tokens. This method is equivalent to split_by_token(text, false).

pub fn split_by_token_ordinary_iter<'a>( &'a self, text: &'a str, ) -> impl Iterator<Item = Result<String>> + 'a

Iterator variant of split_by_token_ordinary. This method is equivalent to split_by_token_iter(text, false).

impl CoreBPE

pub fn encode_ordinary(&self, text: &str) -> Vec<Rank>

pub fn encode( &self, text: &str, allowed_special: &HashSet<&str>, ) -> (Vec<Rank>, usize)

pub fn _encode_unstable_native( &self, text: &str, allowed_special: &HashSet<&str>, ) -> (Vec<Rank>, HashSet<Vec<Rank>>)

pub fn special_tokens(&self) -> HashSet<&str>

pub fn encode_with_special_tokens(&self, text: &str) -> Vec<Rank>

Trait Implementations§

impl Clone for CoreBPE

fn clone(&self) -> CoreBPE

Returns a duplicate of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

Auto Trait Implementations§

impl Freeze for CoreBPE

impl RefUnwindSafe for CoreBPE

impl Send for CoreBPE

impl Sync for CoreBPE

impl Unpin for CoreBPE

impl UnwindSafe for CoreBPE

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)

Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.