pub struct CoreBPE { /* private fields */ }
Implementations§
impl CoreBPE
pub fn new( encoder: HashMap<Vec<u8>, Rank>, special_tokens_encoder: HashMap<String, Rank>, pattern: &str, ) -> Result<Self>
pub fn decode(&self, tokens: Vec<Rank>) -> Result<String>
Decode a vector of tokens into a valid UTF-8 String.
If UTF-8 validation is not wanted, see _decode_native.
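The difference between validated and unvalidated decoding can be illustrated with the standard library alone. The sketch below does not call tiktoken_rs. The helpers decode_strict and decode_lossy are hypothetical names used only to contrast decoded bytes that are checked as UTF-8 with raw bytes that the caller handles.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: reject byte sequences that are not valid UTF-8,
/// which is the behavior a validating decoder exposes to its caller.
fn decode_strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Hypothetical helper: accept any bytes and replace invalid sequences
/// with U+FFFD, one option when working with unvalidated output.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

A byte sequence that ends partway through a multi-byte character makes decode_strict return an error, while decode_lossy still produces a String containing a replacement character.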
pub fn _decode_native_and_split( &self, tokens: Vec<Rank>, ) -> impl Iterator<Item = Vec<u8>> + '_
pub fn split_by_token<'a>( &'a self, text: &'a str, use_special_tokens: bool, ) -> Result<Vec<String>>
Tokenize a string and return each token decoded back to text.
This method encodes text with the BPE model, then decodes every resulting token individually and returns the pieces as a vector of strings.
§Examples
use tiktoken_rs::cl100k_base;

let bpe = cl100k_base().unwrap();
// The run of spaces after "test" is deliberate: it becomes its own token.
let tokenized: Result<Vec<_>, _> = bpe
    .split_by_token("This is a test         with a lot of spaces", true);
let tokenized = tokenized.unwrap();
assert_eq!(
    tokenized,
    vec!["This", " is", " a", " test", "        ", " with", " a", " lot", " of", " spaces"]
);
§Arguments
- text: A string slice containing the text to be tokenized.
- use_special_tokens: A boolean indicating whether to use the special tokens in the BPE model.
§Returns
Result<Vec<String>>: A Result containing the decoded tokens as strings, or an error if a token cannot be decoded into a valid UTF-8 string.
§Errors
This function will return an error if:
- A token's bytes cannot be converted into a valid UTF-8 string during decoding, for example when the bytes of a multi-byte character are split across tokens.
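The error condition is easiest to see with a toy vocabulary. The sketch below uses only the standard library. The vocabulary, the u32 rank type, and the helpers toy_vocab, split_tokens and join_tokens are assumptions made for illustration, not part of tiktoken_rs. It splits the two bytes of "é" across two tokens, so decoding each token on its own fails while decoding the joined bytes succeeds.

```rust
use std::collections::HashMap;
use std::string::FromUtf8Error;

/// Toy vocabulary (illustrative only, not cl100k_base):
/// rank 0 = "caf", ranks 1 and 2 = the two bytes of "é" (0xC3, 0xA9).
fn toy_vocab() -> HashMap<u32, Vec<u8>> {
    HashMap::from([
        (0, b"caf".to_vec()),
        (1, vec![0xC3]),
        (2, vec![0xA9]),
    ])
}

/// Decode every token on its own; a token whose bytes are not
/// valid UTF-8 by themselves yields an Err entry.
fn split_tokens(
    vocab: &HashMap<u32, Vec<u8>>,
    tokens: &[u32],
) -> Vec<Result<String, FromUtf8Error>> {
    tokens
        .iter()
        .map(|t| String::from_utf8(vocab[t].clone()))
        .collect()
}

/// Concatenate the bytes of all tokens first, then validate once.
fn join_tokens(
    vocab: &HashMap<u32, Vec<u8>>,
    tokens: &[u32],
) -> Result<String, FromUtf8Error> {
    let mut bytes = Vec::new();
    for t in tokens {
        bytes.extend_from_slice(&vocab[t]);
    }
    String::from_utf8(bytes)
}
```

With tokens [0, 1, 2], split_tokens gives one valid piece followed by two errors, whereas join_tokens returns "café".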
pub fn split_by_token_iter<'a>( &'a self, text: &'a str, use_special_tokens: bool, ) -> impl Iterator<Item = Result<String>> + 'a
Returns an iterator that yields each token of text decoded as a String, or an error for a token that cannot be decoded.
See split_by_token for more details.
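An iterator of Result items leaves error handling to the caller. The sketch below uses only the standard library. The function pieces is a hypothetical stand-in with the same item shape as the iterator above, and FromUtf8Error is an assumed error type that may differ from the crate's. It shows two ways to consume such an iterator: stop at the first failure, or keep only the tokens that decoded cleanly.

```rust
use std::string::FromUtf8Error;

/// Hypothetical stand-in for a token iterator: each item is either a
/// decoded token or the error produced while decoding it.
fn pieces(raw: Vec<Vec<u8>>) -> impl Iterator<Item = Result<String, FromUtf8Error>> {
    raw.into_iter().map(String::from_utf8)
}

/// Collect into a single Result, stopping at the first token that fails.
fn collect_all(raw: Vec<Vec<u8>>) -> Result<Vec<String>, FromUtf8Error> {
    pieces(raw).collect()
}

/// Skip failures and keep only the tokens that decoded cleanly.
fn collect_valid(raw: Vec<Vec<u8>>) -> Vec<String> {
    pieces(raw).filter_map(Result::ok).collect()
}
```

Collecting into Result<Vec<String>, _> gives the all-or-nothing behavior of a vector-returning method, while filter_map trades completeness for tolerance of undecodable tokens.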
impl CoreBPE
pub fn encode_ordinary(&self, text: &str) -> Vec<Rank>
pub fn encode( &self, text: &str, allowed_special: &HashSet<&str>, ) -> (Vec<Rank>, usize)
pub fn _encode_unstable_native( &self, text: &str, allowed_special: &HashSet<&str>, ) -> (Vec<Rank>, HashSet<Vec<Rank>>)
pub fn special_tokens(&self) -> HashSet<&str>
pub fn encode_with_special_tokens(&self, text: &str) -> Vec<Rank>
Trait Implementations§
Auto Trait Implementations§
impl Freeze for CoreBPE
impl RefUnwindSafe for CoreBPE
impl Send for CoreBPE
impl Sync for CoreBPE
impl Unpin for CoreBPE
impl UnwindSafe for CoreBPE
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.