Struct tiktoken_rust::Encoding

pub struct Encoding { /* private fields */ }

Implementations

impl Encoding

Public interfaces for encoding

pub fn encode_ordinary(&self, text: &str) -> Vec<usize>

Encodes a string into tokens, ignoring special tokens.

This is equivalent to calling encode with an empty disallowed_special set, but slightly faster.

pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>>

Encodes a list of strings into tokens, in parallel, ignoring special tokens.

This is equivalent to calling encode_batch on texts with an empty disallowed_special set, but slightly faster.

pub fn encode( &self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_> ) -> Result<Vec<usize>, EncodeError>

Encodes a string into tokens.

Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Because they can be used to trick a model into doing something we don't want it to do, we want to avoid encoding them accidentally. Hence, by default, encode returns an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowed_special and disallowed_special parameters. In particular:

Setting disallowed_special to an empty set prevents this function from returning an error and causes all text corresponding to special tokens to be encoded as natural text.
Setting allowed_special to “All” causes all text corresponding to special tokens to be encoded as special tokens.
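The check described above can be sketched without the crate itself. The following is a minimal illustration, not the library's implementation: `SketchError` and `check_special` are hypothetical names, and plain `HashSet<&str>` values stand in for the real `AllowedSpecial` and `DisallowedSpecial` types, whose exact shape is not shown on this page.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
enum SketchError {
    /// The input contained this special token, which was not allowed.
    DisallowedSpecial(String),
}

/// Returns an error if `text` contains any token from `disallowed`
/// that is not also listed in `allowed`.
fn check_special(
    text: &str,
    allowed: &HashSet<&str>,
    disallowed: &HashSet<&str>,
) -> Result<(), SketchError> {
    for tok in disallowed {
        // A token that is explicitly allowed never triggers the error.
        if !allowed.contains(tok) && text.contains(*tok) {
            return Err(SketchError::DisallowedSpecial((*tok).to_string()));
        }
    }
    Ok(())
}
```

With an empty `disallowed` set the check always passes, which mirrors the first case above: the special-token text is then treated as ordinary text by the rest of the encoder.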

pub fn encode_batch( &self, texts: Vec<&str>, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_> ) -> Result<Vec<Vec<usize>>, EncodeError>

Encodes a list of strings into tokens, in parallel.

See encode for more details on allowed_special and disallowed_special.
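To illustrate what "in parallel" can mean for a batch call, here is a small sketch using only the standard library. It assumes one scoped thread per input and a toy one-token-per-byte encoder; `toy_encode` and `toy_encode_batch` are hypothetical names, and the crate may well schedule its work differently (for example with a thread pool).

```rust
use std::thread;

/// Toy stand-in for a real tokenizer: one "token" per input byte.
fn toy_encode(text: &str) -> Vec<usize> {
    text.bytes().map(|b| b as usize).collect()
}

/// Encodes every input on its own scoped thread and returns the
/// results in the same order as the inputs.
fn toy_encode_batch(texts: Vec<&str>) -> Vec<Vec<usize>> {
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = texts
            .iter()
            .map(|&t| s.spawn(move || toy_encode(t)))
            .collect();
        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

The point of the sketch is the ordering guarantee: each output vector sits at the same index as the text it was produced from.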

pub fn encode_with_unstable( &self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_> ) -> Result<(Vec<usize>, Vec<Vec<usize>>), EncodeError>

Encodes a string into stable tokens and possible completion sequences. Note that the stable tokens will only represent a substring of text. See encode for more details on allowed_special and disallowed_special. This API should itself be considered unstable.

pub fn encode_single_token(&self, piece: &[u8]) -> Result<usize, EncodeError>

Encodes text corresponding to a single token to its token value.

NOTE: this will encode all special tokens.
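One way to picture a single-token lookup that also covers special tokens is a two-table search. This is only a sketch under assumptions: `ToyVocab`, its fields, and `toy_vocab` are hypothetical, and it returns `Option<usize>` where the real method returns `Result<usize, EncodeError>`.

```rust
use std::collections::HashMap;

/// Hypothetical vocabulary with separate ordinary and special tables.
struct ToyVocab {
    ordinary: HashMap<Vec<u8>, usize>,
    special: HashMap<String, usize>,
}

impl ToyVocab {
    /// Looks the piece up as an ordinary token first, then as a special token.
    fn encode_single_token(&self, piece: &[u8]) -> Option<usize> {
        if let Some(&id) = self.ordinary.get(piece) {
            return Some(id);
        }
        // Special tokens are stored as text, so the piece must be valid UTF-8.
        std::str::from_utf8(piece)
            .ok()
            .and_then(|s| self.special.get(s).copied())
    }
}

/// Builds a tiny vocabulary with made-up token ids for illustration.
fn toy_vocab() -> ToyVocab {
    let mut ordinary = HashMap::new();
    ordinary.insert(b"hello".to_vec(), 0);
    let mut special = HashMap::new();
    special.insert("<|endoftext|>".to_string(), 100);
    ToyVocab { ordinary, special }
}
```

Because the special table is always consulted, there is no allow or disallow policy here, which matches the note above.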

impl Encoding

Public interfaces for decoding

pub fn decode_bytes(&self, tokens: &[usize]) -> Vec<u8>

Decodes a list of tokens into bytes.
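Conceptually, decoding to bytes amounts to concatenating the byte sequence behind each token. The sketch below assumes a hypothetical `HashMap<usize, Vec<u8>>` vocabulary and silently skips unknown token ids; both are illustration choices, not documented behaviour of the crate.

```rust
use std::collections::HashMap;

/// Concatenates the bytes of each known token, in order.
/// Unknown token ids are skipped (an assumption made for this sketch).
fn toy_decode_bytes(vocab: &HashMap<usize, Vec<u8>>, tokens: &[usize]) -> Vec<u8> {
    tokens
        .iter()
        .filter_map(|t| vocab.get(t))
        .flat_map(|bytes| bytes.iter().copied())
        .collect()
}
```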

pub fn decode( &self, tokens: &Vec<usize>, mode: DecodeMode ) -> Result<String, EncodeError>

Decodes a list of tokens into a string.

WARNING: decoded bytes are not guaranteed to be valid UTF-8. You can control this behaviour using the mode parameter:

Strict mode performs a validity check and returns Err if the decoded bytes are not valid UTF-8.
Replace mode replaces invalid UTF-8 sequences with U+FFFD.
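The difference between the two modes can be reproduced with the standard library alone. In this sketch `SketchMode` and `bytes_to_string` are hypothetical names standing in for `DecodeMode` and the final step of decode; the error type is `std::str::Utf8Error` rather than the crate's `EncodeError`.

```rust
/// Hypothetical stand-in for the crate's DecodeMode.
#[derive(Debug, Clone, Copy)]
enum SketchMode {
    Strict,
    Replace,
}

/// Converts raw decoded bytes to a String according to `mode`.
fn bytes_to_string(bytes: &[u8], mode: SketchMode) -> Result<String, std::str::Utf8Error> {
    match mode {
        // Strict: fail on the first invalid UTF-8 sequence.
        SketchMode::Strict => std::str::from_utf8(bytes).map(|s| s.to_owned()),
        // Replace: substitute U+FFFD for each invalid sequence.
        SketchMode::Replace => Ok(String::from_utf8_lossy(bytes).into_owned()),
    }
}
```

Invalid sequences are a realistic case when a token boundary falls inside a multi-byte character: the bytes `[0x68, 0xFF]` fail under the strict variant and become `"h\u{FFFD}"` under the replacing one.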
