Struct tiktoken_rust::Encoding
pub struct Encoding { /* private fields */ }
Implementations
impl Encoding
Public interfaces for encoding
pub fn encode_ordinary(&self, text: &str) -> Vec<usize>
Encodes a string into tokens, ignoring special tokens.
This is equivalent to encode(text, disallowed_special=())
(but slightly faster).
pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>>
Encodes a list of strings into tokens, in parallel, ignoring special tokens.
This is equivalent to encode_batch(texts, disallowed_special=())
(but slightly faster).
pub fn encode(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<usize>, EncodeError>
Encodes a string into tokens.
Special tokens are artificial tokens used to unlock capabilities from a model,
such as fill-in-the-middle. We want to be careful about accidentally encoding special
tokens, since they can be used to trick a model into doing something we don’t want it to do.
Hence, by default, encode returns an error if it encounters text that corresponds
to a special token. This can be controlled on a per-token level using the allowed_special
and disallowed_special parameters. In particular:
- Setting disallowed_special to () prevents this function from returning errors and causes all text corresponding to special tokens to be encoded as natural text.
- Setting allowed_special to “All” causes this function to encode all text corresponding to special tokens as special tokens.
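The check described above can be illustrated without the crate itself. The following is a minimal, std-only sketch and not the crate's implementation: `check_special`, `SpecialTokenError`, and the slice-based parameters are hypothetical names chosen for this example, and `<|endoftext|>` is used only as a sample special token.

```rust
/// Hypothetical error type for this sketch; it stands in for the crate's
/// `EncodeError`, whose definition is not shown on this page.
#[derive(Debug, PartialEq)]
struct SpecialTokenError(String);

/// Returns an error naming the first disallowed special token found in `text`.
/// A token listed in `allowed` is never reported, even if it is also listed
/// in `disallowed`.
fn check_special(
    text: &str,
    allowed: &[&str],
    disallowed: &[&str],
) -> Result<(), SpecialTokenError> {
    for tok in disallowed {
        if !allowed.contains(tok) && text.contains(*tok) {
            return Err(SpecialTokenError((*tok).to_string()));
        }
    }
    Ok(())
}
```

Under these assumptions, passing an empty `disallowed` slice plays the role of `disallowed_special = ()`: nothing is rejected, and the text would go on to be encoded as ordinary text.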
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<Vec<usize>>, EncodeError>
Encodes a list of strings into tokens, in parallel.
See encode for more details on allowed_special and disallowed_special.
pub fn encode_with_unstable(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<(Vec<usize>, Vec<Vec<usize>>), EncodeError>
Encodes a string into stable tokens and possible completion sequences.
Note that the stable tokens will only represent a substring of text.
See encode for more details on allowed_special and disallowed_special.
This API should itself be considered unstable.
pub fn encode_single_token(&self, piece: &[u8]) -> Result<usize, EncodeError>
Encodes text corresponding to a single token to its token value.
NOTE: this will encode all special tokens.
impl Encoding
Public interfaces for decoding
pub fn decode(&self, tokens: &Vec<usize>, mode: DecodeMode) -> Result<String, EncodeError>
Decodes a list of tokens into a string.
WARNING: decoded bytes are not guaranteed to be valid UTF-8.
You can control this behaviour using the mode parameter:
- Strict mode checks validity and returns Err if the decoded bytes are not valid UTF-8.
- Replace mode replaces invalid UTF-8 sequences with U+FFFD.
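The difference between the two modes can be sketched with the standard library alone. This is an illustration of the behaviour described above, not the crate's code: the `Mode` enum and `bytes_to_string` are hypothetical stand-ins, and only `String::from_utf8` and `String::from_utf8_lossy` are real std calls.

```rust
/// Hypothetical stand-in for the crate's `DecodeMode`; only the two modes
/// named in the text are modelled.
enum Mode {
    Strict,
    Replace,
}

/// Converts raw decoded bytes to a `String` according to `mode`.
fn bytes_to_string(bytes: Vec<u8>, mode: Mode) -> Result<String, std::string::FromUtf8Error> {
    match mode {
        // Fails if the bytes are not valid UTF-8.
        Mode::Strict => String::from_utf8(bytes),
        // Never fails; each invalid sequence becomes U+FFFD.
        Mode::Replace => Ok(String::from_utf8_lossy(&bytes).into_owned()),
    }
}
```

For example, the bytes `[0x68, 0x69, 0xC3]` end with a truncated two-byte sequence, so the strict variant returns an error while the replacing variant yields "hi" followed by U+FFFD.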
pub fn decode_single_token_bytes(&self, token: usize) -> Result<Vec<u8>, EncodeError>
Decodes a token into bytes. NOTE: this will decode all special tokens.
pub fn decode_tokens_bytes(&self, tokens: &Vec<usize>) -> Result<Vec<Vec<u8>>, EncodeError>
Decodes a list of tokens into a list of bytes. Useful for visualising tokenisation.
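One way to visualise the per-token byte pieces is sketched below. `render_pieces` is a hypothetical helper, not part of the crate, and the sample pieces in the usage note are invented for illustration rather than taken from a real vocabulary. The sketch also shows why per-token bytes need not be valid UTF-8 on their own: a multi-byte character can be split across tokens.

```rust
/// Renders each token's bytes as one piece, separated by " | ".
/// Valid UTF-8 pieces are shown as quoted strings; other pieces are shown
/// as lowercase hex bytes.
fn render_pieces(pieces: &[Vec<u8>]) -> String {
    pieces
        .iter()
        .map(|p| match std::str::from_utf8(p) {
            Ok(s) => format!("{:?}", s),
            Err(_) => format!("{:02x?}", p),
        })
        .collect::<Vec<_>>()
        .join(" | ")
}
```

With the invented pieces `b"hi"`, `[0xC3]`, and `[0xA9]`, the last two are each invalid on their own and are rendered as hex, yet concatenating all three gives the valid string "hié".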
impl Encoding
Miscellaneous interfaces
pub fn token_byte_values(&self) -> Vec<Vec<u8>>
Returns the list of all token byte values.
pub fn eot_token(&self) -> Option<usize>
pub fn n_vocab(&self) -> usize
For backwards compatibility. Prefer to use enc.max_token_value + 1.