Struct tiktoken_rust::Encoding
pub struct Encoding { /* private fields */ }
Implementations
impl Encoding
Public interfaces for encoding
pub fn encode_ordinary(&self, text: &str) -> Vec<usize>
Encodes a string into tokens, ignoring special tokens.
This is equivalent to calling encode on text with an empty disallowed_special set (but slightly faster).
pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>>
Encodes a list of strings into tokens, in parallel, ignoring special tokens.
This is equivalent to calling encode_batch on texts with an empty disallowed_special set (but slightly faster).
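The page does not say how the parallelism is implemented. As a rough illustration of what "in parallel" means for a batch, the sketch below uses only the standard library. The names toy_encode and toy_encode_batch are hypothetical, and the one-token-per-byte encoder is a stand-in, not the crate's tokenizer.

```rust
use std::thread;

/// Toy stand-in for a tokenizer: one token per byte.
/// Hypothetical; this is NOT how tiktoken_rust encodes text.
fn toy_encode(text: &str) -> Vec<usize> {
    text.bytes().map(usize::from).collect()
}

/// Encodes each input on its own scoped thread and returns the
/// results in the same order as the inputs.
fn toy_encode_batch(texts: &[&str]) -> Vec<Vec<usize>> {
    thread::scope(|s| {
        // Spawn every worker first so they can all run concurrently.
        let handles: Vec<_> = texts
            .iter()
            .map(|&t| s.spawn(move || toy_encode(t)))
            .collect();
        // Joining in spawn order preserves the input order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let batch = toy_encode_batch(&["ab", "c"]);
    assert_eq!(batch, vec![vec![97, 98], vec![99]]);
}
```

One thread per input is only reasonable for small batches; a real implementation would more likely use a thread pool.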
pub fn encode(
    &self,
    text: &str,
    allowed_special: AllowedSpecial<'_>,
    disallowed_special: DisallowedSpecial<'_>
) -> Result<Vec<usize>>
Encodes a string into tokens.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Because they can be used to trick a model into doing something we don’t want it to do, we want to avoid encoding them accidentally.
Hence, by default, encode returns an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowed_special and disallowed_special parameters. In particular:
- Setting disallowed_special to an empty set prevents this function from returning an error and causes all text corresponding to special tokens to be encoded as natural text.
- Setting allowed_special to “All” causes all text corresponding to special tokens to be encoded as special tokens.
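The disallowed-token check can be pictured with a minimal standard-library sketch. The function check_special is hypothetical and not part of tiktoken_rust. It uses a plain HashSet in place of the crate's DisallowedSpecial type, whose variants are not shown on this page, and "<|endoftext|>" is only an example special token.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's check; not part of tiktoken_rust.
/// Returns Err if `text` contains any token from `disallowed`.
fn check_special(text: &str, disallowed: &HashSet<&str>) -> Result<(), String> {
    for token in disallowed {
        if text.contains(token) {
            return Err(format!("text contains disallowed special token {token:?}"));
        }
    }
    Ok(())
}

fn main() {
    let text = "hello <|endoftext|>";

    // Default-like policy: the special token is disallowed, so the check fails.
    let all: HashSet<&str> = HashSet::from(["<|endoftext|>"]);
    assert!(check_special(text, &all).is_err());

    // Empty disallowed set: no error, so the same text would simply be
    // encoded as ordinary text.
    let none: HashSet<&str> = HashSet::new();
    assert!(check_special(text, &none).is_ok());
}
```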
pub fn encode_batch(
    &self,
    texts: Vec<&str>,
    allowed_special: AllowedSpecial<'_>,
    disallowed_special: DisallowedSpecial<'_>
) -> Result<Vec<Vec<usize>>>
Encodes a list of strings into tokens, in parallel.
See encode for more details on allowed_special and disallowed_special.
pub fn encode_with_unstable(
    &self,
    text: &str,
    allowed_special: AllowedSpecial<'_>,
    disallowed_special: DisallowedSpecial<'_>
) -> Result<(Vec<usize>, Vec<Vec<usize>>)>
Encodes a string into stable tokens and possible completion sequences.
Note that the stable tokens will only represent a substring of text.
See encode for more details on allowed_special and disallowed_special.
This API should itself be considered unstable.
pub fn encode_single_token(&self, piece: &[u8]) -> Result<usize>
Encodes text corresponding to a single token to its token value.
NOTE: this will encode all special tokens.
impl Encoding
Public interfaces for decoding
pub fn decode(&self, tokens: &[usize], mode: DecodeMode) -> Result<String>
Decodes a list of tokens into a string.
WARNING: decoded bytes are not guaranteed to be valid UTF-8.
You can control this behaviour using the mode parameter:
- Strict mode validates the decoded bytes and returns Err if they are not valid UTF-8.
- Replace mode replaces invalid UTF-8 sequences with U+FFFD.
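The difference between the two modes can be shown with the standard library alone. In the sketch below, the Mode enum and bytes_to_string are hypothetical names that mirror the description above. They are not the crate's DecodeMode or its decode implementation, which may differ.

```rust
/// Hypothetical mirror of the two modes described above.
/// This is not the crate's own DecodeMode type.
enum Mode {
    Strict,
    Replace,
}

/// Converts raw bytes to a String according to `mode`.
fn bytes_to_string(bytes: &[u8], mode: Mode) -> Result<String, std::str::Utf8Error> {
    match mode {
        // Fail on the first invalid UTF-8 sequence.
        Mode::Strict => std::str::from_utf8(bytes).map(str::to_owned),
        // Substitute U+FFFD for each invalid sequence; never fails.
        Mode::Replace => Ok(String::from_utf8_lossy(bytes).into_owned()),
    }
}

fn main() {
    // The byte 0xFF can never appear in valid UTF-8.
    let bytes = [b'h', b'i', 0xFF];
    assert!(bytes_to_string(&bytes, Mode::Strict).is_err());
    assert_eq!(bytes_to_string(&bytes, Mode::Replace).unwrap(), "hi\u{FFFD}");
}
```

Invalid bytes can arise when a token sequence is cut in the middle of a multi-byte character, since a single token may hold only part of that character's bytes.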
impl Encoding
Miscellaneous interfaces
pub fn token_byte_values(&self) -> Vec<Vec<u8>>
Returns the list of all token byte values.
pub fn eot_token(&self) -> Option<usize>
pub fn n_vocab(&self) -> usize
Provided for backwards compatibility. Prefer enc.max_token_value + 1.