pub struct Encoding { /* private fields */ }
Implementations§
impl Encoding
Public interfaces for encoding
pub fn encode_ordinary(&self, text: &str) -> Vec<usize>
Encodes a string into tokens, ignoring special tokens.
This is equivalent to calling encode with disallowed_special set to an empty collection, but slightly faster.
pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>>
Encodes a list of strings into tokens, in parallel, ignoring special tokens.
This is equivalent to calling encode_batch with disallowed_special set to an empty collection, but slightly faster.
pub fn encode(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<usize>>
Encodes a string into tokens.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Text that matches a special token can be used to trick a model into doing something we don't want it to do, so we want to avoid encoding special tokens by accident.
Hence, by default, encode returns an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowed_special and disallowed_special parameters. In particular:
- Setting disallowed_special to an empty collection prevents this function from returning errors and causes all text corresponding to special tokens to be encoded as natural text.
- Setting allowed_special to "All" causes this function to encode all text corresponding to special tokens as special tokens.
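The default check described above can be pictured with a small standalone sketch. It does not use this crate's types: `find_disallowed_special` is a hypothetical helper, and the naive substring scan is an assumption made for illustration, not necessarily how the crate detects special tokens.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of this crate): returns the disallowed
/// special token that appears earliest in `text`, if any.
/// An encoder following the rule above would turn `Some(..)` into an error.
fn find_disallowed_special<'a>(text: &str, disallowed: &HashSet<&'a str>) -> Option<&'a str> {
    disallowed
        .iter()
        .copied()
        // Record the byte position of each token that occurs in the text.
        .filter_map(|tok| text.find(tok).map(|pos| (pos, tok)))
        // Pick the earliest occurrence (ties broken by token text).
        .min()
        .map(|(_, tok)| tok)
}
```

With an empty `disallowed` set the helper always returns `None`, which corresponds to the case where special-token text is encoded as natural text.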
pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<Vec<usize>>>
Encodes a list of strings into tokens, in parallel.
See encode for more details on allowed_special and disallowed_special.
pub fn encode_with_unstable(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<(Vec<usize>, Vec<Vec<usize>>)>
Encodes a string into stable tokens and possible completion sequences.
Note that the stable tokens will only represent a substring of text.
See encode for more details on allowed_special and disallowed_special.
This API should itself be considered unstable.
pub fn encode_single_token(&self, piece: &[u8]) -> Result<usize>
Encodes text corresponding to a single token to its token value.
NOTE: this will encode all special tokens.
impl Encoding
Public interfaces for decoding
pub fn decode_bytes_batch(self, batch: &[&[usize]]) -> Vec<Vec<u8>>
Decodes a batch (list of lists of tokens) into a list of bytes.
pub fn decode(&self, tokens: &[usize], mode: DecodeMode) -> Result<String>
Decodes a list of tokens into a string.
WARNING: decoded bytes are not guaranteed to be valid UTF-8. You can control this behaviour using the mode parameter:
- Strict mode performs a validity check and returns Err if the decoded bytes are not valid UTF-8.
- Replace mode replaces invalid UTF-8 sequences with U+FFFD.
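The two modes map closely onto standard-library conversions. The sketch below is a standalone illustration, not this crate's implementation: `SketchDecodeMode` and `bytes_to_string` are hypothetical names, and the use of `String::from_utf8` and `String::from_utf8_lossy` is an assumption about how such behaviour could be achieved.

```rust
/// Hypothetical stand-in for the crate's `DecodeMode`; the variant names
/// follow the description above.
enum SketchDecodeMode {
    Strict,
    Replace,
}

/// Hypothetical helper: converts already-decoded token bytes to a `String`.
fn bytes_to_string(
    bytes: Vec<u8>,
    mode: SketchDecodeMode,
) -> Result<String, std::string::FromUtf8Error> {
    match mode {
        // Strict: fail if the bytes are not valid UTF-8.
        SketchDecodeMode::Strict => String::from_utf8(bytes),
        // Replace: substitute U+FFFD for each invalid sequence; never fails.
        SketchDecodeMode::Replace => Ok(String::from_utf8_lossy(&bytes).into_owned()),
    }
}
```

Strict is the safer choice when the output must round-trip exactly; Replace suits display or logging, where a lossy string is preferable to an error.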
pub fn decode_batch(&self, batch: &[&[usize]], mode: DecodeMode) -> Vec<Result<String>>
Decodes a batch (list of lists of tokens) into a list of strings.
pub fn decode_single_token_bytes(&self, token: usize) -> Result<Vec<u8>>
Decodes a token into bytes. NOTE: this will decode all special tokens.
pub fn decode_tokens_bytes(&self, tokens: &Vec<usize>) -> Result<Vec<Vec<u8>>>
Decodes a list of tokens into a list of bytes. Useful for visualising tokenisation.
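As a rough illustration of the visualisation use case, the per-token byte pieces can be rendered with explicit boundaries. `render_pieces` is a hypothetical helper, not part of this crate, and the pieces are supplied by hand rather than taken from a real encoding.

```rust
/// Hypothetical helper: wraps each token's bytes in brackets so that token
/// boundaries are visible. Invalid UTF-8 is shown as U+FFFD.
fn render_pieces(pieces: &[Vec<u8>]) -> String {
    pieces
        .iter()
        .map(|piece| format!("[{}]", String::from_utf8_lossy(piece)))
        .collect::<Vec<_>>()
        .join("")
}
```

For the pieces `hello` and ` world`, this produces `[hello][ world]`, which shows that the leading space belongs to the second token.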
pub fn decode_with_offsets(self, tokens: &Vec<usize>) -> Result<(String, Vec<usize>)>
Decodes a list of tokens into a string and a list of offsets.
Each offset is the index into text corresponding to the start of each token. If UTF-8 character boundaries do not line up with token boundaries, the offset is the index of the first character that contains bytes from the token.
This currently returns an error if given tokens that decode to invalid UTF-8; this behaviour may change in the future to be more permissive.
For example, enc.decode_with_offsets(&vec![31373, 995]) yields ("hello world", [0, 5]).
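The offset rule can be sketched independently of the crate by working on the per-token byte pieces directly. `offsets_from_pieces` is a hypothetical helper. It assumes the pieces concatenate to valid UTF-8 and counts offsets in characters, as the description suggests; the crate's actual implementation may differ.

```rust
/// Hypothetical helper: given the byte piece of each token, returns the
/// character index at which each token starts in the decoded string.
fn offsets_from_pieces(pieces: &[Vec<u8>]) -> Vec<usize> {
    // UTF-8 continuation bytes have the form 0b10xx_xxxx (0x80..=0xBF).
    let is_continuation = |b: u8| (0x80..0xC0).contains(&b);

    let mut chars_seen = 0usize;
    let mut offsets = Vec::with_capacity(pieces.len());
    for piece in pieces {
        // A piece that starts with a continuation byte begins inside a
        // character opened by an earlier piece, so point back at it.
        let starts_mid_char = piece.first().map_or(false, |&b| is_continuation(b));
        offsets.push(chars_seen.saturating_sub(starts_mid_char as usize));
        // Every non-continuation byte starts a new character.
        chars_seen += piece.iter().filter(|&&b| !is_continuation(b)).count();
    }
    offsets
}
```

For the pieces `hello` and ` world` the result is `[0, 5]`. If the two bytes of `é` are split across two pieces, the second piece's offset points back at the `é`.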
impl Encoding
Miscellaneous interfaces
pub fn token_byte_values(&self) -> Vec<Vec<u8>>
Returns the list of all token byte values.
pub fn eot_token(&self) -> Option<usize>
pub fn n_vocab(&self) -> usize
Provided for backwards compatibility. Prefer enc.max_token_value + 1.
pub fn special_tokens_set(&self) -> HashSet<&str>
Trait Implementations§
Auto Trait Implementations§
impl Freeze for Encoding
impl RefUnwindSafe for Encoding
impl Send for Encoding
impl Sync for Encoding
impl Unpin for Encoding
impl UnwindSafe for Encoding
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.