Struct Encoding

pub struct Encoding { /* private fields */ }

Implementations

impl Encoding

Public interfaces for encoding

pub fn encode_ordinary(&self, text: &str) -> Vec<usize>

Encodes a string into tokens, ignoring special tokens.

This is equivalent to calling encode with an empty disallowed_special set, but slightly faster.

pub fn encode_ordinary_batch(&self, texts: Vec<&str>) -> Vec<Vec<usize>>

Encodes a list of strings into tokens, in parallel, ignoring special tokens.

This is equivalent to calling encode_batch with an empty disallowed_special set, but slightly faster.

pub fn encode(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<usize>>

Encodes a string into tokens.

Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Because they can be used to trick a model into doing something we don't want it to do, we want to avoid encoding them accidentally. Hence, by default, encode returns an error if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowed_special and disallowed_special parameters. In particular:

  • Setting disallowed_special to an empty set prevents this function from returning errors and causes all text corresponding to special tokens to be encoded as natural text.
  • Setting allowed_special to “All” causes this function to encode all text corresponding to special tokens as special tokens.
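The check described above can be sketched without the crate itself. The following is a minimal illustration using only the standard library; the function name find_disallowed_special and the token strings in the usage note are hypothetical stand-ins, not part of this crate's API, and the real implementation may differ.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of this crate): returns the first special
/// token that occurs in `text` and is not in the `allowed` set, if any.
/// A caller in the spirit of `encode` would turn `Some(_)` into an error.
fn find_disallowed_special<'a>(
    text: &str,
    special_tokens: &HashSet<&'a str>,
    allowed: &HashSet<&str>,
) -> Option<&'a str> {
    special_tokens
        .iter()
        .copied()
        // Tokens the caller explicitly allowed never trigger the check.
        .filter(|tok| !allowed.contains(*tok))
        // Keep only tokens that actually occur, with their byte position.
        .filter_map(|tok| text.find(tok).map(|pos| (pos, tok)))
        // Earliest occurrence wins; on a tie, prefer the longest token.
        .min_by_key(|&(pos, tok)| (pos, std::cmp::Reverse(tok.len())))
        .map(|(_, tok)| tok)
}
```

With a special-token set containing the illustrative string "<|endoftext|>", the helper reports that token for the input "hi <|endoftext|> there" when the allowed set is empty, and reports nothing once the token is added to the allowed set.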

pub fn encode_batch(&self, texts: Vec<&str>, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<Vec<Vec<usize>>>

Encodes a list of strings into tokens, in parallel.

See encode for more details on allowed_special and disallowed_special.
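As a rough picture of what "in parallel" means here, the sketch below fans a batch out over scoped threads and collects the results in input order. It is an assumption-laden illustration: encode_batch_with and the encode_one callback are hypothetical, the crate's own scheduling strategy is not documented on this page, and a real implementation would more likely use a thread pool than one thread per input.

```rust
use std::thread;

/// Hypothetical helper (not part of this crate): applies `encode_one` to each
/// text on its own scoped thread and returns the results in input order.
fn encode_batch_with<F>(texts: &[&str], encode_one: F) -> Vec<Vec<usize>>
where
    F: Fn(&str) -> Vec<usize> + Sync,
{
    thread::scope(|scope| {
        // Spawn every worker first so the texts are processed concurrently.
        let handles: Vec<_> = texts
            .iter()
            .map(|&text| {
                let f = &encode_one;
                scope.spawn(move || f(text))
            })
            .collect();
        // Joining in spawn order keeps outputs aligned with inputs.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("encoder thread panicked"))
            .collect()
    })
}
```

Any per-string encoder can be passed as the callback, for example a closure that forwards to a single-string encoding method.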

pub fn encode_with_unstable(&self, text: &str, allowed_special: AllowedSpecial<'_>, disallowed_special: DisallowedSpecial<'_>) -> Result<(Vec<usize>, Vec<Vec<usize>>)>

Encodes a string into stable tokens and possible completion sequences. Note that the stable tokens will only represent a substring of text. See encode for more details on allowed_special and disallowed_special. This API should itself be considered unstable.

pub fn encode_single_token(&self, piece: &[u8]) -> Result<usize>

Encodes text corresponding to a single token to its token value.

NOTE: this will encode all special tokens.

impl Encoding

Public interfaces for decoding

pub fn decode_bytes(&self, tokens: &[usize]) -> Vec<u8>

Decodes a list of tokens into bytes.

pub fn decode_bytes_batch(self, batch: &[&[usize]]) -> Vec<Vec<u8>>

Decodes a batch (list of lists of tokens) into a list of byte vectors, one per input list.

pub fn decode(&self, tokens: &[usize], mode: DecodeMode) -> Result<String>

Decodes a list of tokens into a string.

WARNING: decoded bytes are not guaranteed to be valid UTF-8. You can control this behaviour using the mode parameter:

  • Strict mode checks validity and returns Err if the decoded bytes are not valid UTF-8.
  • Replace mode replaces invalid UTF-8 sequences with U+FFFD.
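The difference between the two modes can be reproduced with the standard library alone. In this sketch the Mode enum and bytes_to_string are hypothetical stand-ins for DecodeMode and the final step of decode; they only mirror the behaviour described above and are not the crate's own code.

```rust
/// Hypothetical mirror of the two decoding modes described above.
#[derive(Clone, Copy, Debug)]
enum Mode {
    Strict,
    Replace,
}

/// Converts already-decoded token bytes into a String according to `mode`.
fn bytes_to_string(bytes: &[u8], mode: Mode) -> Result<String, std::str::Utf8Error> {
    match mode {
        // Strict: validate the bytes and fail on any invalid sequence.
        Mode::Strict => std::str::from_utf8(bytes).map(str::to_owned),
        // Replace: never fails; invalid sequences become U+FFFD.
        Mode::Replace => Ok(String::from_utf8_lossy(bytes).into_owned()),
    }
}
```

Strict is the safer choice when the token list is complete; Replace is convenient when decoding a truncated token stream, where the last token may end in the middle of a multi-byte character.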

pub fn decode_batch(&self, batch: &[&[usize]], mode: DecodeMode) -> Vec<Result<String>>

Decodes a batch (list of lists of tokens) into a list of strings.

pub fn decode_single_token_bytes(&self, token: usize) -> Result<Vec<u8>>

Decodes a token into bytes. NOTE: this will decode all special tokens.

pub fn decode_tokens_bytes(&self, tokens: &Vec<usize>) -> Result<Vec<Vec<u8>>>

Decodes a list of tokens into a list of byte vectors, one per token. Useful for visualising tokenisation.
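One way to use the per-token bytes for visualisation is to render each piece lossily and join the pieces with a visible separator. The helper below is a hypothetical sketch that takes the kind of Vec<Vec<u8>> this method returns; render_pieces is not part of this crate.

```rust
/// Hypothetical helper (not part of this crate): renders per-token byte
/// pieces as text, separated by '|', so token boundaries are visible.
fn render_pieces(pieces: &[Vec<u8>]) -> String {
    pieces
        .iter()
        // A single token may hold only part of a multi-byte character,
        // so decode each piece lossily rather than strictly.
        .map(|piece| String::from_utf8_lossy(piece).into_owned())
        .collect::<Vec<_>>()
        .join("|")
}
```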

pub fn decode_with_offsets(self, tokens: &Vec<usize>) -> Result<(String, Vec<usize>)>

Decodes a list of tokens into a string and a list of offsets.

Each offset is the index into the decoded text corresponding to the start of each token. If UTF-8 character boundaries do not line up with token boundaries, the offset is the index of the first character that contains bytes from the token.

This currently returns an error if given tokens that decode to invalid UTF-8; this behaviour may change in the future to be more permissive.

For example, decoding the tokens [31373, 995] returns ("hello world", [0, 5]).
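The offset rule can be worked through on raw byte pieces. The sketch below assumes the per-token bytes are already available (for instance from decode_tokens_bytes) and counts offsets in characters; offsets_from_pieces is a hypothetical name, and the crate's actual implementation may differ.

```rust
/// True for UTF-8 continuation bytes (0b10xx_xxxx).
fn is_continuation(byte: u8) -> bool {
    (0x80..0xC0).contains(&byte)
}

/// Hypothetical helper (not part of this crate): joins per-token byte pieces
/// into a string and records the character index at which each token starts.
fn offsets_from_pieces(
    pieces: &[&[u8]],
) -> Result<(String, Vec<usize>), std::string::FromUtf8Error> {
    let mut offsets = Vec::with_capacity(pieces.len());
    let mut chars_so_far = 0usize;
    for piece in pieces {
        // A piece that starts mid-character belongs to the character that
        // the previous piece began, so step back by one.
        let starts_mid_char = piece.first().map_or(false, |&b| is_continuation(b));
        offsets.push(chars_so_far.saturating_sub(usize::from(starts_mid_char)));
        // Every non-continuation byte starts a new character.
        chars_so_far += piece.iter().filter(|&&b| !is_continuation(b)).count();
    }
    // Strict conversion: invalid UTF-8 is reported as an error.
    let text = String::from_utf8(pieces.concat())?;
    Ok((text, offsets))
}
```

Given the pieces b"hello" and b" world", this yields the text "hello world" with offsets [0, 5], matching the example above. If a two-byte character is split across two pieces, the second piece receives the index of that character rather than of the one after it.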

impl Encoding

Miscellaneous interfaces

pub fn name(&self) -> &str

Returns the name of this encoding.

pub fn token_byte_values(&self) -> Vec<Vec<u8>>

Returns the list of all token byte values.

pub fn eot_token(&self) -> Option<usize>

pub fn n_vocab(&self) -> usize

For backwards compatibility. Prefer to use enc.max_token_value + 1.

pub fn special_tokens_set(&self) -> HashSet<&str>

Trait Implementations

impl Debug for Encoding

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Display for Encoding

Display

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<T> ToString for T
where T: Display + ?Sized,

fn to_string(&self) -> String

Converts the given value to a String.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.

impl<T> ErasedDestructor for T
where T: 'static,

impl<T> MaybeSendSync for T