pub struct PreTokenizedString { /* private fields */ }

A PreTokenizedString is in charge of splitting an underlying string while keeping the splits aligned with it, and provides ways to normalize and tokenize these splits. Once every split has been normalized and tokenized, the PreTokenizedString can build an Encoding with all the relevant offsets and word ids, relative to the original string.

Implementations

impl PreTokenizedString

pub fn split<F, U, R>(&mut self, split_fn: F) -> Result<()>
where F: FnMut(usize, NormalizedString) -> Result<U>, U: IntoIterator<Item = R>, R: Into<Split>,

Split the PreTokenizedString by providing a split_fn in charge of splitting each substring (NormalizedString) into multiple parts.

split_fn takes the index of a split together with its NormalizedString, and is in charge of returning an iterator over the NormalizedString pieces it produces. split_fn is free to modify these NormalizedString as relevant, as long as it respects the constraint stated below.

There is only one constraint that MUST be respected:

The produced NormalizedString, if combined back together, must have the same original string as the original one given to split_fn. This concretely means that for the offset tracking to work as expected, split_fn must produce “splits” of the original string.
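
The constraint can be illustrated without the library. The following is a minimal, standard-library-only sketch, not the tokenizers implementation: `split_keep_whitespace` is a hypothetical helper that cuts a `&str` into pieces with byte offsets, and the checks in `main` mirror the requirement that the pieces, put back together, reproduce the input.

```rust
/// Hypothetical helper (not part of the tokenizers API): cut `s` into
/// alternating runs of whitespace and non-whitespace, keeping byte offsets.
fn split_keep_whitespace(s: &str) -> Vec<(&str, (usize, usize))> {
    let mut pieces = Vec::new();
    let mut start = 0;
    let mut prev_is_ws: Option<bool> = None;
    for (i, c) in s.char_indices() {
        let is_ws = c.is_whitespace();
        if let Some(prev) = prev_is_ws {
            // A change between whitespace and non-whitespace closes a piece.
            if prev != is_ws {
                pieces.push((&s[start..i], (start, i)));
                start = i;
            }
        }
        prev_is_ws = Some(is_ws);
    }
    if start < s.len() {
        pieces.push((&s[start..], (start, s.len())));
    }
    pieces
}

fn main() {
    let original = "Hello  world!";
    let pieces = split_keep_whitespace(original);

    // Concatenating the pieces gives back the original string...
    let rebuilt: String = pieces.iter().map(|(p, _)| *p).collect();
    assert_eq!(rebuilt, original);

    // ...so every recorded offset still points at the right slice of it.
    for (piece, (start, end)) in &pieces {
        assert_eq!(&original[*start..*end], *piece);
    }
}
```

Because nothing is dropped or reordered, the offsets of each piece can be mapped back to the original text, which is the property the constraint above protects.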

pub fn normalize<F>(&mut self, normalize: F) -> Result<()>
where F: Fn(&mut NormalizedString) -> Result<()>,

Normalize all the splits that do not have attached Tokens, using the provided normalize function.

pub fn tokenize<F>(&mut self, tokenize: F) -> Result<()>

Tokenize all the splits that do not have attached Tokens, using the provided tokenize function.
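
Both normalize and tokenize only touch splits that have no tokens yet. The sketch below models that rule with the standard library alone; `ToySplit`, `normalize_untokenized`, `tokenize_untokenized`, and `demo` are hypothetical names introduced for illustration and are not the crate's types or functions.

```rust
/// Toy stand-in for a split: some text plus optional tokens.
#[derive(Debug, Clone, PartialEq)]
struct ToySplit {
    text: String,
    tokens: Option<Vec<String>>,
}

/// Apply `normalize` only to splits that have no tokens attached.
fn normalize_untokenized<F: Fn(&mut String)>(splits: &mut [ToySplit], normalize: F) {
    for split in splits.iter_mut().filter(|s| s.tokens.is_none()) {
        normalize(&mut split.text);
    }
}

/// Attach tokens only to splits that have none yet.
fn tokenize_untokenized<F: Fn(&str) -> Vec<String>>(splits: &mut [ToySplit], tokenize: F) {
    for split in splits.iter_mut().filter(|s| s.tokens.is_none()) {
        split.tokens = Some(tokenize(&split.text));
    }
}

/// Build two splits, one already tokenized, and run both passes.
fn demo() -> Vec<ToySplit> {
    let mut splits = vec![
        ToySplit {
            text: "[CLS]".to_string(),
            tokens: Some(vec!["[CLS]".to_string()]),
        },
        ToySplit {
            text: "Hello World".to_string(),
            tokens: None,
        },
    ];
    // The already-tokenized "[CLS]" split is skipped by both passes.
    normalize_untokenized(&mut splits, |s: &mut String| *s = s.to_lowercase());
    tokenize_untokenized(&mut splits, |s: &str| {
        s.split_whitespace().map(str::to_string).collect()
    });
    splits
}

fn main() {
    for split in demo() {
        println!("{:?}", split);
    }
}
```

In this model the first split keeps its original casing and token, while the second is lowercased and then tokenized on whitespace.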

pub fn into_encoding( self, word_idx: Option<u32>, type_id: u32, offset_type: OffsetType ) -> Result<Encoding>

Transform the current PreTokenizedString into an Encoding.

If a word_idx is provided, any word in the generated Encoding will be set to this value. This is generally used with pre-tokenized input, which does not need the PreTokenizedString to generate word ids.

This method fails if any split does not have associated Tokens.
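
The two rules above (an optional word_idx override, and failure when a split has no tokens) can be sketched with a toy model. This is not the crate's code: `ToySplit` and `assign_word_ids` are hypothetical, and numbering words by split position when no word_idx is given is an assumption made for the illustration.

```rust
/// Toy stand-in for a split: some text plus optional tokens.
struct ToySplit {
    text: String,
    tokens: Option<Vec<String>>,
}

/// Pair every token with a word id.
/// Assumption: without an override, the word id is the split's position.
fn assign_word_ids(
    splits: &[ToySplit],
    word_idx: Option<u32>,
) -> Result<Vec<(String, u32)>, String> {
    let mut out = Vec::new();
    for (i, split) in splits.iter().enumerate() {
        // A split without tokens makes the whole conversion fail.
        let tokens = split
            .tokens
            .as_ref()
            .ok_or_else(|| format!("split {} ({:?}) has no tokens", i, split.text))?;
        let word = word_idx.unwrap_or(i as u32);
        out.extend(tokens.iter().map(|t| (t.clone(), word)));
    }
    Ok(out)
}

fn main() {
    let splits = vec![
        ToySplit {
            text: "hello".to_string(),
            tokens: Some(vec!["hello".to_string()]),
        },
        ToySplit {
            text: "unable".to_string(),
            tokens: Some(vec!["un".to_string(), "##able".to_string()]),
        },
    ];
    // No override: tokens of the same split share one word id.
    println!("{:?}", assign_word_ids(&splits, None));
    // Override: every token gets the provided word id.
    println!("{:?}", assign_word_ids(&splits, Some(7)));
}
```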

pub fn get_splits( &self, offset_ref: OffsetReferential, offset_type: OffsetType ) -> Vec<(&str, Offsets, &Option<Vec<Token>>)>

Returns a list of splits, each of them being a slice of the normalized string, the associated offsets either in original or normalized referential, as well as the potential tokens.
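
The unit in which offsets are expressed matters as soon as the text contains multi-byte characters. The sketch below assumes that offset_type chooses between byte and character units, which the text above does not spell out; `byte_to_char_offsets` is a hypothetical helper written with the standard library only, not a function of the crate.

```rust
/// Convert a (start, end) pair of byte offsets into character offsets.
/// Panics if either offset does not fall on a UTF-8 character boundary.
fn byte_to_char_offsets(s: &str, (start, end): (usize, usize)) -> (usize, usize) {
    (s[..start].chars().count(), s[..end].chars().count())
}

fn main() {
    let text = "héllo wörld";
    // "wörld" occupies bytes 7..13 because 'é' and 'ö' take two bytes each.
    let bytes = (7, 13);
    assert_eq!(&text[bytes.0..bytes.1], "wörld");
    // Counted in characters, the same span is 6..11.
    assert_eq!(byte_to_char_offsets(text, bytes), (6, 11));
}
```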

Trait Implementations

impl Clone for PreTokenizedString

fn clone(&self) -> PreTokenizedString

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for PreTokenizedString

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl From<&str> for PreTokenizedString

fn from(s: &str) -> Self

Converts to this type from the input type.

impl From<NormalizedString> for PreTokenizedString

fn from(s: NormalizedString) -> Self

Converts to this type from the input type.

impl From<String> for PreTokenizedString

fn from(s: String) -> Self

Converts to this type from the input type.

impl PartialEq for PreTokenizedString

fn eq(&self, other: &PreTokenizedString) -> bool

This method tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl Eq for PreTokenizedString

impl StructuralPartialEq for PreTokenizedString

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V