Struct tokenizers::tokenizer::normalizer::NormalizedString

pub struct NormalizedString { /* private fields */ }

A NormalizedString takes an “original” string and processes it to obtain a “normalized” string. It keeps both versions of the string, the alignment information between them, and provides an interface to retrieve ranges of either string using offsets from either one.

It is possible to retrieve a part of the original string by indexing it with offsets from the normalized one, and vice versa. Offsets can also be converted from one frame of reference to the other.
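
To make the idea of alignments concrete, here is a minimal standard-library sketch. It is not the crate's implementation: the `AlignedString` type, its methods, and the choice of counting offsets in chars are all assumptions made for illustration.

```rust
use std::ops::Range;

/// Toy stand-in for the idea described above (hypothetical type, not the
/// crate's). For every normalized char it stores the span of original chars
/// it came from. Offsets are counted in chars to keep the sketch short.
struct AlignedString {
    original: Vec<char>,
    normalized: Vec<char>,
    /// alignments[i] = (start, end) in `original` for normalized char i
    alignments: Vec<(usize, usize)>,
}

impl AlignedString {
    /// Normalize by dropping every char for which `keep` returns false.
    fn from_filtered(original: &str, keep: impl Fn(char) -> bool) -> Self {
        let original: Vec<char> = original.chars().collect();
        let mut normalized = Vec::new();
        let mut alignments = Vec::new();
        for (i, &c) in original.iter().enumerate() {
            if keep(c) {
                normalized.push(c);
                alignments.push((i, i + 1));
            }
        }
        AlignedString { original, normalized, alignments }
    }

    /// The normalized text.
    fn get(&self) -> String {
        self.normalized.iter().collect()
    }

    /// Convert a normalized range into the original range it covers.
    /// Returns None for an empty or out-of-range request.
    fn normalized_to_original(&self, range: Range<usize>) -> Option<Range<usize>> {
        let spans = self.alignments.get(range)?;
        let first = spans.first()?;
        let last = spans.last()?;
        Some(first.0..last.1)
    }

    /// The original text behind a range of the normalized string.
    fn original_for(&self, range: Range<usize>) -> Option<String> {
        let r = self.normalized_to_original(range)?;
        Some(self.original[r].iter().collect())
    }
}

fn main() {
    let s = AlignedString::from_filtered("Hello, World!", |c| c.is_alphanumeric());
    assert_eq!(s.get(), "HelloWorld");
    // "World" in the normalized string maps back to chars 7..12 of the original.
    assert_eq!(s.normalized_to_original(5..10), Some(7..12));
    // A range straddling the removed ", " recovers it from the original.
    assert_eq!(s.original_for(4..6).as_deref(), Some("o, W"));
    assert_eq!(s.normalized_to_original(8..20), None);
}
```

The point of storing a span per normalized char is that any later transformation only has to keep that one vector up to date for both directions of lookup to keep working.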

Implementations§

impl NormalizedString

pub fn get(&self) -> &str

Return the normalized string

pub fn get_original(&self) -> &str

Return the original string

pub fn offsets_original(&self) -> Offsets

Return the original offsets

pub fn convert_offsets<T>(&self, range: Range<T>) -> Option<Range<usize>>
where T: RangeBounds<usize> + Clone,

Convert the given offset range from one frame of reference to the other: Original => Normalized or Normalized => Original.

Returns None when the range targets something that is out of bounds.

pub fn get_range<T>(&self, range: Range<T>) -> Option<&str>
where T: RangeBounds<usize> + Clone,

Return a range of the normalized string

pub fn get_range_original<T>(&self, range: Range<T>) -> Option<&str>
where T: RangeBounds<usize> + Clone,

Return a range of the original string

pub fn slice<T>(&self, range: Range<T>) -> Option<NormalizedString>
where T: RangeBounds<usize> + Clone,

Return a slice of the current NormalizedString. If the range is not on char boundaries, return None.
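
The char-boundary condition can be illustrated with the standard library alone. This sketch assumes byte offsets and uses a hypothetical helper, `checked_slice`; it does not call the crate.

```rust
/// Hypothetical helper: returns the requested byte range of `s`, or None
/// when the range is out of bounds or cuts through a multi-byte character.
fn checked_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo": 'é' occupies bytes 1..3
    assert_eq!(checked_slice(s, 0, 1), Some("h"));
    assert_eq!(checked_slice(s, 1, 3), Some("\u{e9}"));
    assert_eq!(checked_slice(s, 1, 2), None); // would split 'é'
    assert!(!s.is_char_boundary(2));
}
```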

pub fn transform_range<T, I>( &mut self, range: Range<T>, dest: I, initial_offset: usize )
where T: RangeBounds<usize> + Clone, I: IntoIterator<Item = (char, isize)>,

Applies transformations to the current normalized version of the string, while updating the alignments. This method expects an Iterator yielding each char of the new normalized string with a change isize equal to:

  • 1 if this is a new char
  • -N if the char is right before N removed chars
  • 0 if the char is replacing the existing one

Since it is possible that the normalized string doesn’t include some of the characters at the beginning of the original one, we need an initial_offset which represents the number of removed chars at the very beginning.
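
The change values above can be traced with a simplified, standard-library-only sketch. The functions `apply_changes` and `identity` are hypothetical, offsets are counted in chars, and the way a new char borrows the span of the preceding char is an assumption of this sketch rather than a statement about the crate's internals.

```rust
/// Illustrative only: rebuilds a string and its alignments from
/// `(char, change)` pairs. `old_align[i]` is the original span behind
/// the i-th char of the current normalized string.
fn apply_changes(
    old_align: &[(usize, usize)],
    dest: impl IntoIterator<Item = (char, isize)>,
    initial_offset: usize,
) -> Option<(String, Vec<(usize, usize)>)> {
    // Chars removed at the very beginning are skipped up front.
    let mut old = initial_offset;
    let mut text = String::new();
    let mut align = Vec::new();
    for (c, change) in dest {
        let span = if change > 0 {
            // New char: reuse the span of the old char just before it.
            if old == 0 { (0, 0) } else { *old_align.get(old - 1)? }
        } else {
            // Replacement: consume one old char, then skip the N removed after it.
            let span = *old_align.get(old)?;
            old += 1 + change.unsigned_abs();
            span
        };
        text.push(c);
        align.push(span);
    }
    Some((text, align))
}

/// Alignments of an untouched string of `n` chars.
fn identity(n: usize) -> Vec<(usize, usize)> {
    (0..n).map(|i| (i, i + 1)).collect()
}

fn main() {
    // "abcd" -> "Axb": 'A' replaces 'a', 'x' is new, 'b' is kept and "cd" is removed.
    let dest: Vec<(char, isize)> = vec![('A', 0), ('x', 1), ('b', -2)];
    let (text, align) = apply_changes(&identity(4), dest, 0).unwrap();
    assert_eq!(text, "Axb");
    assert_eq!(align, vec![(0, 1), (0, 1), (1, 2)]);

    // "  hi" -> "hi": the two leading spaces are reported through initial_offset.
    let dest: Vec<(char, isize)> = vec![('h', 0), ('i', 0)];
    let (text, align) = apply_changes(&identity(4), dest, 2).unwrap();
    assert_eq!(text, "hi");
    assert_eq!(align, vec![(2, 3), (3, 4)]);
}
```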

pub fn transform<I>(&mut self, dest: I, initial_offset: usize)
where I: IntoIterator<Item = (char, isize)>,

Applies transformations to the current normalized version of the string, while updating the alignments. This method expects an Iterator yielding each char of the new normalized string with a change isize equal to:

  • 1 if this is a new char
  • -N if the char is right before N removed chars
  • 0 if the char is replacing the existing one

Since it is possible that the normalized string doesn’t include some of the characters at the beginning of the original one, we need an initial_offset which represents the number of removed chars at the very beginning.

pub fn nfd(&mut self) -> &mut Self

Applies NFD normalization

pub fn nfkd(&mut self) -> &mut Self

Applies NFKD normalization

pub fn nfc(&mut self) -> &mut Self

Applies NFC normalization

pub fn nfkc(&mut self) -> &mut Self

Applies NFKC normalization

pub fn filter<F: Fn(char) -> bool>(&mut self, keep: F) -> &mut Self

Filters the characters of the normalized string, keeping those for which keep returns true
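
One way to think about filtering is as a producer of the `(char, isize)` pairs that transform expects. The sketch below is an assumption about how such pairs could be built, not the crate's code; `filter_changes` is a hypothetical name.

```rust
/// Hypothetical helper: turns a keep-predicate into `(char, change)` pairs
/// plus an initial offset, following the convention documented for
/// `transform` (0 = kept as is, -N = followed by N removed chars).
fn filter_changes(s: &str, keep: impl Fn(char) -> bool) -> (Vec<(char, isize)>, usize) {
    let mut pairs: Vec<(char, isize)> = Vec::new();
    let mut initial_offset = 0usize;
    for c in s.chars() {
        if keep(c) {
            pairs.push((c, 0));
        } else if let Some(last) = pairs.last_mut() {
            // One more removed char right after the last kept char.
            last.1 -= 1;
        } else {
            // Removed before any kept char: counted in the initial offset.
            initial_offset += 1;
        }
    }
    (pairs, initial_offset)
}

fn main() {
    let (pairs, initial_offset) = filter_changes("  a b!!c ", |c| c.is_alphabetic());
    assert_eq!(initial_offset, 2);
    assert_eq!(pairs, vec![('a', -1isize), ('b', -2), ('c', -1)]);
}
```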

pub fn prepend(&mut self, s: &str) -> &mut Self

Prepend the given string to the normalized string

pub fn append(&mut self, s: &str) -> &mut Self

Append the given string to the normalized string

pub fn map<F: Fn(char) -> char>(&mut self, map: F) -> &mut Self

Map each character of the normalized string through the given function

pub fn for_each<F: FnMut(char)>(&self, foreach: F) -> &Self

Calls the given function for each character

pub fn lowercase(&mut self) -> &mut Self

Lowercase the normalized string

pub fn uppercase(&mut self) -> &mut Self

Uppercase the normalized string
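
Case mapping is a case where one char can expand into several, which is where the "new char" change value of transform becomes relevant. The following sketch uses only `char::to_uppercase` from the standard library; `uppercase_changes` is a hypothetical helper, and marking the extra chars with 1 is an assumption based on the convention documented for transform.

```rust
/// Hypothetical helper: uppercases `s` and labels each output char with a
/// change value: 0 for the char that replaces the input char, 1 for any
/// additional char produced by the same input char.
fn uppercase_changes(s: &str) -> Vec<(char, isize)> {
    s.chars()
        .flat_map(|c| {
            c.to_uppercase()
                .enumerate()
                .map(|(i, up)| (up, if i == 0 { 0 } else { 1 }))
        })
        .collect()
}

fn main() {
    // U+00DF (sharp s) uppercases to the two chars "SS".
    let pairs = uppercase_changes("stra\u{df}e");
    let text: String = pairs.iter().map(|&(c, _)| c).collect();
    assert_eq!(text, "STRASSE");
    assert_eq!(pairs[4], ('S', 0)); // replaces the sharp s
    assert_eq!(pairs[5], ('S', 1)); // extra char with no counterpart of its own
}
```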

pub fn replace<P: Pattern>(&mut self, pattern: P, content: &str) -> Result<()>

Replace anything that matches the pattern with the given content.

pub fn clear(&mut self) -> usize

Clear the normalized part of the string

pub fn split<P: Pattern>( &self, pattern: P, behavior: SplitDelimiterBehavior ) -> Result<Vec<NormalizedString>>

Split the current string into many subparts. Specify what to do with the delimiter.

§Splitting Behavior for the delimiter

The behavior can be one of the following. When splitting on '-' for example, with input the-final--countdown:

  • Removed => [ "the", "", "final", "", "", "countdown" ]
  • Isolated => [ "the", "-", "final", "-", "-", "countdown" ]
  • MergedWithPrevious => [ "the-", "final-", "-", "countdown" ]
  • MergedWithNext => [ "the", "-final", "-", "-countdown" ]
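
The four listed behaviors can be reproduced on a plain &str with a small sketch. The `Behavior` enum and `split_on` function are hypothetical stand-ins that mirror the lists above as written; they return owned strings rather than NormalizedString values, and the crate's actual output may differ in details such as how empty pieces are handled.

```rust
/// Hypothetical mirror of the four behaviors listed above.
enum Behavior {
    Removed,
    Isolated,
    MergedWithPrevious,
    MergedWithNext,
}

fn split_on(input: &str, delim: char, behavior: Behavior) -> Vec<String> {
    // First pass: separate delimiters from the text between them.
    let mut pieces: Vec<(String, bool)> = Vec::new(); // (text, is_delimiter)
    let mut current = String::new();
    for c in input.chars() {
        if c == delim {
            if !current.is_empty() {
                pieces.push((std::mem::take(&mut current), false));
            }
            pieces.push((c.to_string(), true));
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        pieces.push((current, false));
    }

    // Second pass: decide what happens to each delimiter.
    let mut out: Vec<String> = Vec::new();
    match behavior {
        Behavior::Isolated => out.extend(pieces.into_iter().map(|(t, _)| t)),
        Behavior::Removed => out.extend(
            pieces
                .into_iter()
                .map(|(t, is_delim)| if is_delim { String::new() } else { t }),
        ),
        Behavior::MergedWithPrevious => {
            let mut prev_was_text = false;
            for (t, is_delim) in pieces {
                if is_delim && prev_was_text {
                    // Glue the delimiter onto the text piece just before it.
                    if let Some(last) = out.last_mut() {
                        last.push_str(&t);
                    }
                } else {
                    out.push(t);
                }
                prev_was_text = !is_delim;
            }
        }
        Behavior::MergedWithNext => {
            let mut pending = String::new(); // delimiter waiting for the next text
            for (t, is_delim) in pieces {
                if is_delim {
                    if !pending.is_empty() {
                        out.push(std::mem::take(&mut pending));
                    }
                    pending = t;
                } else {
                    out.push(format!("{}{}", pending, t));
                    pending.clear();
                }
            }
            if !pending.is_empty() {
                out.push(pending);
            }
        }
    }
    out
}

fn main() {
    let input = "the-final--countdown";
    assert_eq!(split_on(input, '-', Behavior::Removed), ["the", "", "final", "", "", "countdown"]);
    assert_eq!(split_on(input, '-', Behavior::Isolated), ["the", "-", "final", "-", "-", "countdown"]);
    assert_eq!(split_on(input, '-', Behavior::MergedWithPrevious), ["the-", "final-", "-", "countdown"]);
    assert_eq!(split_on(input, '-', Behavior::MergedWithNext), ["the", "-final", "-", "-countdown"]);
}
```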

pub fn lstrip(&mut self) -> &mut Self

Remove any leading space(s) of the normalized string

pub fn rstrip(&mut self) -> &mut Self

Remove any trailing space(s) of the normalized string

pub fn strip(&mut self) -> &mut Self

Remove any leading and trailing space(s) of the normalized string

pub fn len(&self) -> usize

Returns the length of the normalized string (counting chars not bytes)

pub fn len_original(&self) -> usize

Returns the length of the original string (counting chars not bytes)
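
The distinction between counting chars and counting bytes matters for any non-ASCII input. A small standard-library illustration follows; it does not call the crate, and `char_len` is a hypothetical helper.

```rust
/// Length in chars, as opposed to `str::len`, which counts bytes.
fn char_len(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let s = "h\u{e9}llo"; // "héllo"
    assert_eq!(s.len(), 6); // bytes: 'é' takes two in UTF-8
    assert_eq!(char_len(s), 5); // chars
}
```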

pub fn is_empty(&self) -> bool

Returns whether the normalized string is empty

Trait Implementations§

impl Clone for NormalizedString

fn clone(&self) -> NormalizedString

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for NormalizedString

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for NormalizedString

fn default() -> NormalizedString

Returns the “default value” for a type.

impl From<&str> for NormalizedString

fn from(s: &str) -> Self

Converts to this type from the input type.

impl From<NormalizedString> for PreTokenizedString

fn from(s: NormalizedString) -> Self

Converts to this type from the input type.

impl From<NormalizedString> for Split

fn from(n: NormalizedString) -> Self

Converts to this type from the input type.

impl From<String> for NormalizedString

fn from(s: String) -> Self

Converts to this type from the input type.

impl PartialEq for NormalizedString

fn eq(&self, other: &NormalizedString) -> bool

This method tests for self and other values to be equal, and is used by ==.

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.

impl Eq for NormalizedString

impl StructuralPartialEq for NormalizedString

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize = _

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V