Struct IntervalEncoding

Source
pub struct IntervalEncoding<'a> { /* private fields */ }
Expand description

This data structure allows fast, i.e. typically O(1), counting of tokens for arbitrary substrings of the original input text. It achieves this by precomputing for every position the last token which ends at this position. These last tokens represent a token tree with its root being the empty input text where each path starting at the root represents the encoded tokens of the corresponding text prefix. The struct stores a topological ordering in tree_id over this tree which then enables O(1) testing whether one node is the predecessor of another node. With the tree_depth field the number of path length (which is equivalent to the number of encoded tokens) can be determined in O(1) as well.

Note: the fields tree_end and tree_depth could also be represented by succinct data structures, reducing their size drastically. Since we still need the tree_id and last_token fields, this would in total reduce memory footprint by a bit less than 50%.

Implementations§

Source§

impl<'a> IntervalEncoding<'a>

Source

pub fn new(bpe: &'a BytePairEncoding, text: &'a [u8]) -> IntervalEncoding<'a>

Source

pub fn count(&self, range: Range<usize>) -> usize

Computes in typically O(1) time the number of tokens required to encode the specified range. Thereby it reencodes the prefix with the BacktrackEncoder until the encoding sequence becomes compatible with the precomputed tables. Once that’s the case, the remainder of the range becomes a simple O(1) lookup.

Note: in the worst-case the complexity is O(n). This happens for instance for a whitespace input where the encoding changes when the starting position changes.

Auto Trait Implementations§

§

impl<'a> Freeze for IntervalEncoding<'a>

§

impl<'a> RefUnwindSafe for IntervalEncoding<'a>

§

impl<'a> Send for IntervalEncoding<'a>

§

impl<'a> Sync for IntervalEncoding<'a>

§

impl<'a> Unpin for IntervalEncoding<'a>

§

impl<'a> UnwindSafe for IntervalEncoding<'a>

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.