pub struct IntervalEncoding<'a> { /* private fields */ }
Expand description
This data structure allows fast, i.e. typically O(1), counting of tokens for arbitrary substrings of the original input text.
It achieves this by precomputing for every position the last token which ends at this position.
These last tokens represent a token tree with its root being the empty input text where each path starting at the root represents
the encoded tokens of the corresponding text prefix.
The struct stores a topological ordering in tree_id
over this tree which then enables O(1) testing whether one node
is the predecessor of another node.
With the tree_depth
field the number of path length (which is equivalent to the number of encoded tokens) can be determined
in O(1) as well.
Note: the fields tree_end
and tree_depth
could also be represented by succinct data structures, reducing their size drastically.
Since we still need the tree_id
and last_token
fields, this would in total reduce memory footprint by a bit less than 50%.
Implementations§
Source§impl<'a> IntervalEncoding<'a>
impl<'a> IntervalEncoding<'a>
pub fn new(bpe: &'a BytePairEncoding, text: &'a [u8]) -> IntervalEncoding<'a>
Sourcepub fn count(&self, range: Range<usize>) -> usize
pub fn count(&self, range: Range<usize>) -> usize
Computes in typically O(1) time the number of tokens required to encode the specified range.
Thereby it reencodes the prefix with the BacktrackEncoder
until the encoding sequence becomes
compatible with the precomputed tables. Once that’s the case, the remainder of the range becomes
a simple O(1) lookup.
Note: in the worst-case the complexity is O(n). This happens for instance for a whitespace input where the encoding changes when the starting position changes.
Auto Trait Implementations§
impl<'a> Freeze for IntervalEncoding<'a>
impl<'a> RefUnwindSafe for IntervalEncoding<'a>
impl<'a> Send for IntervalEncoding<'a>
impl<'a> Sync for IntervalEncoding<'a>
impl<'a> Unpin for IntervalEncoding<'a>
impl<'a> UnwindSafe for IntervalEncoding<'a>
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left
is true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self
into a Left
variant of Either<Self, Self>
if into_left(&self)
returns true
.
Converts self
into a Right
variant of Either<Self, Self>
otherwise. Read more