Module interval_encoding

Source

Structsยง

IntervalEncoding
This data structure allows fast, i.e. typically O(1), counting of tokens for arbitrary substrings of the original input text. It achieves this by precomputing for every position the last token which ends at this position. These last tokens represent a token tree with its root being the empty input text where each path starting at the root represents the encoded tokens of the corresponding text prefix. The struct stores a topological ordering in tree_id over this tree which then enables O(1) testing whether one node is the predecessor of another node. With the tree_depth field the number of path length (which is equivalent to the number of encoded tokens) can be determined in O(1) as well.