Struct general_sam::utils::tokenize::GreedyTokenizer
pub struct GreedyTokenizer<'s, TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq> { /* private fields */ }
Available on crate feature utils only.
Greedy tokenizer with a general suffix automaton of the vocabulary.
Assuming that the input length is $n$, the maximum word length is $l$, and querying transitions in the trie takes $\mathcal{O}\left(\log{\Sigma}\right)$ time, then the overall time complexity of this implementation is $\mathcal{O}\left( n \cdot \left( \log{l} + \log{\Sigma} \right) \right)$.
The main optimization is to store suffix-wise information in persistent ropes. For each suffix represented by a state of the suffix automaton, the rope stores the longest word that matches a prefix of that suffix. The information stored in a state is then merged into the ropes of its successors.
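For contrast with the complexity stated above, here is a minimal sketch of a naive greedy longest-match tokenizer using only the standard library. It is not this crate's implementation: it retries up to $l$ candidate lengths at every position, so it costs roughly $\mathcal{O}\left(n \cdot l\right)$ hash lookups, each of which hashes a slice of up to $l$ items. The function name `naive_greedy_tokenize`, the `HashMap` vocabulary, and the `u32` token IDs are hypothetical. The sketch also assumes that the `usize` in each output pair is the number of input items the token consumed and that adjacent unknown items are merged into one entry; neither is stated on this page.

```rust
use std::collections::HashMap;

/// Naive greedy longest-match tokenizer (illustrative baseline only).
/// At each position, take the longest vocabulary word that is a prefix of
/// the remaining input. Returns `(token id, items consumed)` pairs.
fn naive_greedy_tokenize(
    vocab: &HashMap<Vec<char>, u32>,
    input: &[char],
    unk_token_id: u32,
) -> Vec<(u32, usize)> {
    // Longest word in the vocabulary: the `l` in the complexity bound.
    let max_len = vocab.keys().map(|w| w.len()).max().unwrap_or(0);
    let mut out: Vec<(u32, usize)> = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        let longest = max_len.min(input.len() - pos);
        // Try candidate lengths from longest to shortest; first hit wins.
        let hit = (1..=longest)
            .rev()
            .find_map(|len| vocab.get(&input[pos..pos + len]).map(|&id| (id, len)));
        match hit {
            Some((id, len)) => {
                out.push((id, len));
                pos += len;
            }
            None => {
                // No word starts here: consume one item as unknown.
                // Assumption: a run of unknown items is merged into one entry.
                let merged = match out.last_mut() {
                    Some((last_id, last_len)) if *last_id == unk_token_id => {
                        *last_len += 1;
                        true
                    }
                    _ => false,
                };
                if !merged {
                    out.push((unk_token_id, 1));
                }
                pos += 1;
            }
        }
    }
    out
}
```

With a vocabulary of `a`, `ab`, `abc`, and `d`, the input `abcabdxxa` splits into `abc`, `ab`, `d`, one unknown run of length 2, and `a`. The suffix-automaton approach described above avoids the per-position retry loop.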
Implementations
impl<'s, TransTable: TransitionTable> GreedyTokenizer<'s, TransTable, TrieNodeID>
pub fn build_from_trie<'t, TT: TransitionTable<KeyType = TransTable::KeyType>>( sam: &'s GeneralSAM<TransTable>, trie_state: TrieState<'t, TT> ) -> Self
Available on crate feature trie only.
impl<'s, TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq> GreedyTokenizer<'s, TransTable, TokenIDType>
pub fn build<TN: TrieNodeAlike<InnerType = TransTable::KeyType>, F: FnMut(&TN) -> TokenIDType>( sam: &'s GeneralSAM<TransTable>, trie_node: TN, f: F ) -> Self
pub fn tokenize<Iter: Iterator<Item = TransTable::KeyType>>( &self, iter: Iter, unk_token_id: &TokenIDType ) -> Vec<(TokenIDType, usize)>
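The return type of `tokenize` is a list of `(TokenIDType, usize)` pairs. The sketch below shows one way such pairs could be turned into offsets into the input, assuming the `usize` is the number of input items covered by each token (an assumption; this page does not document the field). The helper name `token_spans` is hypothetical and not part of the crate.

```rust
use std::ops::Range;

/// Convert `(token id, length)` pairs into `(token id, start..end)` spans.
/// Assumes each length counts the input items consumed by that token, so
/// the spans are contiguous and start at offset 0.
fn token_spans<T: Clone>(tokens: &[(T, usize)]) -> Vec<(T, Range<usize>)> {
    let mut start = 0;
    tokens
        .iter()
        .map(|(id, len)| {
            let span = start..start + *len;
            start += *len; // the next token begins where this one ends
            (id.clone(), span)
        })
        .collect()
}
```

If the input was a sequence of `char`s, these spans index into the collected `Vec<char>`, not into UTF-8 byte offsets.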
Trait Implementations
impl<'s, TransTable: Clone + TransitionTable, TokenIDType: Clone + Clone + Default + PartialEq> Clone for GreedyTokenizer<'s, TransTable, TokenIDType>
fn clone(&self) -> GreedyTokenizer<'s, TransTable, TokenIDType>
Returns a copy of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations
impl<'s, TransTable, TokenIDType> RefUnwindSafe for GreedyTokenizer<'s, TransTable, TokenIDType> where
TokenIDType: RefUnwindSafe,
TransTable: RefUnwindSafe,
impl<'s, TransTable, TokenIDType> Send for GreedyTokenizer<'s, TransTable, TokenIDType>
impl<'s, TransTable, TokenIDType> Sync for GreedyTokenizer<'s, TransTable, TokenIDType>
impl<'s, TransTable, TokenIDType> Unpin for GreedyTokenizer<'s, TransTable, TokenIDType>
impl<'s, TransTable, TokenIDType> UnwindSafe for GreedyTokenizer<'s, TransTable, TokenIDType> where
TokenIDType: RefUnwindSafe,
TransTable: RefUnwindSafe,
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.