Struct general_sam::utils::tokenize::GreedyTokenizer
pub struct GreedyTokenizer<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq, SAMRef: Deref<Target = GeneralSAM<TransTable>>> { /* private fields */ }
Available on crate feature utils only.
Greedy tokenizer with a general suffix automaton of the vocabulary.
Assuming that the input length is $n$, the maximum word length is $l$, and querying transitions in the trie takes $\mathcal{O}\left(\log{\Sigma}\right)$ time, then the overall time complexity of this implementation is $\mathcal{O}\left( n \cdot \left( \log{l} + \log{\Sigma} \right) \right)$.
The main optimization is to store per-suffix information in persistent ropes. For each suffix represented by a state of the suffix automaton, the rope stores the longest word that matches a prefix of that suffix. The information stored in a state is then merged into the ropes of its successors.
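To make the expected behavior concrete, here is a minimal sketch of a naive greedy longest-match tokenizer over bytes. It is not this crate's implementation: it uses a plain `HashMap` instead of a suffix automaton with ropes, and it runs in roughly $\mathcal{O}\left(n \cdot l\right)$ hash lookups rather than the bound stated above. The names `build_vocab` and `greedy_tokenize_naive` are hypothetical, and two points are assumptions: that the `usize` in each output pair is the matched length, and that adjacent unknown tokens are merged into one entry.

```rust
use std::collections::HashMap;

/// Hypothetical helper: build a byte-keyed vocabulary from (word, id) pairs.
fn build_vocab(words: &[(&str, u32)]) -> HashMap<Vec<u8>, u32> {
    words
        .iter()
        .map(|(word, id)| (word.as_bytes().to_vec(), *id))
        .collect()
}

/// Naive greedy tokenization: at each position, take the longest vocabulary
/// word that is a prefix of the remaining input. Returns (token id, length)
/// pairs. An unmatched byte becomes `unk` with length 1, and adjacent `unk`
/// entries are merged (an assumption made for this sketch).
fn greedy_tokenize_naive(
    vocab: &HashMap<Vec<u8>, u32>,
    input: &[u8],
    unk: u32,
) -> Vec<(u32, usize)> {
    // The maximum word length $l$ bounds how far ahead we need to look.
    let max_len = vocab.keys().map(|w| w.len()).max().unwrap_or(0);
    let mut out: Vec<(u32, usize)> = Vec::new();
    let mut pos = 0;

    while pos < input.len() {
        let limit = max_len.min(input.len() - pos);
        // Try the longest candidate first, then shorter ones.
        let hit = (1..=limit)
            .rev()
            .find_map(|len| vocab.get(&input[pos..pos + len]).map(|&id| (id, len)));
        let (id, len) = hit.unwrap_or((unk, 1));

        // Extend the previous entry if both it and the new token are unknown.
        let merged = match out.last_mut() {
            Some(last) if last.0 == unk && id == unk => {
                last.1 += len;
                true
            }
            _ => false,
        };
        if !merged {
            out.push((id, len));
        }
        pos += len;
    }
    out
}
```

With the vocabulary `a`, `ab`, `abc`, `bc`, `d` and the input `abcdxyab`, this sketch picks `abc`, then `d`, then a single unknown run covering `xy`, then `ab`. The suffix automaton approach described above aims to produce the same kind of segmentation without rescanning up to $l$ candidates at every position.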
Implementations
impl<TransTable: TransitionTable, SAMRef: Deref<Target = GeneralSAM<TransTable>>> GreedyTokenizer<TransTable, TrieNodeID, SAMRef>
pub fn build_from_trie<TT: TransitionTable<KeyType = TransTable::KeyType>>( sam: SAMRef, trie_state: TrieState<TT, &Trie<TT>> ) -> Self
Available on crate feature trie only.
impl<TransTable: TransitionTable> GreedyTokenizer<TransTable, TrieNodeID, OwnedGeneralSAM<TransTable>>
pub fn build_from_sam_and_trie<TT: TransitionTable<KeyType = TransTable::KeyType>>( sam: GeneralSAM<TransTable>, trie_state: TrieState<TT, &Trie<TT>> ) -> Self
Available on crate feature trie only.
impl<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq> GreedyTokenizer<TransTable, TokenIDType, OwnedGeneralSAM<TransTable>>
pub fn build_from_sam<TN: TrieNodeAlike<InnerType = TransTable::KeyType>, F: FnMut(&TN) -> TokenIDType>( sam: GeneralSAM<TransTable>, trie_node: TN, f: F ) -> Self
impl<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq, SAMRef: Deref<Target = GeneralSAM<TransTable>>> GreedyTokenizer<TransTable, TokenIDType, SAMRef>
pub fn get_sam_ref(&self) -> &GeneralSAM<TransTable>
pub fn inner_as_ref( &self ) -> GreedyTokenizer<TransTable, TokenIDType, &GeneralSAM<TransTable>>
pub fn build<TN: TrieNodeAlike<InnerType = TransTable::KeyType>, F: FnMut(&TN) -> TokenIDType>( sam: SAMRef, trie_node: TN, f: F ) -> Self
pub fn tokenize<Iter: Iterator<Item = TransTable::KeyType>>( &self, iter: Iter, unk_token_id: &TokenIDType ) -> Vec<(TokenIDType, usize)>
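The return type `Vec<(TokenIDType, usize)>` is not documented on this page. Assuming the `usize` in each pair is the number of input elements covered by that token (an assumption, not something stated here), the pairs can be turned into offset ranges with a running sum. The helper name `token_spans` is hypothetical and the sketch uses only the standard library.

```rust
use std::ops::Range;

/// Hypothetical helper: convert (token id, length) pairs into
/// (token id, start..end) ranges over the original input.
/// Assumes each length counts consumed input elements and that the tokens
/// cover the input contiguously from offset 0.
fn token_spans<T: Clone>(tokens: &[(T, usize)]) -> Vec<(T, Range<usize>)> {
    let mut start = 0;
    tokens
        .iter()
        .map(|(id, len)| {
            let span = start..start + len;
            start += len; // the next token begins where this one ends
            (id.clone(), span)
        })
        .collect()
}
```

Under that assumption, the ranges can be used to slice the original input, for example to recover the text behind an unknown token.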
Trait Implementations
impl<TransTable: Clone + TransitionTable, TokenIDType: Clone + Clone + Default + PartialEq, SAMRef: Clone + Deref<Target = GeneralSAM<TransTable>>> Clone for GreedyTokenizer<TransTable, TokenIDType, SAMRef>
fn clone(&self) -> GreedyTokenizer<TransTable, TokenIDType, SAMRef>
Returns a copy of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl<TransTable: Debug + TransitionTable, TokenIDType: Debug + Clone + Default + PartialEq, SAMRef: Debug + Deref<Target = GeneralSAM<TransTable>>> Debug for GreedyTokenizer<TransTable, TokenIDType, SAMRef>
Auto Trait Implementations
impl<TransTable, TokenIDType, SAMRef> RefUnwindSafe for GreedyTokenizer<TransTable, TokenIDType, SAMRef> where SAMRef: RefUnwindSafe, TokenIDType: RefUnwindSafe,
impl<TransTable, TokenIDType, SAMRef> Send for GreedyTokenizer<TransTable, TokenIDType, SAMRef> where SAMRef: Send, TokenIDType: Send + Sync,
impl<TransTable, TokenIDType, SAMRef> Sync for GreedyTokenizer<TransTable, TokenIDType, SAMRef> where SAMRef: Sync, TokenIDType: Send + Sync,
impl<TransTable, TokenIDType, SAMRef> Unpin for GreedyTokenizer<TransTable, TokenIDType, SAMRef> where SAMRef: Unpin,
impl<TransTable, TokenIDType, SAMRef> UnwindSafe for GreedyTokenizer<TransTable, TokenIDType, SAMRef> where SAMRef: UnwindSafe, TokenIDType: RefUnwindSafe,
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.