pub struct GreedyTokenizer<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq, SamRef: Deref<Target = GeneralSam<TransTable>>> { /* private fields */ }
Greedy tokenizer with a general suffix automaton of the vocabulary.
Let $n$ be the input length, $l$ the maximum word length, and $\Sigma$ the alphabet size, and assume that querying a transition in the trie takes $\mathcal{O}\left(\log{\Sigma}\right)$ time. The overall time complexity of this implementation is then $\mathcal{O}\left( n \cdot \left( \log{l} + \log{\Sigma} \right) \right)$.
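One way to read this bound, assuming it decomposes per input element into one transition plus one rope lookup (an interpretation of the stated complexity, not something the crate documents), is shown below next to the worst case of a matcher that restarts the trie walk at every token boundary:

$$
n \cdot \Big( \underbrace{\mathcal{O}\left(\log{l}\right)}_{\text{rope lookup}} + \underbrace{\mathcal{O}\left(\log{\Sigma}\right)}_{\text{one transition}} \Big) = \mathcal{O}\left( n \cdot \left( \log{l} + \log{\Sigma} \right) \right)
\qquad \text{versus} \qquad
\mathcal{O}\left( n \cdot l \cdot \log{\Sigma} \right)
$$

The right-hand bound arises when each restarted walk follows up to $l$ transitions but the emitted token covers only one input element.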
The main optimization is to store suffix-wise information in persistent ropes. For each suffix in a state of the suffix automaton, the rope stores the longest word that matches a prefix of that suffix. The information stored in a state is then merged into the ropes of its successors.
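The per-suffix information described above can be illustrated with a brute-force sketch. It is not the crate's implementation: `NaiveTrieNode`, `build_naive_trie`, and `longest_word_per_suffix` are hypothetical names, the trie is a plain `BTreeMap` trie, and no suffix automaton or rope is involved. It only computes, for each suffix of an input, the quantity the ropes are said to store.

```rust
use std::collections::BTreeMap;

/// A minimal byte trie (hypothetical helper, not a type from the crate).
#[derive(Default)]
struct NaiveTrieNode {
    children: BTreeMap<u8, NaiveTrieNode>,
    /// True if the path from the root to this node spells a vocabulary word.
    is_word: bool,
}

/// Inserts every vocabulary word into a fresh trie.
fn build_naive_trie(vocab: &[&str]) -> NaiveTrieNode {
    let mut root = NaiveTrieNode::default();
    for word in vocab {
        let mut node = &mut root;
        for &b in word.as_bytes() {
            node = node.children.entry(b).or_default();
        }
        node.is_word = true;
    }
    root
}

/// For every start position `i`, returns the length of the longest vocabulary
/// word that is a prefix of `text[i..]`, or 0 if no word matches there.
/// Each position restarts from the root, so this costs up to `l` transitions
/// per position instead of reusing work between suffixes.
fn longest_word_per_suffix(root: &NaiveTrieNode, text: &[u8]) -> Vec<usize> {
    (0..text.len())
        .map(|start| {
            let mut node = root;
            let mut best = 0;
            for (offset, b) in text[start..].iter().enumerate() {
                match node.children.get(b) {
                    Some(next) => node = next,
                    None => break,
                }
                if node.is_word {
                    best = offset + 1;
                }
            }
            best
        })
        .collect()
}
```

With the vocabulary `a`, `ab`, `bc`, `abcd` and the input `abcab`, the sketch yields one length per suffix; a greedy tokenizer only needs the entry at each token boundary, which is why sharing this information between related suffixes pays off.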
Implementations
impl<TransTable: TransitionTable, SamRef: Deref<Target = GeneralSam<TransTable>>> GreedyTokenizer<TransTable, TrieNodeID, SamRef>
pub fn build_from_trie<TT: TransitionTable<KeyType = TransTable::KeyType>>( sam: SamRef, trie_state: TrieState<TT, &Trie<TT>>, ) -> Self
impl<TransTable: TransitionTable> GreedyTokenizer<TransTable, TrieNodeID, OwnedGeneralSam<TransTable>>
pub fn build_from_sam_and_trie<TT: TransitionTable<KeyType = TransTable::KeyType>>( sam: GeneralSam<TransTable>, trie_state: TrieState<TT, &Trie<TT>>, ) -> Self
impl<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq> GreedyTokenizer<TransTable, TokenIDType, OwnedGeneralSam<TransTable>>
pub fn build_from_sam<TN: TrieNodeAlike<InnerType = TransTable::KeyType>, F: FnMut(&TN) -> TokenIDType>( sam: GeneralSam<TransTable>, trie_node: TN, f: F, ) -> Self
impl<TransTable: TransitionTable, TokenIDType: Clone + Default + PartialEq, SamRef: Deref<Target = GeneralSam<TransTable>>> GreedyTokenizer<TransTable, TokenIDType, SamRef>
pub fn get_sam(&self) -> &SamRef
pub fn get_sam_ref(&self) -> &GeneralSam<TransTable>
pub fn get_suffix_data(&self) -> &Vec<SuffixInTrieData<TokenIDType>>
pub fn inner_as_ref( &self, ) -> GreedyTokenizer<TransTable, TokenIDType, &GeneralSam<TransTable>>
pub fn build<TN: TrieNodeAlike<InnerType = TransTable::KeyType>, F: FnMut(&TN) -> TokenIDType>( sam: SamRef, trie_node: TN, f: F, ) -> Self
pub fn tokenize<Iter: IntoIterator<Item = TransTable::KeyType>>( &self, iter: Iter, unk_token_id: &TokenIDType, ) -> Vec<(TokenIDType, usize)>
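As a reference point for the shape of the `tokenize` result, here is a brute-force greedy longest-match sketch. It does not use the crate: `naive_greedy_tokenize` is a hypothetical function over a plain word list with `u32` token IDs. It assumes that the `usize` in each returned pair is the number of input elements covered by the token, and that adjacent unknown spans are merged into a single entry; both are assumptions about the behaviour, not statements from this page.

```rust
/// Greedy longest-match tokenization by brute force (hypothetical helper).
/// Returns `(token_id, length)` pairs, where `length` counts input bytes.
/// Assumption: consecutive unknown bytes are merged into one entry.
fn naive_greedy_tokenize(
    vocab: &[(&str, u32)],
    text: &str,
    unk_token_id: u32,
) -> Vec<(u32, usize)> {
    let bytes = text.as_bytes();
    let mut out: Vec<(u32, usize)> = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        // Longest vocabulary word that is a prefix of the remaining input.
        let best = vocab
            .iter()
            .filter(|(word, _)| !word.is_empty() && bytes[pos..].starts_with(word.as_bytes()))
            .max_by_key(|(word, _)| word.len());
        let (id, len) = match best {
            Some(&(word, id)) => (id, word.len()),
            // No word matches here: consume a single byte as unknown.
            None => (unk_token_id, 1),
        };
        // Extend the previous entry if both it and the new token are unknown.
        let merged = match out.last_mut() {
            Some((last_id, last_len)) if *last_id == unk_token_id && id == unk_token_id => {
                *last_len += len;
                true
            }
            _ => false,
        };
        if !merged {
            out.push((id, len));
        }
        pos += len;
    }
    out
}
```

The sketch rescans the whole vocabulary at every token boundary, so it is only useful as a small oracle for checking expectations about greedy output, not as a substitute for the suffix-automaton-based tokenizer.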
Trait Implementations
impl<TransTable: Clone + TransitionTable, TokenIDType: Clone + Clone + Default + PartialEq, SamRef: Clone + Deref<Target = GeneralSam<TransTable>>> Clone for GreedyTokenizer<TransTable, TokenIDType, SamRef>
fn clone(&self) -> GreedyTokenizer<TransTable, TokenIDType, SamRef>
Returns a duplicate of the value.
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from `source`.
impl<TransTable: Debug + TransitionTable, TokenIDType: Debug + Clone + Default + PartialEq, SamRef: Debug + Deref<Target = GeneralSam<TransTable>>> Debug for GreedyTokenizer<TransTable, TokenIDType, SamRef>
Auto Trait Implementations
impl<TransTable, TokenIDType, SamRef> Freeze for GreedyTokenizer<TransTable, TokenIDType, SamRef> where
SamRef: Freeze,
impl<TransTable, TokenIDType, SamRef> RefUnwindSafe for GreedyTokenizer<TransTable, TokenIDType, SamRef> where
SamRef: RefUnwindSafe,
TokenIDType: RefUnwindSafe,
impl<TransTable, TokenIDType, SamRef> Send for GreedyTokenizer<TransTable, TokenIDType, SamRef>
impl<TransTable, TokenIDType, SamRef> Sync for GreedyTokenizer<TransTable, TokenIDType, SamRef>
impl<TransTable, TokenIDType, SamRef> Unpin for GreedyTokenizer<TransTable, TokenIDType, SamRef> where
SamRef: Unpin,
impl<TransTable, TokenIDType, SamRef> UnwindSafe for GreedyTokenizer<TransTable, TokenIDType, SamRef> where
SamRef: UnwindSafe,
TokenIDType: RefUnwindSafe,
Blanket Implementations
impl<T> BorrowMut<T> for T where
T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.