Trait ITokenizer 

pub trait ITokenizer: Send {
    // Required methods
    fn id(&self) -> &str;
    fn encode(&self, text: &str) -> Vec<u32>;
}

Common interface every tokenizer implementation satisfies.

Implemented by BPETokenizer and crate::longest_match::LongestMatchTokenizer.

The trait deliberately does not require Sync: BPETokenizer keeps a RefCell-backed encode cache (mirroring the .NET Dictionary), so it cannot be Sync. Wrap the tokenizer in a Mutex for cross-thread sharing.
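
A minimal sketch of the Mutex wrapping described above. The trait is redeclared locally so the snippet compiles on its own, and CachingByteTokenizer is a hypothetical stand-in (it is not the crate's BPETokenizer) that imitates a RefCell-backed cache. The helper encode_on_threads is likewise an illustrative name, not part of the crate.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

// Local copy of the documented trait so this sketch is self-contained.
pub trait ITokenizer: Send {
    fn id(&self) -> &str;
    fn encode(&self, text: &str) -> Vec<u32>;
}

// Hypothetical tokenizer with a RefCell-backed cache (assumption for
// illustration). RefCell makes the type Send but not Sync.
pub struct CachingByteTokenizer {
    cache: RefCell<HashMap<String, Vec<u32>>>,
}

impl CachingByteTokenizer {
    pub fn new() -> Self {
        Self { cache: RefCell::new(HashMap::new()) }
    }
}

impl ITokenizer for CachingByteTokenizer {
    fn id(&self) -> &str {
        "bytes-demo"
    }

    fn encode(&self, text: &str) -> Vec<u32> {
        // Look up the cache first; the shared borrow ends with this statement.
        let cached = self.cache.borrow().get(text).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        // Toy encoding: one token ID per UTF-8 byte.
        let ids: Vec<u32> = text.bytes().map(u32::from).collect();
        self.cache.borrow_mut().insert(text.to_owned(), ids.clone());
        ids
    }
}

// Share one tokenizer across threads by putting it behind Arc<Mutex<_>>.
// Mutex<T> is Sync whenever T is Send, which the trait guarantees.
pub fn encode_on_threads(inputs: Vec<String>) -> Vec<Vec<u32>> {
    let shared = Arc::new(Mutex::new(CachingByteTokenizer::new()));
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let tok = Arc::clone(&shared);
            thread::spawn(move || {
                let guard = tok.lock().unwrap();
                let ids = guard.encode(&text);
                ids
            })
        })
        .collect();
    // Join in spawn order so results line up with the inputs.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The lock serializes every encode call, which keeps the cache consistent at the cost of parallelism. If that becomes a bottleneck, giving each thread its own tokenizer instance is another option, since the trait already requires Send.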

Required Methods

fn id(&self) -> &str

Identifier of the underlying vocabulary.

fn encode(&self, text: &str) -> Vec<u32>

Encode a string to a sequence of token IDs.
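
As a rough illustration of the two required methods, the sketch below implements the trait for a toy type and calls it through a trait object. The trait is redeclared locally so the snippet stands alone. Utf8ByteTokenizer, its "utf8-bytes" identifier, and the describe helper are hypothetical names chosen for the example, not items from this crate.

```rust
// Local copy of the documented trait so this sketch is self-contained.
pub trait ITokenizer: Send {
    fn id(&self) -> &str;
    fn encode(&self, text: &str) -> Vec<u32>;
}

// Hypothetical implementor: maps each UTF-8 byte to one token ID.
pub struct Utf8ByteTokenizer;

impl ITokenizer for Utf8ByteTokenizer {
    // Identifier of the (toy) vocabulary this tokenizer uses.
    fn id(&self) -> &str {
        "utf8-bytes"
    }

    // Encode a string to a sequence of token IDs.
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
}

// Works with any implementor, since it only relies on the trait.
pub fn describe(tokenizer: &dyn ITokenizer, text: &str) -> String {
    let ids = tokenizer.encode(text);
    format!("{}: {} tokens", tokenizer.id(), ids.len())
}
```

Calling describe(&Utf8ByteTokenizer, "hi") goes through dynamic dispatch, so the same helper would accept any other implementor of the trait.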

Implementors