Trait tokenizers::tokenizer::PreTokenizer

pub trait PreTokenizer {
    // Required method
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>;
}

The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of each substring's offsets in the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString so that the offsets, and the mapping back to the original string, can be tracked in full.
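
To make the offset tracking concrete, here is a minimal, standard-library-only sketch of the idea: split on whitespace and remember where each piece sits in the input. It does not use this crate's types. `Piece` and `split_with_offsets` are hypothetical names introduced for illustration, and a real implementor would operate on the `PreTokenizedString` passed to `pre_tokenize` instead.

```rust
/// Hypothetical stand-in for one pre-tokenized piece: the substring
/// plus its byte offsets in the original input (not a crate type).
#[derive(Debug, PartialEq)]
struct Piece<'a> {
    text: &'a str,
    start: usize, // inclusive byte offset into the input
    end: usize,   // exclusive byte offset into the input
}

/// Split `input` on Unicode whitespace, recording the byte range
/// each piece came from. Whitespace itself is dropped.
fn split_with_offsets(input: &str) -> Vec<Piece<'_>> {
    let mut pieces = Vec::new();
    // Byte offset where the current piece began, if we are inside one.
    let mut start: Option<usize> = None;

    for (i, ch) in input.char_indices() {
        if ch.is_whitespace() {
            // Whitespace closes the piece in progress, if any.
            if let Some(s) = start.take() {
                pieces.push(Piece { text: &input[s..i], start: s, end: i });
            }
        } else if start.is_none() {
            start = Some(i);
        }
    }
    // Flush a trailing piece that runs to the end of the input.
    if let Some(s) = start {
        pieces.push(Piece { text: &input[s..], start: s, end: input.len() });
    }
    pieces
}
```

Because each piece carries its own `start..end` range, slicing the original input with that range gives back exactly the piece's text, which is the property the offsets are kept for.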

Required Methods

fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>

Implementors

impl PreTokenizer for PreTokenizerWrapper

impl PreTokenizer for BertPreTokenizer

impl PreTokenizer for ByteLevel

As a PreTokenizer, ByteLevel is in charge of transforming all Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.
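
As a rough illustration of what a "byte-level counterpart" can look like, the sketch below builds the GPT-2-style table that assigns every byte a printable character and applies it to the UTF-8 bytes of a string. This is an assumption about the scheme, not code taken from this crate: `byte_to_char_table` and `to_byte_level` are hypothetical names, and the regex-based splitting is left out entirely.

```rust
/// Build a GPT-2-style table mapping each of the 256 byte values to a
/// printable character. Assumed scheme, not this crate's own code.
fn byte_to_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    // Counts how many non-printable bytes have been remapped so far.
    let mut next = 0u32;

    for b in 0u32..256 {
        // Bytes that already correspond to a visible character keep
        // their own code point.
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);

        table[b as usize] = if printable {
            char::from_u32(b).unwrap()
        } else {
            // Everything else (controls, space, 0xAD, ...) is shifted
            // to code points starting at U+0100, in byte order.
            let c = char::from_u32(256 + next).unwrap();
            next += 1;
            c
        };
    }
    table
}

/// Replace every UTF-8 byte of `input` with its table character.
fn to_byte_level(input: &str) -> String {
    let table = byte_to_char_table();
    input.bytes().map(|b| table[b as usize]).collect()
}
```

Under this table a space becomes `Ġ` (U+0120), and a multi-byte character such as `é` becomes two characters, one per UTF-8 byte, so the output contains no whitespace or control characters.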

impl PreTokenizer for CharDelimiterSplit

impl PreTokenizer for Digits

impl PreTokenizer for Metaspace

impl PreTokenizer for Punctuation

impl PreTokenizer for Sequence

impl PreTokenizer for Split

impl PreTokenizer for UnicodeScripts

impl PreTokenizer for Whitespace

impl PreTokenizer for WhitespaceSplit