Trait tokenizers::tokenizer::PreTokenizer

pub trait PreTokenizer {
    // Required method
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>;
}

The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of the offsets of those substrings relative to the NormalizedString. In some cases, the PreTokenizer may need to modify the given NormalizedString so that the offsets and the mapping to the original string can be tracked in full.
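
The offset bookkeeping described above can be sketched with a small standard-library-only model. Everything named `Toy…` below is a hypothetical stand-in invented for illustration; it is not the crate's `PreTokenizedString`, `NormalizedString`, or `PreTokenizer`, and the error type is simplified to `String`. The sketch only shows the idea of splitting on whitespace while recording where each piece sits in the original input.

```rust
// Toy model only: these types are hypothetical stand-ins, NOT the crate's API.

/// A substring plus its byte offsets (start, end) in the original input.
#[derive(Debug, Clone, PartialEq)]
struct ToySplit {
    text: String,
    offsets: (usize, usize),
}

/// Hypothetical stand-in for a pre-tokenized string:
/// the original text and the current list of splits.
struct ToyPreTokenized {
    original: String,
    splits: Vec<ToySplit>,
}

impl ToyPreTokenized {
    /// Start with a single split that covers the whole input.
    fn new(s: &str) -> Self {
        ToyPreTokenized {
            original: s.to_string(),
            splits: vec![ToySplit { text: s.to_string(), offsets: (0, s.len()) }],
        }
    }
}

/// Hypothetical trait mirroring the shape of the required method.
trait ToyPreTokenizer {
    fn pre_tokenize(&self, pretokenized: &mut ToyPreTokenized) -> Result<(), String>;
}

/// Splits every existing piece on whitespace and drops the whitespace.
struct ToyWhitespaceSplit;

impl ToyPreTokenizer for ToyWhitespaceSplit {
    fn pre_tokenize(&self, pretokenized: &mut ToyPreTokenized) -> Result<(), String> {
        let mut next = Vec::new();
        for split in &pretokenized.splits {
            // Offsets inside `split.text` are shifted by where the piece starts.
            let base = split.offsets.0;
            let mut start: Option<usize> = None;
            for (i, c) in split.text.char_indices() {
                if c.is_whitespace() {
                    if let Some(s) = start.take() {
                        next.push(ToySplit {
                            text: split.text[s..i].to_string(),
                            offsets: (base + s, base + i),
                        });
                    }
                } else if start.is_none() {
                    start = Some(i);
                }
            }
            // Flush the trailing piece, if any.
            if let Some(s) = start {
                next.push(ToySplit {
                    text: split.text[s..].to_string(),
                    offsets: (base + s, base + split.text.len()),
                });
            }
        }
        pretokenized.splits = next;
        Ok(())
    }
}

fn main() {
    let mut p = ToyPreTokenized::new("Hello  wörld!");
    ToyWhitespaceSplit.pre_tokenize(&mut p).unwrap();
    for s in &p.splits {
        // Each offset pair slices the original string back to the same text.
        assert_eq!(&p.original[s.offsets.0..s.offsets.1], s.text);
        println!("{:?} {:?}", s.text, s.offsets);
    }
}
```

In this toy model the offsets are byte positions, so the two pieces of `"Hello  wörld!"` land at `(0, 5)` and `(7, 14)`; the second piece is seven bytes long because `ö` takes two bytes in UTF-8.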

Required Methods§

fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()>

Implementors§

impl PreTokenizer for PreTokenizerWrapper

impl PreTokenizer for BertPreTokenizer

impl PreTokenizer for ByteLevel

As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.

impl PreTokenizer for CharDelimiterSplit

impl PreTokenizer for Digits

impl PreTokenizer for Metaspace

impl PreTokenizer for Punctuation

impl PreTokenizer for Sequence

impl PreTokenizer for Split

impl PreTokenizer for UnicodeScripts

impl PreTokenizer for Whitespace

impl PreTokenizer for WhitespaceSplit
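
The byte-level transformation mentioned for ByteLevel in the list above can be illustrated with a standalone sketch. It assumes the GPT-2 style byte-to-character table, in which printable bytes map to themselves and the remaining bytes are shifted to code points from U+0100 upward. That choice of table is an assumption here, the function names `bytes_to_chars` and `to_byte_level` are hypothetical, and the regex-based splitting is left out.

```rust
use std::collections::HashMap;

/// Build a GPT-2 style byte -> char table.
/// Assumption: this is the scheme the byte-level step follows.
fn bytes_to_chars() -> HashMap<u8, char> {
    let mut table = HashMap::new();
    let mut n: u32 = 0;
    for b in 0u16..=255 {
        let b = b as u8;
        // Bytes that already correspond to a visible Latin-1 character.
        let printable = (b'!'..=b'~').contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let c = if printable {
            char::from(b)
        } else {
            // Remaining bytes get fresh code points starting at U+0100.
            let c = char::from_u32(256 + n).unwrap();
            n += 1;
            c
        };
        table.insert(b, c);
    }
    table
}

/// Re-express a string as one visible character per UTF-8 byte.
fn to_byte_level(s: &str) -> String {
    let table = bytes_to_chars();
    s.bytes().map(|b| table[&b]).collect()
}

fn main() {
    println!("{}", to_byte_level(" héllo"));
}
```

Under this table a leading space becomes `Ġ` and the two UTF-8 bytes of `é` become `Ã` and `©`, so every input byte has exactly one visible stand-in character.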
