Trait tokenizers::tokenizer::PreTokenizer

pub trait PreTokenizer: Send + Sync {
    fn pre_tokenize(
        &self,
        normalized: &mut NormalizedString
    ) -> Result<Vec<(String, Offsets)>>;
}

The PreTokenizer is in charge of the pre-segmentation step. It splits the given string into multiple substrings, keeping track of each substring's offsets within the NormalizedString. On some occasions, the PreTokenizer might need to modify the given NormalizedString so that the offsets, and the mapping back to the original string, can be tracked in full.
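To make the "substring plus offsets" output concrete, here is a minimal sketch that does not use the crate at all. It assumes `Offsets` is a `(start, end)` pair and works on a plain `&str` instead of a `NormalizedString`; the function name `split_with_offsets` is hypothetical, and the choice of byte offsets is an assumption of this sketch, not a statement about what the crate records.

```rust
/// Stand-in for the crate's `Offsets` type: assumed here to be a
/// (start, end) pair. This sketch uses byte positions.
pub type Offsets = (usize, usize);

/// Hypothetical helper, not part of the `tokenizers` API: splits `text` on
/// whitespace and records where each piece sits inside `text`.
pub fn split_with_offsets(text: &str) -> Vec<(String, Offsets)> {
    let mut pieces = Vec::new();
    // Start of the piece currently being scanned, if any.
    let mut start: Option<usize> = None;

    for (i, c) in text.char_indices() {
        if c.is_whitespace() {
            // Whitespace closes the current piece, if one is open.
            if let Some(s) = start.take() {
                pieces.push((text[s..i].to_string(), (s, i)));
            }
        } else if start.is_none() {
            // First non-whitespace character opens a new piece.
            start = Some(i);
        }
    }

    // Flush a piece that runs to the end of the input.
    if let Some(s) = start {
        pieces.push((text[s..].to_string(), (s, text.len())));
    }
    pieces
}
```

Calling `split_with_offsets("Hello  world")` yields two pieces, and slicing the input with each returned pair gives back the piece itself, which is the property the offsets exist to preserve.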

Required methods

fn pre_tokenize(
    &self,
    normalized: &mut NormalizedString
) -> Result<Vec<(String, Offsets)>>

Implementors

impl PreTokenizer for BertPreTokenizer

impl PreTokenizer for ByteLevel

As a PreTokenizer, ByteLevel is in charge of transforming all the Unicode characters into their byte-level counterparts. It also splits the input according to the configured regex.
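The byte-level transformation can be illustrated with the byte-to-character table popularized by GPT-2. The sketch below is self-contained and assumes ByteLevel follows that convention; this page does not confirm it, and the function names `bytes_to_chars` and `to_byte_level` are hypothetical. The regex-based splitting is not covered.

```rust
/// Builds a 256-entry table mapping every byte to a printable `char`,
/// following the GPT-2 convention (an assumption, not taken from this page).
/// Bytes that are already printable map to the code point of the same value;
/// the rest are shifted to code points starting at U+0100.
pub fn bytes_to_chars() -> Vec<char> {
    let printable = |b: u8| matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF);
    let mut table = Vec::with_capacity(256);
    let mut extra: u32 = 0; // number of non-printable bytes seen so far

    for b in 0u8..=255 {
        if printable(b) {
            table.push(char::from(b));
        } else {
            table.push(char::from_u32(256 + extra).expect("valid scalar value"));
            extra += 1;
        }
    }
    table
}

/// Hypothetical helper: rewrites `text` so that each UTF-8 byte becomes
/// exactly one visible character from the table above.
pub fn to_byte_level(text: &str) -> String {
    let table = bytes_to_chars();
    text.bytes().map(|b| table[b as usize]).collect()
}
```

Under this table a space becomes `Ġ` and a multi-byte character such as `é` becomes two characters, one per UTF-8 byte, so every input can be represented with a fixed alphabet of 256 symbols.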

impl PreTokenizer for CharDelimiterSplit

impl PreTokenizer for Metaspace

impl PreTokenizer for Whitespace

impl PreTokenizer for WhitespaceSplit
