Module vtext::tokenize

Tokenization module

This module includes several tokenizers.

For instance, let's tokenize the following sentence:

use vtext::tokenize::*;

let s = "The “brown” fox can't jump 32.3 feet, right?";

Using a regular expression tokenizer, we would get:

let tokenizer = RegexpTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "brown", "fox", "can", "jump", "32", "feet", "right"]);
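For illustration, assuming the default pattern is something like `r"\b\w\w+\b"`, its behavior amounts to extracting runs of word characters of length two or more. A standard-library-only sketch (not vtext's actual implementation):

```rust
// Std-only sketch of what a pattern like r"\b\w\w+\b" extracts:
// runs of word characters (alphanumeric or '_') of length >= 2.
// This is an illustration, not vtext's actual implementation.
fn word_runs(s: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in s.char_indices() {
        let is_word = c.is_alphanumeric() || c == '_';
        match (is_word, start) {
            (true, None) => start = Some(i), // a run begins
            (false, Some(j)) => {
                // a run ends: keep it if it has at least two characters
                if s[j..i].chars().count() >= 2 {
                    tokens.push(&s[j..i]);
                }
                start = None;
            }
            _ => {}
        }
    }
    if let Some(j) = start {
        // a run reaching the end of the string
        if s[j..].chars().count() >= 2 {
            tokens.push(&s[j..]);
        }
    }
    tokens
}

fn main() {
    let s = "The “brown” fox can't jump 32.3 feet, right?";
    assert_eq!(
        word_runs(s),
        ["The", "brown", "fox", "can", "jump", "32", "feet", "right"]
    );
}
```

Note how "can't" loses its trailing "t" and "32.3" its trailing "3": single word characters stranded by punctuation fall below the two-character minimum.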

This removes all punctuation. A more general approach is to apply Unicode segmentation:

let tokenizer = UnicodeWordTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "can't", "jump", "32.3", "feet", ",", "right", "?"]);

Here, the UnicodeWordTokenizer is a thin wrapper around the unicode-segmentation crate.

This approach produces better results; however, some cases still need language-specific handling. In English, for instance, the word "can't" should be tokenized as "ca", "n't". To address such issues, we apply several additional rules to the previous results:

let tokenizer = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "ca", "n't", "jump", "32.3", "feet", ",", "right", "?"]);
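These language-specific rules are post-processing steps applied to the segmented tokens. A hypothetical sketch of one such rule, splitting off "n't" contractions (the rule and names here are illustrative only, not vtext's actual rule set):

```rust
// Hypothetical sketch of one English post-processing rule: split the
// "n't" contraction off a token, so "can't" -> "ca" + "n't".
// Illustrative only; not vtext's actual rules.
fn split_nt(token: &str) -> Vec<&str> {
    if token.len() > 3 && token.ends_with("n't") {
        let cut = token.len() - 3; // "n't" is 3 ASCII bytes
        vec![&token[..cut], &token[cut..]]
    } else {
        vec![token]
    }
}

fn main() {
    // Apply the rule to already-segmented tokens.
    let segmented = ["The", "fox", "can't", "jump"];
    let tokens: Vec<&str> = segmented.iter().copied().flat_map(split_nt).collect();
    assert_eq!(tokens, ["The", "fox", "ca", "n't", "jump"]);
}
```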

Structs

CharacterTokenizer

Character tokenizer

CharacterTokenizerParams

Builder for the character tokenizer

RegexpTokenizer

Regular expression tokenizer

RegexpTokenizerParams

Builder for the regexp tokenizer

UnicodeWordTokenizer

Unicode Segmentation tokenizer

UnicodeWordTokenizerParams

Builder for the unicode segmentation tokenizer

VTextTokenizer

vtext tokenizer

VTextTokenizerParams

Builder for the VTextTokenizer

Traits

Tokenizer
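
All of the tokenizers above implement the Tokenizer trait. As a rough sketch of its shape (the actual vtext signature may differ), a tokenizer is anything that maps text to an iterator of borrowed tokens; WhitespaceTokenizer below is a toy implementation for illustration:

```rust
// Hypothetical sketch of the Tokenizer trait's shape; the actual vtext
// signature may differ. WhitespaceTokenizer is a toy implementation.
trait Tokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(text.split_whitespace())
    }
}

fn main() {
    let tokenizer = WhitespaceTokenizer;
    let tokens: Vec<&str> = tokenizer.tokenize("brown fox").collect();
    assert_eq!(tokens, ["brown", "fox"]);
}
```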