Module vtext::tokenize

Tokenization module

This module includes several tokenizers.

For instance, let's tokenize the following sentence:

use vtext::tokenize::*;

let s = "The “brown” fox can't jump 32.3 feet, right?";

Using a regular expression tokenizer, we would get:

let tokenizer = RegexpTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "brown", "fox", "can", "jump", "32", "feet", "right"]);
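For illustration, assuming the default pattern is something like `r"\b\w\w+\b"`, its behavior amounts to extracting runs of word characters of length two or more. A standard-library-only sketch (not vtext's actual implementation):

```rust
// Std-only sketch of what a pattern like r"\b\w\w+\b" extracts:
// runs of word characters (alphanumeric or '_') of length >= 2.
// This is an illustration, not vtext's actual implementation.
fn word_runs(s: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (i, c) in s.char_indices() {
        let is_word = c.is_alphanumeric() || c == '_';
        match (is_word, start) {
            (true, None) => start = Some(i), // a run begins
            (false, Some(j)) => {
                // a run ends: keep it if it has at least two characters
                if s[j..i].chars().count() >= 2 {
                    tokens.push(&s[j..i]);
                }
                start = None;
            }
            _ => {}
        }
    }
    if let Some(j) = start {
        // a run reaching the end of the string
        if s[j..].chars().count() >= 2 {
            tokens.push(&s[j..]);
        }
    }
    tokens
}

fn main() {
    let s = "The “brown” fox can't jump 32.3 feet, right?";
    assert_eq!(
        word_runs(s),
        ["The", "brown", "fox", "can", "jump", "32", "feet", "right"]
    );
}
```

Note how "can't" loses its trailing "t" and "32.3" its trailing "3": single word characters stranded by punctuation fall below the two-character minimum.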

This removes all punctuation. A more general approach is to apply Unicode segmentation:

let tokenizer = UnicodeWordTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "can't", "jump", "32.3", "feet", ",", "right", "?"]);

Here, the UnicodeWordTokenizer is a thin wrapper around the unicode-segmentation crate.

This approach produces better results; however, some cases still need language-specific handling. In English, for instance, the word "can't" should be tokenized as "ca", "n't". To address such issues, we apply several additional rules to the previous results:

let tokenizer = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "ca", "n't", "jump", "32.3", "feet", ",", "right", "?"]);
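These language-specific rules are post-processing steps applied to the segmented tokens. A hypothetical sketch of one such rule, splitting off "n't" contractions (the rule and names here are illustrative only, not vtext's actual rule set):

```rust
// Hypothetical sketch of one English post-processing rule: split the
// "n't" contraction off a token, so "can't" -> "ca" + "n't".
// Illustrative only; not vtext's actual rules.
fn split_nt(token: &str) -> Vec<&str> {
    if token.len() > 3 && token.ends_with("n't") {
        let cut = token.len() - 3; // "n't" is 3 ASCII bytes
        vec![&token[..cut], &token[cut..]]
    } else {
        vec![token]
    }
}

fn main() {
    // Apply the rule to already-segmented tokens.
    let segmented = ["The", "fox", "can't", "jump"];
    let tokens: Vec<&str> = segmented.iter().copied().flat_map(split_nt).collect();
    assert_eq!(tokens, ["The", "fox", "ca", "n't", "jump"]);
}
```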

Structs

CharacterTokenizer

Character tokenizer

CharacterTokenizerParams

Builder for the character tokenizer

RegexpTokenizer

Regular expression tokenizer

RegexpTokenizerParams

Builder for the regexp tokenizer

UnicodeWordTokenizer

Unicode Segmentation tokenizer

UnicodeWordTokenizerParams

Builder for the unicode segmentation tokenizer

VTextTokenizer

vtext tokenizer

VTextTokenizerParams

Builder for the VTextTokenizer

Traits

Tokenizer
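
All of the tokenizers above implement the Tokenizer trait. As a rough sketch of its shape (the actual vtext signature may differ), a tokenizer is anything that maps text to an iterator of borrowed tokens; WhitespaceTokenizer below is a toy implementation for illustration:

```rust
// Hypothetical sketch of the Tokenizer trait's shape; the actual vtext
// signature may differ. WhitespaceTokenizer is a toy implementation.
trait Tokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a>;
}

struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize<'a>(&self, text: &'a str) -> Box<dyn Iterator<Item = &'a str> + 'a> {
        Box::new(text.split_whitespace())
    }
}

fn main() {
    let tokenizer = WhitespaceTokenizer;
    let tokens: Vec<&str> = tokenizer.tokenize("brown fox").collect();
    assert_eq!(tokens, ["brown", "fox"]);
}
```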