Module vtext::tokenize
Tokenization module
This module includes several tokenizers.
For instance, let's tokenize the following sentence:
```rust
use vtext::tokenize::*;

let s = "The “brown” fox can't jump 32.3 feet, right?";
```
Using a regular expression tokenizer, we would get:
```rust
let tokenizer = RegexpTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "brown", "fox", "can", "jump", "32", "feet", "right"]);
```
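For illustration only (this is a sketch, not vtext's actual implementation), the effect of the default pattern, which keeps only alphanumeric runs of at least two characters and drops everything else, can be approximated in plain std Rust:

```rust
// Illustrative approximation of the regexp tokenizer's default behavior:
// split on non-alphanumeric characters and keep tokens of length >= 2.
// This is a sketch, not vtext's actual implementation.
fn naive_word_split(s: &str) -> Vec<&str> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|t| t.chars().count() >= 2)
        .collect()
}

fn main() {
    let s = "The “brown” fox can't jump 32.3 feet, right?";
    println!("{:?}", naive_word_split(s));
}
```

Note how "can't" becomes "can" and "32.3" becomes "32": splitting on non-alphanumeric characters discards apostrophes and decimal points along with punctuation.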
This approach removes all punctuation. A more general alternative is to apply Unicode segmentation:
```rust
let tokenizer = UnicodeWordTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "can't", "jump", "32.3", "feet", ",", "right", "?"]);
```
Here the UnicodeWordTokenizer is a thin wrapper around the unicode-segmentation crate.
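A rough, std-only sketch of what word-boundary segmentation produces (punctuation kept as separate tokens, mid-word apostrophes and decimal points left intact); the real UnicodeWordTokenizer instead applies the full Unicode (UAX #29) word-boundary rules via unicode-segmentation:

```rust
// Rough std-only sketch of word-boundary segmentation: alphanumeric runs
// become tokens (mid-word apostrophes and decimal points are kept), and
// each punctuation character becomes its own token. The real
// UnicodeWordTokenizer applies the full Unicode (UAX #29) rules instead.
fn naive_segment(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i].is_whitespace() {
            i += 1;
        } else if chars[i].is_alphanumeric() {
            let start = i;
            while i < chars.len() {
                if chars[i].is_alphanumeric() {
                    i += 1;
                } else if (chars[i] == '\'' || chars[i] == '.')
                    && i + 1 < chars.len()
                    && chars[i + 1].is_alphanumeric()
                {
                    // keep "can't" and "32.3" together
                    i += 2;
                } else {
                    break;
                }
            }
            tokens.push(chars[start..i].iter().collect());
        } else {
            // punctuation: emit as a standalone token
            tokens.push(chars[i].to_string());
            i += 1;
        }
    }
    tokens
}

fn main() {
    let s = "The “brown” fox can't jump 32.3 feet, right?";
    println!("{:?}", naive_segment(s));
}
```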
This approach produces better results; however, in English the word "can't" should be tokenized as "ca", "n't". To address such issues, we apply several additional rules to the previous result:
```rust
let tokenizer = VTextTokenizerParams::default().lang("en").build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["The", "“", "brown", "”", "fox", "ca", "n't", "jump", "32.3", "feet", ",", "right", "?"]);
```
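One such English-specific rule, splitting negation contractions, can be sketched as a post-processing pass over already segmented tokens. This is a simplified illustration, not vtext's actual rule set:

```rust
// Sketch of one English-specific post-processing rule: split negation
// contractions such as "can't" into "ca" + "n't". This is a simplified
// illustration, not vtext's actual rule set.
fn split_contractions<'a>(tokens: Vec<&'a str>) -> Vec<&'a str> {
    let mut out = Vec::new();
    for t in tokens {
        if t.len() > 3 && t.ends_with("n't") {
            // "n't" is 3 ASCII bytes, so this split lands on a char boundary
            let (stem, neg) = t.split_at(t.len() - 3);
            out.push(stem);
            out.push(neg);
        } else {
            out.push(t);
        }
    }
    out
}

fn main() {
    let tokens = vec!["fox", "can't", "jump"];
    println!("{:?}", split_contractions(tokens));
}
```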
Structs
| Struct | Description |
|---|---|
| CharacterTokenizer | Character tokenizer |
| CharacterTokenizerParams | Builder for the character tokenizer |
| RegexpTokenizer | Regular expression tokenizer |
| RegexpTokenizerParams | Builder for the regexp tokenizer |
| UnicodeWordTokenizer | Unicode segmentation tokenizer |
| UnicodeWordTokenizerParams | Builder for the Unicode segmentation tokenizer |
| VTextTokenizer | vtext tokenizer |
| VTextTokenizerParams | Builder for the VTextTokenizer |
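As an illustration of the kind of output a character tokenizer produces, character n-gram extraction can be sketched in std Rust; the window size below is an assumption for the example, not CharacterTokenizer's documented default:

```rust
// Sketch of character n-gram extraction, the kind of output a character
// tokenizer produces. The window size here is illustrative; it is not
// necessarily CharacterTokenizer's default.
fn char_ngrams(s: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    // windows(n) yields every contiguous run of n characters
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

fn main() {
    println!("{:?}", char_ngrams("fox jumps", 4));
}
```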
Traits

| Trait | Description |
|---|---|
| Tokenizer | |