Module vtext::tokenize_sentence

Sentence tokenization module

For instance, let's tokenize the following text:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

Using the Unicode sentence tokenizer, we would get:

let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);

Here, UnicodeSentenceTokenizer is a thin wrapper around the unicode-segmentation crate.

Using the punctuation sentence tokenizer, we would get:

let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]);

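Notice that "Bang!!" is treated differently: the Unicode tokenizer keeps "Bang!! " as a single sentence, while the punctuation tokenizer splits after the first "!", producing "Bang!" and "! ".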
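
The punctuation-based splitting shown above can be approximated in plain Rust without the crate. The following is a minimal sketch, not vtext's actual implementation: the function name `split_on_punctuation` is hypothetical, and the punctuation set passed to it (for example `['.', '?', '!']`) is an assumption inferred from the example output rather than a documented default. It cuts after every punctuation character and keeps any whitespace that follows attached to the sentence it ends.

```rust
/// Hypothetical helper (not part of vtext): split `text` after every
/// character found in `punctuation`. Whitespace that follows a punctuation
/// mark stays with the sentence that the mark terminates.
fn split_on_punctuation<'a>(text: &'a str, punctuation: &[char]) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    // Byte offset where the current sentence begins.
    let mut start = 0;
    // True once the current sentence has reached a punctuation mark.
    let mut after_punct = false;

    for (idx, ch) in text.char_indices() {
        // The first non-whitespace character after a punctuation mark
        // starts a new sentence, so close the current one here.
        if after_punct && !ch.is_whitespace() {
            sentences.push(&text[start..idx]);
            start = idx;
            after_punct = false;
        }
        if punctuation.contains(&ch) {
            after_punct = true;
        }
    }

    // Whatever is left (possibly without closing punctuation) is the last sentence.
    if start < text.len() {
        sentences.push(&text[start..]);
    }
    sentences
}
```

Called as `split_on_punctuation(s, &['.', '?', '!'])` on the sample text, this sketch yields the same five pieces as the assertion above, including the separate "Bang!" and "! " entries, because the second "!" is a non-whitespace character that arrives right after a punctuation mark.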

You can easily customise the PunctuationTokenizer to work with other languages. For example,

use vtext::vecString;

let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];

let tokenizer = PunctuationTokenizerParams::default()
    .punctuation(punctuation)
    .build()
    .unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);

Refer to the test cases for further language examples.

Structs

PunctuationTokenizer	Punctuation sentence tokenizer
PunctuationTokenizerParams	Builder for the punctuation sentence tokenizer
UnicodeSentenceTokenizer	Unicode sentence tokenizer
UnicodeSentenceTokenizerParams	Builder for the Unicode sentence tokenizer