Module vtext::tokenize_sentence

Sentence tokenization module

For instance, let's tokenize the following text:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

Using the Unicode sentence tokenizer we would get:


let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);

Here the UnicodeSentenceTokenizer object is a thin wrapper around the unicode-segmentation crate.
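
To see what "thin wrapper" means in practice, here is a minimal sketch that calls the unicode-segmentation crate directly; it assumes unicode-segmentation is added as a dependency and that the wrapper applies no post-processing of its own.

use unicode_segmentation::UnicodeSegmentation;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

// split_sentence_bounds yields sentence slices with trailing whitespace attached,
// which is what UnicodeSentenceTokenizer exposes through the Tokenizer trait.
let sentences: Vec<&str> = s.split_sentence_bounds().collect();
assert_eq!(sentences, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);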

Using the punctuation sentence tokenizer we would get:


let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]);

Notice that "Bang!!" is treated differently: the punctuation tokenizer ends a sentence after each punctuation character (optionally followed by whitespace), so the second "!" becomes its own token.

You can easily customise the PunctuationTokenizer to work with other languages. For example:

use vtext::vecString;

let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];

let tokenizer = PunctuationTokenizerParams::default().punctuation(punctuation).build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);

Refer to the test cases for further language examples.
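
Both tokenizers are configured through their *Params builders. The following is a rough sketch only; it assumes UnicodeSentenceTokenizerParams exposes the same default()/build() pattern as PunctuationTokenizerParams used above.

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

// Assumed builder pattern: default parameters, then build() returning a Result.
let tokenizer = UnicodeSentenceTokenizerParams::default().build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize("One sentence. Another one.").collect();
assert_eq!(tokens.len(), 2);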

Structs

PunctuationTokenizer

Punctuation sentence tokenizer

PunctuationTokenizerParams

Builder for the punctuation sentence tokenizer

UnicodeSentenceTokenizer

Unicode sentence tokenizer

UnicodeSentenceTokenizerParams

Builder for the Unicode sentence tokenizer