Module tokenize_sentence

Sentence tokenization module

For instance, let’s tokenize the following text:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

Using the Unicode sentence tokenizer, we would get:


let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);

Here the UnicodeSentenceTokenizer object is a thin wrapper around the unicode-segmentation crate.
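
To see what that means in practice, here is a minimal sketch that calls the unicode-segmentation crate directly (this assumes unicode-segmentation is declared as a dependency of your own crate; which entry point vtext uses internally is an implementation detail):

use unicode_segmentation::UnicodeSegmentation;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

// UAX #29 sentence boundaries; trailing whitespace stays attached to each sentence
let sentences: Vec<&str> = s.split_sentence_bounds().collect();
assert_eq!(sentences, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);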

Using the Punctuation sentence tokenizer, we would get:


let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]);

Notice that “Bang!!” is treated differently: the punctuation tokenizer splits after each sentence-terminating character, so “Bang!” and “! ” become separate tokens.

You can easily customise the PunctuationTokenizer to work with other languages. For example:

use vtext::vecString;

let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];

let tokenizer = PunctuationTokenizerParams::default().punctuation(punctuation).build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);

Refer to the test cases for further language examples.

Structs

PunctuationTokenizer
Punctuation sentence tokenizer
PunctuationTokenizerParams
Builder for the punctuation sentence tokenizer
UnicodeSentenceTokenizer
Unicode sentence tokenizer
UnicodeSentenceTokenizerParams
Builder for the Unicode sentence tokenizer
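
Both tokenizers can also be constructed through their Params builders, following the same pattern as the PunctuationTokenizerParams example above. A minimal sketch, assuming UnicodeSentenceTokenizerParams exposes the same default()/build() chain:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

// Build the Unicode sentence tokenizer with its default parameters
let tokenizer = UnicodeSentenceTokenizerParams::default().build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize("One sentence. Another one.").collect();
// Trailing whitespace stays attached, as in the examples above
assert_eq!(tokens, &["One sentence. ", "Another one."]);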