Module tokenize_sentence

Sentence tokenization module

For instance, let’s tokenize the following text:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

Using the Unicode sentence tokenizer, we would get:


let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);

Here the UnicodeSentenceTokenizer object is a thin wrapper around the unicode-segmentation crate.
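
To see what that means in practice, here is a minimal sketch that calls the unicode-segmentation crate directly (this assumes unicode-segmentation is declared as a dependency of your own crate; which entry point vtext uses internally is an implementation detail):

use unicode_segmentation::UnicodeSegmentation;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";

// UAX #29 sentence boundaries; trailing whitespace stays attached to each sentence
let sentences: Vec<&str> = s.split_sentence_bounds().collect();
assert_eq!(sentences, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);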

Using the Punctuation sentence tokenizer, we would get:


let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]);

Notice that “Bang!!” is treated differently: the punctuation tokenizer splits after each sentence-terminating character, so “Bang!” and “! ” become separate tokens.

You can easily customise the PunctuationTokenizer to work with other languages. For example:

use vtext::vecString;

let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];

let tokenizer = PunctuationTokenizerParams::default().punctuation(punctuation).build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);

Refer to the test cases for further language examples.

Structs

PunctuationTokenizer
Punctuation sentence tokenizer
PunctuationTokenizerParams
Builder for the punctuation sentence tokenizer
UnicodeSentenceTokenizer
Unicode sentence tokenizer
UnicodeSentenceTokenizerParams
Builder for the Unicode sentence tokenizer
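
Both tokenizers can also be constructed through their Params builders, following the same pattern as the PunctuationTokenizerParams example above. A minimal sketch, assuming UnicodeSentenceTokenizerParams exposes the same default()/build() chain:

use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

// Build the Unicode sentence tokenizer with its default parameters
let tokenizer = UnicodeSentenceTokenizerParams::default().build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize("One sentence. Another one.").collect();
// Trailing whitespace stays attached, as in the examples above
assert_eq!(tokens, &["One sentence. ", "Another one."]);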