Module vtext::tokenize_sentence
Sentence tokenization module
For instance, let's tokenize the following text:
use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;

let s = "Here is one. Here is another? Bang!! This trailing text is one more";
Using the Unicode sentence tokenizer, we would get:
let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(
    tokens,
    &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]
);
Here, the UnicodeSentenceTokenizer object is a thin wrapper around the
unicode-segmentation crate.
Using the Punctuation sentence tokenizer, we would get:
let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(
    tokens,
    &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]
);
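Notice that "Bang!!" is treated differently: the Unicode tokenizer keeps "Bang!! " as a single sentence, whereas the punctuation tokenizer splits it into "Bang!" and "! ".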
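To make the difference concrete, the following is a minimal standard-library sketch of a punctuation-based splitter that reproduces the output shown above. It is not the vtext implementation: the function name `split_sentences` is hypothetical, and the punctuation set `['.', '?', '!']` is an assumption chosen to match the example. It ends a sentence after every punctuation character and attaches any whitespace that directly follows it.

```rust
/// Hypothetical helper (not part of vtext): split `text` after every
/// character found in `punctuation`, keeping any whitespace that directly
/// follows the punctuation mark attached to the sentence it closes.
fn split_sentences<'a>(text: &'a str, punctuation: &[char]) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current sentence begins
    let mut chars = text.char_indices().peekable();

    while let Some((idx, ch)) = chars.next() {
        if !punctuation.contains(&ch) {
            continue;
        }
        // Byte offset just past the punctuation character itself.
        let mut end = idx + ch.len_utf8();
        // Absorb whitespace that immediately follows the punctuation mark.
        while let Some(&(ws_idx, ws)) = chars.peek() {
            if !ws.is_whitespace() {
                break;
            }
            end = ws_idx + ws.len_utf8();
            chars.next();
        }
        sentences.push(&text[start..end]);
        start = end;
    }

    // Whatever remains after the last punctuation mark is the final sentence.
    if start < text.len() {
        sentences.push(&text[start..]);
    }
    sentences
}

fn main() {
    let s = "Here is one. Here is another? Bang!! This trailing text is one more";
    // Assumed punctuation set; the library's own default may differ.
    let tokens = split_sentences(s, &['.', '?', '!']);
    assert_eq!(
        tokens,
        vec![
            "Here is one. ",
            "Here is another? ",
            "Bang!",
            "! ",
            "This trailing text is one more",
        ]
    );
}
```

Because each punctuation character closes a sentence on its own, the second "!" in "Bang!!" starts and ends a new sentence. A rule-based segmenter that treats a run of terminators as one boundary would not do this.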
You can easily customise the PunctuationTokenizer
to work with other languages by supplying their sentence-ending punctuation. For example:
use vtext::vecString;

let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];
let tokenizer = PunctuationTokenizerParams::default()
    .punctuation(punctuation)
    .build()
    .unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);
Refer to the test cases for further language examples.
Structs
PunctuationTokenizer | Punctuation sentence tokenizer |
PunctuationTokenizerParams | Builder for the punctuation sentence tokenizer |
UnicodeSentenceTokenizer | Unicode sentence tokenizer |
UnicodeSentenceTokenizerParams | Builder for the unicode segmentation tokenizer |