§Sentence tokenization module
For instance, let's tokenize the following text:
use vtext::tokenize::Tokenizer;
use vtext::tokenize_sentence::*;
let s = "Here is one. Here is another? Bang!! This trailing text is one more";
Using the Unicode sentence tokenizer, we would get:
let tokenizer = UnicodeSentenceTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!! ", "This trailing text is one more"]);
Here, the UnicodeSentenceTokenizerParams object is a thin wrapper around the unicode-segmentation crate.
Using the punctuation sentence tokenizer, we would get:
let tokenizer = PunctuationTokenizer::default();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
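assert_eq!(tokens, &["Here is one. ", "Here is another? ", "Bang!", "! ", "This trailing text is one more"]);
Notice that “Bang!!” is treated differently: the punctuation tokenizer splits after each punctuation mark, producing "Bang!" and "! ", whereas the Unicode tokenizer keeps "Bang!! " as a single sentence.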
You can easily customise the PunctuationTokenizer to work with other languages. For example,
use vtext::vecString;
let s = "বৃহত্তম ভাষা। বাংলা";
let punctuation = vecString!['।'];
let tokenizer = PunctuationTokenizerParams::default().punctuation(punctuation).build().unwrap();
let tokens: Vec<&str> = tokenizer.tokenize(s).collect();
assert_eq!(tokens, &["বৃহত্তম ভাষা। ", "বাংলা"]);
Refer to the test cases for further language examples.
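To make the splitting rule behind the two punctuation examples concrete, here is a minimal, std-only sketch that reproduces the outputs shown above. It is not vtext's implementation: the helper name `split_on_punctuation` and its signature are hypothetical, and the rule it encodes (cut after every punctuation character, keeping any whitespace that follows in the same chunk) is an assumption inferred from the example outputs.

```rust
/// Hypothetical helper (not part of vtext): cut `text` after every character
/// listed in `punctuation`. Whitespace directly following a punctuation
/// character stays attached to the chunk that ends with it.
fn split_on_punctuation<'a>(text: &'a str, punctuation: &[char]) -> Vec<&'a str> {
    let mut sentences = Vec::new();
    let mut start = 0; // byte offset where the current chunk begins
    let mut chars = text.char_indices().peekable();

    while let Some((idx, ch)) = chars.next() {
        if !punctuation.contains(&ch) {
            continue;
        }
        // The chunk ends just past the punctuation character...
        let mut end = idx + ch.len_utf8();
        // ...plus any run of whitespace that immediately follows it.
        while let Some(&(ws_idx, ws)) = chars.peek() {
            if !ws.is_whitespace() {
                break;
            }
            end = ws_idx + ws.len_utf8();
            chars.next();
        }
        sentences.push(&text[start..end]);
        start = end;
    }

    // Any trailing text without closing punctuation forms the last chunk.
    if start < text.len() {
        sentences.push(&text[start..]);
    }
    sentences
}
```

Calling `split_on_punctuation(s, &['.', '?', '!'])` on the English text cuts "Bang!!" twice, because the second '!' is itself a punctuation character rather than trailing whitespace. Passing `&['।']` handles the Bengali example in the same way.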
Structs§
- PunctuationTokenizer - Punctuation sentence tokenizer
- PunctuationTokenizerParams - Builder for the punctuation sentence tokenizer
- UnicodeSentenceTokenizer - Unicode sentence tokenizer
- UnicodeSentenceTokenizerParams - Builder for the unicode segmentation tokenizer