Charabia library tokenize a text detecting the Script/Language, segmenting, normalizing, and classifying it.
Examples
Tokenization
use Tokenize;
let orig = "Thé quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
// tokenize the text.
let mut tokens = orig.tokenize;
let token = tokens.next.unwrap;
// the lemma into the token is normalized: `Thé` became `the`.
assert_eq!;
// token is classfied as a word
assert!;
let token = tokens.next.unwrap;
assert_eq!;
// token is classfied as a separator
assert!;
Segmentation
use Segment;
let orig = "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F!";
let mut segments = orig.segment_str;
assert_eq!;
assert_eq!;
assert_eq!;
Build features
Charabia comes with default features that can be deactivated at compile time,
this features are additional Language supports that need to download and/or build a specialized dictionary that impact the compilation time.
Theses features are listed in charabia's cargo.toml and can be deactivated via dependency features.