language-tokenizer
Overview
language-tokenizer is a convenience wrapper around various Unicode and Natural Language Processing libraries used for analyzing, segmenting and tokenizing text.
The main purpose of this library is to tokenize text and use the resulting tokens for matching.
For Indo-European languages, as well as Arabic, Indonesian and others, it uses a custom normalization algorithm combined with the battle-tested Snowball stemmer.
For CJK languages, it uses either the lindera crate, which provides multiple dictionaries for Chinese, Japanese and Korean, or ICU dictionary segmentation.
For processing Southeast Asian languages, it uses ICU LSTM segmentation.
Example
Tokenizing text is as simple as:
use ;
let text = "that's someone who can rizz just like a skibidi! zoomer slang rocks, 67";
let tokens = tokenize.unwrap;
assert_eq!
Matching text is also built in.
use ;
let haystack = "that's someone who can rizz just like a skibidi! zoomer slang rocks, 67";
let needle = "like a skibidi";
let haystack = tokenize.unwrap;
let needle = tokenize.unwrap;
assert!;
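To illustrate what matching a tokenized needle against a tokenized haystack can look like, here is a minimal, standalone sketch. It does not use this crate's API: `simple_tokens` and `contains_tokens` are hypothetical helpers, the tokenizer is a naive lowercase-and-split stand-in (no stemming or dictionary segmentation), and the library's actual matching logic may differ.

```rust
/// Hypothetical stand-in tokenizer (NOT language-tokenizer's algorithm):
/// lowercases the text and splits on every non-alphanumeric character.
fn simple_tokens(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|s| !s.is_empty())
        .map(|s| s.to_lowercase())
        .collect()
}

/// Hypothetical matcher: returns true if `needle` occurs in `haystack`
/// as a contiguous run of tokens. An empty needle always matches.
fn contains_tokens(haystack: &[String], needle: &[String]) -> bool {
    if needle.is_empty() {
        // `windows(0)` would panic, so handle the empty case explicitly.
        return true;
    }
    haystack.windows(needle.len()).any(|window| window == needle)
}
```

Comparing tokens rather than raw substrings means punctuation and letter case in the original text do not affect the match, which is the general idea behind tokenize-then-match.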
Features
No tokenizer is enabled by default; opt in to each one you need via features.
- snowball - Enables tokenization for all languages supported by Snowball.
- japanese-ipadic-neologd-lindera - Enables tokenization for Japanese with the ipadic-neologd dictionary. Compiles slowly; if you don't need the quality this dictionary provides, consider using ipadic/unidic or even ICU.
- japanese-ipadic-lindera - Enables tokenization for Japanese with the ipadic dictionary. Compiles slowly; if you don't need the quality this dictionary provides, consider using ICU.
- japanese-unidic-lindera - Enables tokenization for Japanese with the unidic dictionary. Compiles slowly; if you don't need the quality this dictionary provides, consider using ICU.
- chinese-lindera - Enables tokenization for Chinese with the cc-cedict dictionary. Compiles slowly; if you don't need the quality this dictionary provides, consider using ICU.
- korean-lindera - Enables tokenization for Korean with the ko-dic dictionary. Compiles slowly; if you don't need the quality this dictionary provides, consider using ICU.
- japanese-icu - Enables tokenization for Japanese using the ICU dictionary.
- chinese-icu - Enables tokenization for Chinese using the ICU dictionary.
- southeast-asian - Enables tokenization for Southeast Asian languages, such as Burmese, Khmer, Lao and Thai, using LSTM segmentation.
- full - Shorthand for the snowball, japanese-ipadic-neologd-lindera, chinese-lindera, korean-lindera and southeast-asian features.
- serde - Serialization/deserialization support for some types.
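Since nothing is enabled by default, the features above have to be listed in the dependency declaration. A minimal Cargo.toml sketch, assuming the crate is published as language-tokenizer; the version is a placeholder and the chosen feature set is only an example:

```toml
[dependencies]
# "*" is a placeholder; pin the version you actually use.
# Feature names are taken from the list above; pick the ones you need.
language-tokenizer = { version = "*", features = ["snowball", "japanese-icu", "chinese-icu"] }
```

The ICU-based features are used here because the list above suggests them as the alternative to the lindera dictionaries, which compile slowly.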
License
The project is licensed under the WTFPL.