§Natural language alphabet detection library
§Detects 429 alphabets of 347 languages in 174 scripts
One language can be written in multiple scripts, and each combination is detected as a separate ScriptLanguage (language + script).
The library has no models and only matches alphabets, so it is not recommended as a standalone detector.
It works more like a word separator and language prefilter for an actual language detector (Langram).
It splits text (a CharIndices iterator) into words and detects the ScriptLanguages (language + script) of each word from the letters (chars) it uses.
ISO 639-3 (via Language) and ISO 15924 (via Script) are implemented, and the two are combined in ScriptLanguage.
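The idea of splitting text into words and matching each word against alphabets can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: `ToyLang`, the abbreviated alphabets, `split_words` and `candidates` are hypothetical and are not part of this crate, whose real tables cover far more languages and scripts.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for a (language + script) pair; NOT the crate's `ScriptLanguage`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ToyLang {
    English,
    German,
    Russian,
}

const ALL: [ToyLang; 3] = [ToyLang::English, ToyLang::German, ToyLang::Russian];

/// Hypothetical, heavily abbreviated alphabets (lowercase letters only).
fn alphabet(lang: ToyLang) -> &'static str {
    match lang {
        ToyLang::English => "abcdefghijklmnopqrstuvwxyz",
        ToyLang::German => "abcdefghijklmnopqrstuvwxyzäöüß",
        ToyLang::Russian => "абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
    }
}

/// Splits `text` into runs of alphabetic chars, keeping each word's byte offset.
pub fn split_words(text: &str) -> Vec<(usize, Vec<char>)> {
    let mut words = Vec::new();
    let mut current: Option<(usize, Vec<char>)> = None;
    for (idx, ch) in text.char_indices() {
        if ch.is_alphabetic() {
            // Lowercase so that matching against the toy alphabets ignores case.
            let lower = ch.to_lowercase().next().unwrap_or(ch);
            current.get_or_insert_with(|| (idx, Vec::new())).1.push(lower);
        } else if let Some(word) = current.take() {
            // A non-letter ends the current word.
            words.push(word);
        }
    }
    // Flush the last word if the text ended inside one.
    words.extend(current);
    words
}

/// Returns every toy language whose alphabet contains all chars of `word`.
pub fn candidates(word: &[char]) -> BTreeSet<ToyLang> {
    ALL.iter()
        .copied()
        .filter(|&lang| word.iter().all(|&ch| alphabet(lang).contains(ch)))
        .collect()
}
```

A word such as "test" fits both the English and the German toy alphabets, which is why alphabet matching alone can only narrow down candidates rather than pick a single language.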
§Examples
To split text into an iterator of Word:
use alphabet_detector::words;
let text = "test text";
let word_iter = words::from_ch_ind::<Vec<char>>(text.char_indices());
If you don’t need individual words and just want to analyze a full text:
use alphabet_detector::fulltext_filter_with_margin_sorted;
let text = "test text";
let (all_words, all_langs, _) = fulltext_filter_with_margin_sorted::<Vec<char>, 95>(text.char_indices());
It will give you all Words (Vec<Word<Vec<char>>>) of the text and a Vec<(ScriptLanguage, u32)> filtered with a margin of error of less than 5%.
Instead of Vec<char>, you can use other word types.
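To make the "sum up, then filter with a margin" step more concrete, here is a rough standard-library-only sketch. It assumes one possible reading of the margin: a candidate is kept when its count is at least `PERCENT`% of the top count. That reading, the string labels, and the names `sum_counts` and `toy_margin_filter_sorted` are all hypothetical; the crate's own functions may define the margin differently.

```rust
use std::collections::HashMap;

/// Sums per-word candidate lists into one count per label.
pub fn sum_counts(per_word: &[Vec<&'static str>]) -> HashMap<&'static str, u32> {
    let mut totals = HashMap::new();
    for word_langs in per_word {
        for &lang in word_langs {
            *totals.entry(lang).or_insert(0) += 1;
        }
    }
    totals
}

/// Keeps labels whose count is at least PERCENT% of the top count, sorted by
/// count descending (ties broken by label so the order is deterministic).
/// ASSUMPTION: this interpretation of "margin" is illustrative only.
pub fn toy_margin_filter_sorted<const PERCENT: u32>(
    totals: &HashMap<&'static str, u32>,
) -> Vec<(&'static str, u32)> {
    let max = totals.values().copied().max().unwrap_or(0);
    let mut kept: Vec<(&'static str, u32)> = totals
        .iter()
        .map(|(&lang, &count)| (lang, count))
        // Integer form of count / max >= PERCENT / 100, avoiding floats.
        .filter(|&(_, count)| {
            count > 0 && u64::from(count) * 100 >= u64::from(max) * u64::from(PERCENT)
        })
        .collect();
    kept.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    kept
}
```

Under this assumed reading, a threshold of 95 keeps only candidates within 5% of the best one, while lower thresholds let weaker candidates through for a downstream detector to decide.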
Re-exports§
pub use ch_norm::CharData;
pub use ch_norm::CharNormalizingIterator;
pub use words::Word;
pub use words::WordIterator;
Modules§
Structs§
- LanguageIter - An iterator over the variants of [Language]
- ScriptIter - An iterator over the variants of [Script]
- ScriptLanguageIter - An iterator over the variants of [ScriptLanguage]
Enums§
- Language - Int representation is unstable and can be changed anytime. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- Script - Has aliases in comparison to UcdScript. Int representation is unstable and can be changed anytime. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- ScriptLanguage - Language + script. Ordered by total speakers. Variant names do not always reflect the script used, so a “default” script can change. Int representation is unstable and can be changed anytime. Parts representation (const into_parts/from_parts), code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- UcdScript - Int representation is unstable and can be changed anytime. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
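The difference between an unstable int representation and a more stable code representation can be sketched with a toy enum. `ToyLanguage` and the signatures below are assumptions made for illustration; the crate's actual `into_code`/`from_code` signatures and code types may differ.

```rust
/// Hypothetical three-variant enum; the crate's `Language` has many more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ToyLanguage {
    English,
    German,
    Russian,
}

impl ToyLanguage {
    /// ISO 639-3 code. Unlike `self as u8`, it does not depend on the
    /// declaration order of the variants.
    pub const fn into_code(self) -> &'static str {
        match self {
            Self::English => "eng",
            Self::German => "deu",
            Self::Russian => "rus",
        }
    }

    /// Inverse of `into_code`; returns `None` for codes this toy enum lacks.
    pub const fn from_code(code: &str) -> Option<Self> {
        // Byte slice patterns keep this usable in a `const fn`.
        match code.as_bytes() {
            [b'e', b'n', b'g'] => Some(Self::English),
            [b'd', b'e', b'u'] => Some(Self::German),
            [b'r', b'u', b's'] => Some(Self::Russian),
            _ => None,
        }
    }
}
```

Persisting `ToyLanguage::German as u8` would break as soon as a variant is inserted before `German`, whereas the stored code "deu" would still round-trip through `from_code`.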
Traits§
- EnumCount - A trait for capturing the number of variants in Enum. This trait can be autoderived by strum_macros.
- IntoEnumIterator - This trait designates that an Enum can be iterated over. It can be auto generated using the EnumIter derive macro.
Functions§
- filter_max - Only top ScriptLanguages are retained.
- filter_with_margin - Only top (100 - PERCENT)% ScriptLanguages are retained.
- filter_with_margin_sorted - Only top (100 - PERCENT)% ScriptLanguages are retained, then sorted.
- fulltext - All words detection summed up.
- fulltext_filter_max - All words detection summed up, then filtered by max (filter_max).
- fulltext_filter_with_margin - All words detection summed up, then filtered with margin percent (filter_with_margin).
- fulltext_filter_with_margin_sorted - All words detection summed up, then filtered with margin percent (filter_with_margin_sorted), then sorted.
- script_char_to_slangs - Returns all ScriptLanguages by UcdScript and char
- slang_arr_default
- slang_arr_default_nc
- slangs_count_max
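As a rough illustration of what "only top ScriptLanguages are retained" could mean for a max filter, the sketch below keeps every entry tied for the highest count. The name `toy_filter_max`, the string labels, and the tie handling are assumptions; the crate's filter_max may behave differently.

```rust
/// Keeps only the entries whose count equals the highest count.
/// ASSUMPTION: all ties are kept and zero counts are dropped.
pub fn toy_filter_max(counts: &[(&'static str, u32)]) -> Vec<(&'static str, u32)> {
    let max = counts.iter().map(|&(_, count)| count).max().unwrap_or(0);
    counts
        .iter()
        .copied()
        .filter(|&(_, count)| count == max && count > 0)
        .collect()
}
```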