Crate alphabet_detector


§Natural language alphabet detection library

§Detects 402 alphabets of 325 languages in 170 scripts

A language can be written in more than one script; each language and script pair is detected as a separate ScriptLanguage (language + script).

The crate contains no models; it only matches characters against alphabets. It is not recommended as a standalone detector. It works more like a word separator plus language prefilter for an actual language detector (Langram).

It splits text (a CharIndices iterator) into words and detects the ScriptLanguages (language + script) of each word from the letters (chars) it uses.

ISO 639-3 is implemented by Language and ISO 15924 by Script; ScriptLanguage combines the two.

§Examples

To split text into an iterator of Word values:

use alphabet_detector::words;

let text = "test text";
let word_iter = words::from_ch_ind::<Vec<char>>(text.char_indices());
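
To make the idea of splitting a CharIndices stream into words concrete, here is a minimal std-only sketch. It does not use or reproduce this crate's implementation: the function name split_words is hypothetical, and treating a word as a maximal run of alphabetic chars is an assumption made for illustration only.

```rust
/// Splits text into words, keeping the byte offset where each word starts.
/// Assumption: a "word" is a maximal run of alphabetic chars. The crate's
/// real splitting rules may differ.
fn split_words(text: &str) -> Vec<(usize, Vec<char>)> {
    let mut words = Vec::new();
    // The word currently being collected, if any.
    let mut current: Option<(usize, Vec<char>)> = None;

    for (idx, ch) in text.char_indices() {
        if ch.is_alphabetic() {
            match current.as_mut() {
                // Extend the word in progress.
                Some((_, buf)) => buf.push(ch),
                // Start a new word at this byte offset.
                None => current = Some((idx, vec![ch])),
            }
        } else if let Some(word) = current.take() {
            // A non-letter ends the word in progress.
            words.push(word);
        }
    }
    // Flush a trailing word that ran to the end of the text.
    if let Some(word) = current {
        words.push(word);
    }
    words
}
```

Calling split_words("test text") yields two entries, one per word, each paired with its starting byte offset. Offsets are byte positions, as produced by char_indices, so they advance by more than one per char in non-ASCII text.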

If you don’t need individual words and just want to analyze the full text:

use alphabet_detector::fulltext_filter_with_margin_sorted;

let text = "test text";
let (all_words, all_langs, _) = fulltext_filter_with_margin_sorted::<Vec<char>, 95>(text.char_indices());

This returns all Words of the text (Vec<Word<Vec<char>>>) and a Vec<(ScriptLanguage, u32)> filtered with a margin of error of less than 5%.

Other word types can be used in place of Vec<char>.
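
The margin filtering can be pictured with the following std-only sketch. It is one possible reading, not the crate's code: the function name keep_within_margin_sorted, the use of string labels in place of ScriptLanguage, the u32 type of the const parameter, and the rule "keep every entry whose count is at least PERCENT% of the highest count" are all assumptions.

```rust
/// Keeps entries whose count is at least PERCENT% of the highest count,
/// then sorts them by count in descending order.
/// Assumption: this threshold rule is an illustration, not the crate's
/// documented behavior.
fn keep_within_margin_sorted<const PERCENT: u32>(
    mut counts: Vec<(&'static str, u32)>,
) -> Vec<(&'static str, u32)> {
    let max = counts.iter().map(|&(_, n)| n).max().unwrap_or(0);
    // Integer form of `n >= max * PERCENT / 100`, widened to u64 so the
    // multiplication cannot overflow and no rounding is involved.
    counts.retain(|&(_, n)| {
        n > 0 && u64::from(n) * 100 >= u64::from(max) * u64::from(PERCENT)
    });
    // Highest count first; the sort is stable, so ties keep input order.
    counts.sort_by(|a, b| b.1.cmp(&a.1));
    counts
}
```

With PERCENT set to 95, as in keep_within_margin_sorted::<95>(counts), an entry with count 95 survives next to a top count of 100, while an entry with count 50 is dropped.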

Re-exports§

pub use ch_norm::CharData;
pub use ch_norm::CharNormalizingIterator;
pub use words::Word;
pub use words::WordIterator;

Modules§

ch_norm
reader
ucd
words

Structs§

LanguageIter
An iterator over the variants of Language
ScriptIter
An iterator over the variants of Script
ScriptLanguageIter
An iterator over the variants of ScriptLanguage

Enums§

Language
The integer representation is unstable and can change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
Script
Has aliases in comparison to UcdScript. The integer representation is unstable and can change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
ScriptLanguage
Language + script. Ordered by total speakers. Variant names do not always indicate the script used, so the “default” script can change. The integer representation is unstable and can change at any time. The parts representation (const into_parts/from_parts), the code representation (const into_code/from_code), and the string representation (const into_str/from_str) are more stable.
UcdScript
The integer representation is unstable and can change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
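
The stability notes on these enums suggest persisting a code or string rather than an integer cast. The sketch below shows why, using a hypothetical two-variant enum. ToyScript is not part of this crate, and the signatures here are assumptions that only borrow the into_code/from_code naming; the ISO 15924 codes Cyrl and Latn are real.

```rust
/// Hypothetical stand-in for an enum such as Script (not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyScript {
    Cyrillic,
    Latin,
}

impl ToyScript {
    /// Returns the ISO 15924 code. This stays valid even if variants are
    /// reordered or new ones are inserted, unlike `self as u8`.
    const fn into_code(self) -> &'static str {
        match self {
            Self::Cyrillic => "Cyrl",
            Self::Latin => "Latn",
        }
    }

    /// Parses an ISO 15924 code. Written as a plain fn for simplicity;
    /// the crate documents its own from_code as const.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "Cyrl" => Some(Self::Cyrillic),
            "Latn" => Some(Self::Latin),
            _ => None,
        }
    }
}
```

A value stored as "Latn" round-trips through from_code regardless of declaration order, whereas a stored integer discriminant would silently point at a different variant after a reorder.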

Functions§

filter_max
Only top ScriptLanguages are retained.
filter_with_margin
Only top (100 - PERCENT)% ScriptLanguages are retained.
filter_with_margin_sorted
Only top (100 - PERCENT)% ScriptLanguages are retained, then sorted.
fulltext
All words detection summed up.
fulltext_filter_max
All words detection summed up, then filtered by max (filter_max).
fulltext_filter_with_margin
All words detection summed up, then filtered with margin percent (filter_with_margin).
fulltext_filter_with_margin_sorted
All words detection summed up, then filtered with margin percent (filter_with_margin_sorted), then sorted.
script_char_to_slangs
Returns all ScriptLanguages for a given UcdScript and char.
slang_arr_default
slang_arr_default_nc
slangs_count_max

Type Aliases§

ScriptLanguageArr