§Natural language alphabet detection library
§Detects 402 alphabets of 325 languages in 170 scripts
One language can be written in multiple scripts, so it will be detected as a different ScriptLanguage (language + script).
It has no models; it only matches alphabets. It is not recommended as a standalone detector.
It works more like a word separator + language prefilter for an actual language detector (Langram).
Splits text (a CharIndices iterator) into words, and detects the ScriptLanguages (language + script) of each word from the letters (chars) it uses.
ISO 639-3 (using Language) and ISO 15924 (using Script) are implemented, also combined using ScriptLanguage.
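The word-splitting step described above can be illustrated with a standard-library-only sketch. This is not the crate's implementation: `split_words` is a hypothetical helper, and it assumes that a word is simply a maximal run of alphabetic chars, which is likely cruder than what the crate's WordIterator does.

```rust
/// Hypothetical helper (not part of alphabet_detector): splits text into
/// words, where a word is assumed to be a maximal run of alphabetic chars.
/// Each word is returned with the byte offset of its first char.
fn split_words(text: &str) -> Vec<(usize, Vec<char>)> {
    let mut words = Vec::new();
    let mut current: Vec<char> = Vec::new();
    let mut start = 0;
    for (idx, ch) in text.char_indices() {
        if ch.is_alphabetic() {
            if current.is_empty() {
                // Remember where this word begins.
                start = idx;
            }
            current.push(ch);
        } else if !current.is_empty() {
            // A non-letter ends the current word.
            words.push((start, std::mem::take(&mut current)));
        }
    }
    // Flush the last word if the text ends with a letter.
    if !current.is_empty() {
        words.push((start, current));
    }
    words
}

fn main() {
    for (offset, word) in split_words("test text") {
        let s: String = word.iter().collect();
        println!("{offset}: {s}");
    }
}
```

Feeding `char_indices()` rather than plain chars keeps the byte offsets, so each word can be mapped back to its position in the original text.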
§Examples
To split text into an iterator of Word:
use alphabet_detector::words;
let text = "test text";
let word_iter = words::from_ch_ind::<Vec<char>>(text.char_indices());
If you don’t need individual words, but just want to analyze a full text:
use alphabet_detector::fulltext_filter_with_margin_sorted;
let text = "test text";
let (all_words, all_langs, _) = fulltext_filter_with_margin_sorted::<Vec<char>, 95>(text.char_indices());
It will give you all Words (Vec<Word<Vec<char>>>) of text and a Vec<(ScriptLanguage, u32)> filtered with a less than 5% margin for an error.
Instead of Vec<char> you can use other types of words.
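To make the "sum up, then filter with a margin" idea concrete, here is a standard-library-only sketch. It does not use the crate and does not claim to reproduce its exact rule: `sum_counts` and `filter_with_margin_sorted_demo` are hypothetical names, plain string labels stand in for ScriptLanguage, and the filter assumes that an entry is kept when its count is at least PERCENT% of the highest count.

```rust
use std::collections::HashMap;

/// Hypothetical: sums per-word candidate labels into one count per label.
fn sum_counts<'a>(per_word: &[Vec<&'a str>]) -> HashMap<&'a str, u32> {
    let mut totals: HashMap<&'a str, u32> = HashMap::new();
    for candidates in per_word {
        for &label in candidates {
            *totals.entry(label).or_insert(0) += 1;
        }
    }
    totals
}

/// Hypothetical: keeps labels whose count is at least PERCENT% of the
/// highest count (an assumed margin rule), sorted by count descending,
/// with ties broken by label.
fn filter_with_margin_sorted_demo<'a, const PERCENT: u32>(
    totals: HashMap<&'a str, u32>,
) -> Vec<(&'a str, u32)> {
    let max = totals.values().copied().max().unwrap_or(0);
    let mut kept: Vec<(&str, u32)> = totals
        .into_iter()
        .filter(|&(_, count)| count * 100 >= max * PERCENT)
        .collect();
    kept.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    kept
}

fn main() {
    // Each inner Vec lists the candidate labels for one word.
    let per_word = vec![vec!["eng", "deu"], vec!["eng"], vec!["eng", "fra"]];
    let kept = filter_with_margin_sorted_demo::<95>(sum_counts(&per_word));
    println!("{kept:?}");
}
```

Taking the percentage as a const generic mirrors the `::<Vec<char>, 95>` call shape shown above; a lower value keeps more candidates for a downstream detector to choose from.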
Re-exports§
pub use ch_norm::CharData;
pub use ch_norm::CharNormalizingIterator;
pub use words::Word;
pub use words::WordIterator;
Modules§
Structs§
- LanguageIter - An iterator over the variants of [Language]
- ScriptIter - An iterator over the variants of [Script]
- ScriptLanguageIter - An iterator over the variants of [ScriptLanguage]
Enums§
- Language - Int representation is unstable and can change at any time. Code representation (const into_code / from_code) or string representation (const into_str / from_str) is more stable.
- Script - Has aliases in comparison to UcdScript. Int representation is unstable and can change at any time. Code representation (const into_code / from_code) or string representation (const into_str / from_str) is more stable.
- ScriptLanguage - Language + script. Ordered by total speakers. Value names do not always represent the script used, so a “default” script can change. Int representation is unstable and can change at any time. Parts representation (const into_parts / from_parts), code representation (const into_code / from_code) or string representation (const into_str / from_str) is more stable.
- UcdScript - Int representation is unstable and can change at any time. Code representation (const into_code / from_code) or string representation (const into_str / from_str) is more stable.
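The advice to prefer code or string representations over integers can be illustrated with a small sketch. `DemoScript` is a hypothetical three-variant enum, not one of the crate's types, and the signatures of its `into_code` and `from_code` methods are assumptions made for this example; only the ISO 15924 codes themselves are real.

```rust
/// Hypothetical stand-in for a script enum (not the crate's Script).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoScript {
    Cyrillic,
    Greek,
    Latin,
}

impl DemoScript {
    /// Assumed signature: returns the ISO 15924 four-letter code.
    const fn into_code(self) -> &'static str {
        match self {
            Self::Cyrillic => "Cyrl",
            Self::Greek => "Grek",
            Self::Latin => "Latn",
        }
    }

    /// Assumed signature: parses an ISO 15924 code, None if unknown here.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "Cyrl" => Some(Self::Cyrillic),
            "Grek" => Some(Self::Greek),
            "Latn" => Some(Self::Latin),
            _ => None,
        }
    }
}

fn main() {
    let script = DemoScript::Latin;
    // The discriminant depends on declaration order, so it would change
    // if a variant were inserted before Latin.
    println!("int (unstable): {}", script as u32);
    // The code survives reordering and is safe to store or exchange.
    let stored = script.into_code();
    println!("code (stable): {stored}");
    assert_eq!(DemoScript::from_code(stored), Some(script));
}
```

Persisting the code instead of the integer means stored data keeps its meaning when variants are added or reordered in a later release.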
Functions§
- filter_max - Only top ScriptLanguages are retained.
- filter_with_margin - Only top (100 - PERCENT)% ScriptLanguages are retained.
- filter_with_margin_sorted - Only top (100 - PERCENT)% ScriptLanguages are retained, then sorted.
- fulltext - All words detection summed up.
- fulltext_filter_max - All words detection summed up, then filtered by max (filter_max).
- fulltext_filter_with_margin - All words detection summed up, then filtered with margin percent (filter_with_margin).
- fulltext_filter_with_margin_sorted - All words detection summed up, then filtered with margin percent (filter_with_margin_sorted), then sorted.
- script_char_to_slangs - Returns all ScriptLanguages by UcdScript and char
- slang_arr_default
- slang_arr_default_nc
- slangs_count_max