§Natural language detection library
§321 ScriptLanguages (187 models + 134 single language scripts)
One language can be written in multiple scripts, in which case each combination is detected as a separate ScriptLanguage (language + script).
ISO 639-3 (via Language) and ISO 15924 (via Script) are implemented, and the two are combined in ScriptLanguage.
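The language + script pairing described above can be pictured with a small standalone sketch. The `DemoLanguage`, `DemoScript` and `DemoScriptLanguage` types and their methods below are hypothetical stand-ins written for this illustration, not the crate's own enums. Only the ISO codes (`srp`, `eng`, `Cyrl`, `Latn`) are taken from the standards.

```rust
// Hypothetical stand-ins for illustration only; they are NOT the crate's
// `Language`, `Script` and `ScriptLanguage` enums.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoLanguage {
    Serbian,
    English,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoScript {
    Cyrillic,
    Latin,
}

// One entry per (language, script) combination that can be detected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoScriptLanguage {
    Serbian,      // Serbian written in Cyrillic
    SerbianLatin, // Serbian written in Latin
    English,
}

impl DemoLanguage {
    // ISO 639-3 code of the language.
    const fn iso_639_3(self) -> &'static str {
        match self {
            DemoLanguage::Serbian => "srp",
            DemoLanguage::English => "eng",
        }
    }
}

impl DemoScript {
    // ISO 15924 code of the script.
    const fn iso_15924(self) -> &'static str {
        match self {
            DemoScript::Cyrillic => "Cyrl",
            DemoScript::Latin => "Latn",
        }
    }
}

impl DemoScriptLanguage {
    // Splits the combined value into its language and script parts.
    const fn into_parts(self) -> (DemoLanguage, DemoScript) {
        match self {
            DemoScriptLanguage::Serbian => (DemoLanguage::Serbian, DemoScript::Cyrillic),
            DemoScriptLanguage::SerbianLatin => (DemoLanguage::Serbian, DemoScript::Latin),
            DemoScriptLanguage::English => (DemoLanguage::English, DemoScript::Latin),
        }
    }
}

fn main() {
    let all = [
        DemoScriptLanguage::Serbian,
        DemoScriptLanguage::SerbianLatin,
        DemoScriptLanguage::English,
    ];
    for sl in all {
        let (language, script) = sl.into_parts();
        println!("{:?} = {} + {}", sl, language.iso_639_3(), script.iso_15924());
    }
}
```

The two Serbian entries share a language but differ in script, which is why they count as separate detection targets.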
§Setup
To use this library, you need a binary models file. Place it next to the executable, or set LANGRAM_MODELS_PATH to point to it.
It can be:
- Downloaded from langram_models releases;
- Built from langram_models (recommended for big-endian targets). This is more advanced: it lets you remove model ngrams and recompile, so that the models binary is lighter.
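Before constructing the storage, it can help to check where the models file is expected. The following is a minimal sketch of the "override, else next to the executable" lookup described above. It rests on several assumptions: that LANGRAM_MODELS_PATH is an environment variable read at run time, that it holds the path of the file itself, and that the file is called `langram_models.bin` (a hypothetical name). The `resolve_models_path` helper is also hypothetical and does not reproduce the crate's own lookup.

```rust
use std::env;
use std::ffi::OsString;
use std::path::{Path, PathBuf};

// Hypothetical file name, used only for this illustration.
const MODELS_FILE_NAME: &str = "langram_models.bin";

/// Picks the models file location: a non-empty override wins,
/// otherwise fall back to a file in the executable's directory.
fn resolve_models_path(env_override: Option<OsString>, exe_dir: &Path) -> PathBuf {
    match env_override {
        // Assumption: the variable holds the path of the file itself.
        Some(path) if !path.is_empty() => PathBuf::from(path),
        _ => exe_dir.join(MODELS_FILE_NAME),
    }
}

fn main() {
    let exe = env::current_exe().expect("cannot locate the current executable");
    let exe_dir = exe.parent().unwrap_or(Path::new(".")).to_path_buf();

    let path = resolve_models_path(env::var_os("LANGRAM_MODELS_PATH"), &exe_dir);
    println!("models file expected at: {}", path.display());
    println!("exists: {}", path.exists());
}
```

Passing the override in as a parameter keeps the helper free of global state, so it can be checked without touching the process environment.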
§Example
use langram::{DetectorBuilder, ModelsStorage};
let models_storage = ModelsStorage::new().unwrap();
let detector = DetectorBuilder::new(&models_storage).build();
// single thread
let text = "text";
let result = detector.detect_top_one_reordered(text);
// or multithreaded (rayon for example)
use rayon::iter::IntoParallelRefIterator;
use rayon::iter::ParallelIterator;
let texts = &["text1", "text2"];
let results: Vec<_> = texts
    .par_iter()
    .map(|text| detector.detect_top_one_reordered(text))
    .collect();

The detector also has other methods.
Re-exports§
pub use ngram_size::NgramSize;
Modules§
Macros§
Structs§
Enums§
- Language: Int representation is unstable and can change at any time. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- ModelsStorageError
- Script: Has aliases in comparison to UcdScript. Int representation is unstable and can change at any time. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- ScriptLanguage: Language + script. Ordered by total speakers. Value names do not always reflect the script used, so a “default” script can change. Int representation is unstable and can change at any time. Parts representation (const into_parts/from_parts), code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
- UcdScript: Int representation is unstable and can change at any time. Code representation (const into_code/from_code) or string representation (const into_str/from_str) are more stable.
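The stability notes above matter mostly when values are persisted or exchanged between builds. The sketch below shows why, using two hypothetical versions of a toy enum (`LangV1`, `LangV2`). Neither the types nor the signatures are the crate's own, and the real `into_code`/`from_code` functions may differ in shape.

```rust
// "Old" build of a hypothetical enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LangV1 {
    English,
    Spanish,
}

// "New" build: a variant was inserted, so later discriminants shift.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LangV2 {
    English,
    French,
    Spanish,
}

impl LangV1 {
    // Stable representation: an ISO 639-3 code.
    fn into_code(self) -> &'static str {
        match self {
            LangV1::English => "eng",
            LangV1::Spanish => "spa",
        }
    }
}

impl LangV2 {
    // Parses the stable code back into the newer enum.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "eng" => Some(LangV2::English),
            "fra" => Some(LangV2::French),
            "spa" => Some(LangV2::Spanish),
            _ => None,
        }
    }
}

fn main() {
    // The int value stored by the old build no longer means Spanish.
    println!("old int for Spanish: {}", LangV1::Spanish as u8);
    println!("new int for Spanish: {}", LangV2::Spanish as u8);

    // The code stored by the old build still round-trips correctly.
    let stored = LangV1::Spanish.into_code();
    println!("{} -> {:?}", stored, LangV2::from_code(stored));
}
```

Storing the code or string form, rather than the integer, is what keeps saved data readable after the enum changes.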