Crate langram

§Natural language detection library

§308 ScriptLanguages (187 models + 121 single language scripts)

A language can be written in more than one script. Each language and script combination is detected as a separate ScriptLanguage.

ISO 639-3 is implemented by Language and ISO 15924 by Script; ScriptLanguage combines the two.
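To make the language and script pairing concrete, here is a minimal sketch of the idea. The types `Lang`, `Scr`, and `LangScript` are hypothetical stand-ins written for this illustration and are not langram's own enums; only the `into_parts`/`from_parts` naming is borrowed from the descriptions on this page.

```rust
// Hypothetical stand-ins for illustration only; these are not langram's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Serbian,
    English,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Scr {
    Cyrillic,
    Latin,
}

// One variant per (language, script) combination that can be detected.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LangScript {
    SerbianCyrillic,
    SerbianLatin,
    EnglishLatin,
}

impl LangScript {
    // Split a combined value into its language and script parts.
    const fn into_parts(self) -> (Lang, Scr) {
        match self {
            LangScript::SerbianCyrillic => (Lang::Serbian, Scr::Cyrillic),
            LangScript::SerbianLatin => (Lang::Serbian, Scr::Latin),
            LangScript::EnglishLatin => (Lang::English, Scr::Latin),
        }
    }

    // Recombine the parts; returns None for pairs this sketch does not cover.
    const fn from_parts(lang: Lang, script: Scr) -> Option<Self> {
        match (lang, script) {
            (Lang::Serbian, Scr::Cyrillic) => Some(LangScript::SerbianCyrillic),
            (Lang::Serbian, Scr::Latin) => Some(LangScript::SerbianLatin),
            (Lang::English, Scr::Latin) => Some(LangScript::EnglishLatin),
            _ => None,
        }
    }
}
```

In this sketch the same language (Serbian) yields two distinct combined values, one per script, which mirrors the behaviour described above.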

§Example

use langram::*;

let models_storage = ModelsStorage::default();
let detector = DetectorBuilder::new(&models_storage).build();
// preload models for faster detection
detector.preload_models();

// single thread
let text = "text";
let result = detector.detect_top_one(text, 0.0);

// or multithreaded (rayon for example)
use rayon::iter::IntoParallelRefIterator;
use rayon::iter::ParallelIterator;

let texts = &["text1", "text2"];
let results: Vec<_> = texts
    .par_iter()
    .map(|text| detector.detect_top_one(text, 0.0))
    .collect();
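If rayon is not available, the same sharing pattern can be sketched with scoped threads from the standard library. Everything below is an assumption made for illustration: `StubDetector` and its toy `detect_script` rule are hypothetical and do not use langram at all. The sketch only shows that a single shared reference can serve several threads.

```rust
use std::thread;

// Hypothetical stand-in for a detector; langram's real Detector is not used here.
struct StubDetector;

impl StubDetector {
    // Toy rule: report "Cyrl" if any character is in the basic Cyrillic block,
    // otherwise "Latn". Real detection is far more involved.
    fn detect_script(&self, text: &str) -> &'static str {
        if text.chars().any(|c| ('\u{0400}'..='\u{04FF}').contains(&c)) {
            "Cyrl"
        } else {
            "Latn"
        }
    }
}

// Run detection for every text on its own scoped thread, sharing one detector
// by reference. Results come back in the same order as the inputs.
fn detect_all(detector: &StubDetector, texts: &[&str]) -> Vec<&'static str> {
    thread::scope(|s| {
        // Spawn all threads first so they can run concurrently.
        let handles: Vec<_> = texts
            .iter()
            .map(|text| s.spawn(move || detector.detect_script(text)))
            .collect();
        // Then join them in input order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Spawning one thread per text is only reasonable for a handful of inputs; a thread pool such as rayon's scales better for large batches.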

Detector provides other methods as well.

Structs§

Detector
DetectorBuilder
Fraction
ModelsStorage
Uses around 4.1 GB of RAM with all models preloaded.

Enums§

Language
The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
NgramSize
Script
The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
ScriptLanguage
Language + script. Variant names do not always indicate the script used, so the “default” script may change. The integer representation is unstable and may change at any time. The parts representation (const into_parts/from_parts), the code representation (const into_code/from_code), and the string representation (const into_str/from_str) are more stable.
UcdScript
The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
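The stability notes above suggest persisting codes or strings rather than integer casts. The following is a minimal sketch of that practice using a hypothetical `DemoLanguage` enum; its variants, method signatures, and the `save`/`load` helpers are assumptions for illustration and are not taken from langram. The ISO 639-3 codes shown ("eng", "fra", "deu") are the standard ones.

```rust
// Hypothetical stand-in enum; variants and signatures are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DemoLanguage {
    English,
    French,
    German,
}

impl DemoLanguage {
    // ISO 639-3 code. Unlike `self as u8`, this value does not shift when
    // variants are reordered or new ones are inserted.
    const fn into_code(self) -> &'static str {
        match self {
            DemoLanguage::English => "eng",
            DemoLanguage::French => "fra",
            DemoLanguage::German => "deu",
        }
    }

    // Parse a code back into a variant; unknown codes yield None.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "eng" => Some(DemoLanguage::English),
            "fra" => Some(DemoLanguage::French),
            "deu" => Some(DemoLanguage::German),
            _ => None,
        }
    }
}

// Persist the code, never the integer discriminant.
fn save(lang: DemoLanguage) -> String {
    lang.into_code().to_string()
}

// Restore a value that was written by `save`.
fn load(stored: &str) -> Option<DemoLanguage> {
    DemoLanguage::from_code(stored)
}
```

Data written this way stays readable after the enum's variant order changes, which is the failure mode an integer cast would expose.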

Type Aliases§

FileModel