Crate langram


§Natural language detection library

§321 ScriptLanguages (187 models + 134 single language scripts)

One language can be written in multiple scripts; each combination is detected as a separate ScriptLanguage (language + script).

ISO 639-3 (via Language) and ISO 15924 (via Script) are implemented, and the two are combined in ScriptLanguage.
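The language + script pairing can be pictured with a small sketch. The types below (ToyLanguage, ToyScript, ToyScriptLanguage) and the label helper are hypothetical stand-ins written for this illustration, not langram's own definitions; only the ISO codes ("srp", "eng", "Cyrl", "Latn") are real.

```rust
// Toy types for illustration only; langram's real Language, Script and
// ScriptLanguage enums are defined by the crate and are far larger.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ToyLanguage {
    Serbian,
    English,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ToyScript {
    Cyrillic,
    Latin,
}

// A (language, script) pair: the same language in two scripts yields two values.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ToyScriptLanguage {
    language: ToyLanguage,
    script: ToyScript,
}

// Formats a pair as "<ISO 639-3 code>-<ISO 15924 code>".
fn label(pair: ToyScriptLanguage) -> String {
    let language = match pair.language {
        ToyLanguage::Serbian => "srp",
        ToyLanguage::English => "eng",
    };
    let script = match pair.script {
        ToyScript::Cyrillic => "Cyrl",
        ToyScript::Latin => "Latn",
    };
    format!("{}-{}", language, script)
}

fn main() {
    let cyrillic = ToyScriptLanguage { language: ToyLanguage::Serbian, script: ToyScript::Cyrillic };
    let latin = ToyScriptLanguage { language: ToyLanguage::Serbian, script: ToyScript::Latin };
    let english = ToyScriptLanguage { language: ToyLanguage::English, script: ToyScript::Latin };

    // Same language, different script: two distinct pairs.
    assert_eq!(cyrillic.language, latin.language);
    assert_ne!(cyrillic, latin);

    println!("{} {} {}", label(cyrillic), label(latin), label(english));
}
```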

§Setup

To use this library, you need a binary models file. Either place it near the executable or set LANGRAM_MODELS_PATH.

It can be:

  • Downloaded from langram_models releases;

  • Built from langram_models (recommended for big-endian targets). This is the more advanced option: it lets you remove model ngrams and recompile, so that the models binary is lighter.
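Before constructing a ModelsStorage, it can help to check which of the two setup options is in effect. The sketch below uses only the standard library and assumes LANGRAM_MODELS_PATH is read as an environment variable. The override_from helper is hypothetical, and treating an empty value as unset is a choice of this sketch, not documented langram behaviour.

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

// Name of the variable mentioned in the setup notes.
const MODELS_PATH_VAR: &str = "LANGRAM_MODELS_PATH";

// Hypothetical helper: turns a raw variable value into a path.
// An empty value is treated as "not set" (an assumption of this sketch).
fn override_from(raw: Option<OsString>) -> Option<PathBuf> {
    raw.filter(|value| !value.is_empty()).map(PathBuf::from)
}

fn main() {
    match override_from(env::var_os(MODELS_PATH_VAR)) {
        Some(path) => println!("{} is set to {}", MODELS_PATH_VAR, path.display()),
        None => println!(
            "{} is not set; the models file is expected near the executable",
            MODELS_PATH_VAR
        ),
    }
}
```

Keeping the helper free of direct environment access makes it easy to exercise with plain values.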

§Example

use langram::{DetectorBuilder, ModelsStorage};

let models_storage = ModelsStorage::new().unwrap();
let detector = DetectorBuilder::new(&models_storage).build();

// single thread
let text = "text";
let result = detector.detect_top_one_reordered(text);

// or multithreaded (rayon for example)
use rayon::iter::IntoParallelRefIterator;
use rayon::iter::ParallelIterator;

let texts = &["text1", "text2"];
let results: Vec<_> = texts
    .par_iter()
    .map(|text| detector.detect_top_one_reordered(text))
    .collect();

The detector also has other methods.

Re-exports§

pub use ngram_size::NgramSize;

Modules§

bin_storage
model
ngram_size

Macros§

ahashset

Structs§

Detector
DetectorBuilder
ModelsStorage

Enums§

Language
The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
ModelsStorageError
Script
Has aliases, unlike UcdScript. The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
ScriptLanguage
Language + script. Ordered by total speakers. Variant names do not always reflect the script used, so a “default” script can change. The integer representation is unstable and may change at any time. The parts representation (const into_parts/from_parts), the code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
UcdScript
The integer representation is unstable and may change at any time. The code representation (const into_code/from_code) and the string representation (const into_str/from_str) are more stable.
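The stability advice for these enums can be illustrated with a toy enum. ToyScript below is a hypothetical stand-in, not langram's Script. Its into_code/from_code methods only mirror the names mentioned above, and from_code is a plain (non-const) function here to keep the sketch short. The point is that a persisted code survives reordering of variants, while a cast to an integer does not.

```rust
// Toy enum for illustration only; not langram's Script.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyScript {
    Arabic,
    Cyrillic,
    Latin,
}

impl ToyScript {
    // ISO 15924 code: unaffected by the order in which variants are declared.
    const fn into_code(self) -> &'static str {
        match self {
            ToyScript::Arabic => "Arab",
            ToyScript::Cyrillic => "Cyrl",
            ToyScript::Latin => "Latn",
        }
    }

    // Inverse of into_code; unknown codes yield None.
    fn from_code(code: &str) -> Option<Self> {
        match code {
            "Arab" => Some(ToyScript::Arabic),
            "Cyrl" => Some(ToyScript::Cyrillic),
            "Latn" => Some(ToyScript::Latin),
            _ => None,
        }
    }
}

fn main() {
    // Persist the code, then read it back: the round trip is lossless.
    let stored = ToyScript::Cyrillic.into_code();
    assert_eq!(ToyScript::from_code(stored), Some(ToyScript::Cyrillic));

    // The discriminant depends on declaration order, so inserting or
    // reordering variants would silently change this number.
    let fragile = ToyScript::Cyrillic as u8;
    println!("code = {}, discriminant today = {}", stored, fragile);
}
```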