whatlang 0.4.0

Natural language detection library. Identifies language of a given text.

Whatlang


Natural language detection for Rust with focus on simplicity and performance.

Features

  • Supports 84 languages
  • 100% written in Rust
  • Lightweight, fast and simple
  • Recognizes not only a language, but also a script (Latin, Cyrillic, etc)
  • Provides reliability information
  • No external dependencies (apart from the fnv hasher, which gives a 30% boost)

Get started

Add to your Cargo.toml:

[dependencies]
whatlang = "0.4.0"

Example:

use whatlang::{detect, Lang, Script};

fn main() {
    let text = "Ĉu vi ne volas eklerni Esperanton? Bonvolu! Estas unu de la plej bonaj aferoj!";
    // detect() returns an Option; unwrap() panics if no language could be detected.
    let info = detect(text).unwrap();
    assert_eq!(info.lang(), Lang::Epo);
    assert_eq!(info.script(), Script::Latin);
    assert!(info.is_reliable());
}

For more details (e.g. how to blacklist some languages) please check the documentation.

How does it work?

How does language recognition work?

The algorithm is based on trigram language models, which are a particular case of n-grams. To understand the idea, please check the original paper by Cavnar and Trenkle ('94): "N-Gram-Based Text Categorization".
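To make the idea more concrete, here is a minimal sketch of a trigram profile and the "out-of-place" distance described by Cavnar and Trenkle. It is not whatlang's actual implementation: the function names (trigram_counts, ranked_profile, out_of_place_distance), the space padding, the profile limit of 300 and the tiny sample "language profiles" are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count overlapping character trigrams in `text`.
/// Assumption: the text is lowercased and padded with one space on each side.
fn trigram_counts(text: &str) -> HashMap<String, usize> {
    let padded = format!(" {} ", text.to_lowercase());
    let chars: Vec<char> = padded.chars().collect();
    let mut counts = HashMap::new();
    for window in chars.windows(3) {
        let trigram: String = window.iter().collect();
        *counts.entry(trigram).or_insert(0) += 1;
    }
    counts
}

/// Rank trigrams from most to least frequent, keeping at most `limit` of them.
/// Ties are broken alphabetically so the result is deterministic.
fn ranked_profile(text: &str, limit: usize) -> Vec<String> {
    let mut entries: Vec<(String, usize)> = trigram_counts(text).into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries.into_iter().take(limit).map(|(t, _)| t).collect()
}

/// "Out-of-place" distance: the sum of rank differences between two profiles.
/// A trigram missing from the language profile gets the maximum penalty.
fn out_of_place_distance(text_profile: &[String], lang_profile: &[String]) -> usize {
    let max_penalty = lang_profile.len();
    text_profile
        .iter()
        .enumerate()
        .map(|(text_rank, trigram)| {
            match lang_profile.iter().position(|t| t == trigram) {
                Some(lang_rank) => text_rank.abs_diff(lang_rank),
                None => max_penalty,
            }
        })
        .sum()
}

fn main() {
    // Toy "language profiles" built from one sentence each (hypothetical data).
    let english = ranked_profile("the quick brown fox jumps over the lazy dog and the cat", 300);
    let spanish = ranked_profile("el rápido zorro marrón salta sobre el perro perezoso y el gato", 300);

    let sample = ranked_profile("the dog and the fox", 300);
    let d_en = out_of_place_distance(&sample, &english);
    let d_es = out_of_place_distance(&sample, &spanish);

    // The language with the smallest distance is the best candidate.
    println!("distance to english: {}, distance to spanish: {}", d_en, d_es);
}
```

In this sketch the language whose profile has the smallest distance to the text profile wins. Real profiles would be built from large corpora rather than single sentences.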

How is is_reliable calculated?

It is based on the following factors:

  • How many unique trigrams the given text contains.
  • How large the difference is between the first and the second (not returned) detected languages. This metric is called rate in the code base.

Therefore, it can be represented as a 2D space with a threshold function that splits it into "Reliable" and "Not reliable" areas. This function is a hyperbola, and it looks like the following:

[Figure: Whatlang is reliable]
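The hyperbolic threshold can be sketched roughly as follows. This is only an illustration of the shape of such a rule, not the code used by whatlang: the function name is_reliable_sketch, the constant HYPERBOLA_SCALE and the exact formula are hypothetical, and rate is assumed to be a non-negative number that grows with the gap between the two best candidates.

```rust
/// Hypothetical scale of the hyperbola; the real constants in whatlang may differ.
const HYPERBOLA_SCALE: f64 = 3.0;

/// Sketch of a reliability check over the 2D space (unique_trigrams, rate).
/// The boundary rate = HYPERBOLA_SCALE / unique_trigrams is a hyperbola:
/// the fewer unique trigrams the text has, the larger the rate must be.
fn is_reliable_sketch(unique_trigrams: usize, rate: f64) -> bool {
    if unique_trigrams == 0 {
        // No trigrams means there is no evidence at all.
        return false;
    }
    let threshold = HYPERBOLA_SCALE / unique_trigrams as f64;
    rate > threshold
}

fn main() {
    // Short text: the same rate falls below the threshold (3.0 / 5 = 0.6).
    println!("short text reliable: {}", is_reliable_sketch(5, 0.2));
    // Long text: the threshold drops to 3.0 / 100 = 0.03, so the rate is enough.
    println!("long text reliable: {}", is_reliable_sketch(100, 0.2));
}
```

The point of a hyperbolic boundary is that a long text needs only a small gap between the two best candidates, while a short text needs a much larger one.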

Running benchmarks

This is mostly useful to test performance optimizations.

cargo bench

Ports and clones

Derivation

Whatlang is a derivative work of Franc (JavaScript, MIT) by Titus Wormer.

License

MIT © Sergey Potapov

Contributors

  • greyblake Potapov Sergey - creator, maintainer.
  • Dr-Emann Zachary Dremann - optimization and improvements