langdetect-rs
Port of Mimino666's langdetect, which is itself a Python port of Nakatani Shuyo's Java language-detection library. Even this README is mostly a copy of Mimino666's.
Language identification library for Rust.
W.I.P.
- Benchmarking in terms of speed (via hyperfine?)
- Thread-safe API (do we need it though?)
Table of Contents
- Installation
- Supported Rust Versions
- Languages
- Example
- Language detection reproducibility
- Adding new languages
- How to train for new language?
- Original project
Installation
Add to your Cargo.toml:
langdetect-rs = "*"
or run
cargo add langdetect-rs
Supported Rust Versions
Tested on Rust 1.91.0 (rustc 1.91.0 (f8297e351 2025-10-28))
Languages
langdetect-rs supports 55 languages out of the box (ISO 639-1 codes):
af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he,
hi, hr, hu, id, it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl,
pt, ro, ru, sk, sl, so, sq, sv, sw, ta, te, th, tl, tr, uk, ur, vi, zh-cn,
zh-tw
Example
All examples are in the examples folder.
Using default detector
- Simple good-to-go example code in examples/simple/main.rs:
use DetectorFactory;
Custom detection factory
- Defining DetectorFactory from scratch for specific languages - ./examples/custom_profile/main.rs
use DetectorFactory;
use ;
use Path;
Language detection reproducibility
The language detection algorithm is non-deterministic: if you run it on a text that is too short or too ambiguous, you might get a different result on every run.
To enforce consistent results, set the seed on the detector before detection:
let factory_with_seed = DetectorFactory::default()
    .with_seed(...)
    .build();
match factory_with_seed.detect(...)
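To illustrate why a fixed seed makes results reproducible, here is a minimal self-contained sketch. It does not use langdetect-rs at all: `ToyRng` and `noisy_scores` are hypothetical names, and the xorshift generator only stands in for whatever random source the detector uses internally. The idea is simply that a randomized step driven by the same seed produces the same sequence, and therefore the same outcome.

```rust
/// Hypothetical stand-in for a detector's internal random source
/// (xorshift64; not necessarily the generator langdetect-rs uses).
struct ToyRng {
    state: u64,
}

impl ToyRng {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so remap a zero seed.
        ToyRng { state: if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed } }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Toy "randomized trials": perturbs a base score with seeded noise.
fn noisy_scores(seed: u64, trials: usize) -> Vec<f64> {
    let mut rng = ToyRng::new(seed);
    (0..trials).map(|_| 0.5 + (rng.next_f64() - 0.5) * 0.1).collect()
}

fn main() {
    // Same seed: identical sequence of perturbations on every run.
    assert_eq!(noisy_scores(42, 5), noisy_scores(42, 5));
    // Different seed: a different sequence.
    assert_ne!(noisy_scores(42, 5), noisy_scores(7, 5));
    println!("{:?}", noisy_scores(42, 5));
}
```

Without a fixed seed, each run would start the generator from a different state, which is why short or ambiguous inputs can flip between languages.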
Adding new languages
- How to add a language to an existing DetectorFactory (either default initialized or custom)?
  - The way add_profile works makes it impossible to add new language profiles to the factory unless you know the final size of the languages array in advance. E.g. if you initialized a custom factory with 5 languages and now want to add 2 more, you need to provide the langsize parameter as 7 when adding EACH new profile. Failing to do so will result in an error.
  - So you need to initialize the factory with all desired languages at once. If you want to add more languages to the default factory, create a new custom factory and add all default profiles from the profiles folder plus your new ones.
  - Helper function: Use DetectorFactory::get_default_profiles_path() to get the path to the default language profile files. This is useful when you want to load default profiles manually for extending the factory.
- For extending default profiles with your own generated ones, you may refer to this particular example and the section below in this document.
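The langsize constraint is easier to see in a toy model. The sketch below is not the crate's code: `ToyFactory` and its fields are hypothetical, and it assumes the factory keeps one probability vector of length langsize per n-gram, which would explain why the final number of languages must be known when the first profile is added.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of a detector factory (not the crate's real type).
#[derive(Default)]
struct ToyFactory {
    langs: Vec<String>,
    /// n-gram -> one probability slot per language.
    word_lang_prob: HashMap<String, Vec<f64>>,
}

impl ToyFactory {
    /// `index` is this language's slot; `langsize` is the FINAL number of languages.
    fn add_profile(
        &mut self,
        lang: &str,
        freq: &[(&str, f64)],
        index: usize,
        langsize: usize,
    ) -> Result<(), String> {
        if index >= langsize {
            return Err(format!("index {index} does not fit langsize {langsize}"));
        }
        if self.langs.iter().any(|l| l == lang) {
            return Err(format!("duplicate language: {lang}"));
        }
        // Every vector allocated so far must already have the final length.
        if let Some(v) = self.word_lang_prob.values().next() {
            if v.len() != langsize {
                return Err(format!("langsize {langsize} != existing size {}", v.len()));
            }
        }
        for (ngram, p) in freq {
            let slots = self
                .word_lang_prob
                .entry(ngram.to_string())
                .or_insert_with(|| vec![0.0; langsize]);
            slots[index] = *p;
        }
        self.langs.push(lang.to_string());
        Ok(())
    }
}

fn main() {
    let mut factory = ToyFactory::default();
    // Two languages planned in total, so langsize is 2 for EACH call.
    factory.add_profile("en", &[("th", 0.9)], 0, 2).unwrap();
    factory.add_profile("de", &[("ch", 0.8), ("th", 0.1)], 1, 2).unwrap();
    assert_eq!(factory.word_lang_prob["th"], vec![0.9, 0.1]);

    // Adding a third language later fails: existing vectors have only 2 slots.
    assert!(factory.add_profile("fr", &[("ou", 0.7)], 2, 3).is_err());
}
```

In this model the only way to get three languages is to rebuild the factory and pass langsize as 3 for every profile, which mirrors the advice above to initialize the factory with all desired languages at once.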
How to train for new language?
To add a new language, you need to create a language profile for it.
See scripts/README.md for instructions on scraping data for profile generation with the scrap_wiki.py script (located in the scripts folder) and then generating the profile with the generate_profiles.py script.
The idea was originally taken from the original Python library: https://github.com/Mimino666/langdetect?tab=readme-ov-file#how-to-add-new-language. A bit of searching around the web led me to this repository, on which the scripts are based.
Note: the scripts are written in Python. Your approach may vary: you could implement similar functionality in Rust or any other tool of your choice. The main purpose of this section is to give you a hint on where to start.
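As a starting point for a Rust implementation, the core of profile generation can be sketched as counting character n-grams. This is only a rough sketch under assumptions: it presumes a profile is essentially a table of 1- to 3-character n-gram counts plus per-length totals, as in the original langdetect profiles. `count_ngrams` is a hypothetical helper, and it skips the text cleaning and normalization a real generator such as generate_profiles.py would likely perform.

```rust
use std::collections::HashMap;

/// Counts character n-grams (n = 1..=3) in `text`.
/// Returns (frequency map, total number of n-grams per length).
fn count_ngrams(text: &str) -> (HashMap<String, u64>, [u64; 3]) {
    let chars: Vec<char> = text.chars().collect();
    let mut freq: HashMap<String, u64> = HashMap::new();
    let mut n_words = [0u64; 3];

    for n in 1..=3 {
        // `windows` yields every run of `n` consecutive characters.
        for window in chars.windows(n) {
            let gram: String = window.iter().collect();
            *freq.entry(gram).or_insert(0) += 1;
            n_words[n - 1] += 1;
        }
    }
    (freq, n_words)
}

fn main() {
    let (freq, n_words) = count_ngrams("banana");
    assert_eq!(freq["a"], 3);
    assert_eq!(freq["an"], 2);
    assert_eq!(freq["ana"], 2);
    assert_eq!(n_words, [6, 5, 4]);
    println!("{} distinct n-grams", freq.len());
}
```

Working on `char`s rather than bytes keeps multi-byte scripts (for example Cyrillic or Devanagari) intact, which matters for most of the supported languages.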
Original project
Presentation of the language detection algorithm (on which the original implementation is based): http://www.slideshare.net/shuyo/language-detection-library-for-java.