§langdetect-rs
A Rust port of the Python langdetect library (https://github.com/Mimino666/langdetect), which is itself a port of the Java language-detection library.
This crate provides automatic language identification using n-gram based text categorization. It supports 55 languages out of the box and allows for custom language profile loading.
§Features
- 55 built-in languages with prebuilt profiles (copied from the Python library)
- High accuracy on texts longer than roughly 20-50 characters; the original library's presentation reports 99.8% precision for 49 languages: https://www.slideshare.net/slideshow/language-detection-library-for-java/6014274
- Non-deterministic algorithm with optional seeding for reproducibility
- Extensible - add custom language profiles
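To illustrate why seeding makes a randomized detector reproducible, here is a minimal sketch that does not use this crate's API. It assumes the detector draws n-grams at random, as the Python original does; whether this port does the same internally is an assumption. `XorShift64` and `sample_indices` are hypothetical names introduced only for this example.

```rust
/// Minimal xorshift64 generator, for illustration only (not the crate's RNG).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // The xorshift state must never be zero, so substitute a constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Pick `count` pseudo-random indices into a list of `len` candidate n-grams.
fn sample_indices(seed: u64, len: usize, count: usize) -> Vec<usize> {
    if len == 0 {
        return Vec::new();
    }
    let mut rng = XorShift64::new(seed);
    (0..count)
        .map(|_| (rng.next_u64() % len as u64) as usize)
        .collect()
}

fn main() {
    // The same seed yields the same sample, hence the same detection result.
    let first = sample_indices(42, 100, 5);
    let second = sample_indices(42, 100, 5);
    assert_eq!(first, second);
    println!("sampled n-gram indices: {:?}", first);
}
```

Without a fixed seed, each run may sample a different subset of n-grams, which is why results on short or ambiguous texts can vary between runs.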
§Quick Start
use langdetect_rs::detector_factory::DetectorFactory;
let factory = DetectorFactory::default().build();
match factory.detect("Hello world! My name is Dima and I am a developer", None) {
    Ok(lang) => println!("Detected language: {}", lang),
    Err(e) => println!("Detection error: {:?}", e),
}
§Algorithm Overview
The library uses a Bayesian approach based on character n-gram frequency analysis, where n-grams are sequences of 1 to 3 characters. It employs an iterative expectation-maximization algorithm to estimate language probabilities.
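The general idea can be sketched in a few lines of standalone Rust. This is a simplified illustration, not the crate's implementation: it scores every n-gram once instead of iterating over random samples, and the profile frequencies, the `SMOOTHING` constant, and the names `ngrams`, `Profile`, and `score` are all made up for the example.

```rust
use std::collections::HashMap;

/// Extract all character n-grams of length 1 to 3 from `text`.
fn ngrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out: Vec<String> = Vec::new();
    for n in 1..=3 {
        // `windows` yields nothing when the text is shorter than `n`.
        for window in chars.windows(n) {
            out.push(window.iter().collect::<String>());
        }
    }
    out
}

/// Hypothetical toy profile: relative n-gram frequencies for one language.
type Profile = HashMap<&'static str, f64>;

/// Bayesian update: start from a uniform prior, multiply each language's
/// probability by P(n-gram | language), and renormalize after every n-gram.
fn score(text: &str, profiles: &[(&'static str, Profile)]) -> Vec<(&'static str, f64)> {
    const SMOOTHING: f64 = 1e-4; // assumed floor for n-grams missing from a profile
    let mut probs = vec![1.0 / profiles.len() as f64; profiles.len()];
    for gram in ngrams(text) {
        for (p, (_, profile)) in probs.iter_mut().zip(profiles) {
            *p *= profile.get(gram.as_str()).copied().unwrap_or(SMOOTHING);
        }
        let total: f64 = probs.iter().sum();
        for p in probs.iter_mut() {
            *p /= total;
        }
    }
    profiles.iter().map(|(name, _)| *name).zip(probs).collect()
}

fn main() {
    // Made-up frequencies, for illustration only.
    let en: Profile = HashMap::from([("t", 0.09), ("h", 0.06), ("e", 0.12), ("th", 0.03), ("the", 0.02)]);
    let de: Profile = HashMap::from([("d", 0.05), ("e", 0.16), ("r", 0.07), ("er", 0.04), ("der", 0.01)]);
    let profiles = vec![("en", en), ("de", de)];
    for (lang, prob) in score("the", &profiles) {
        println!("{lang}: {prob:.6}");
    }
}
```

Renormalizing after each n-gram keeps the running probabilities from underflowing to zero on longer inputs.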
§Modules
- detector_factory - Factory with language profiles for creating detectors
- detector - Core language detection logic
- language - Language probability data structure
- utils - Utility modules for profiles, n-grams, and Unicode handling
Modules§
- detector
- detector_factory
- language
- utils - Utility modules for language detection.