Whatlang
Natural language detection in Rust.
Features
- Supports more than 50 languages
- 100% written in Rust
- No external dependencies
- Super fast
- Recognizes not only the language but also the script (Latin, Cyrillic, etc.)
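To illustrate what recognizing a script can mean, here is a minimal sketch that counts characters by rough Unicode range and reports the most frequent script. The `ScriptGuess` enum and the `script_of` and `guess_script` functions are hypothetical names made up for this example. They are not part of whatlang, and the library's actual detection logic may differ.

```rust
/// Hypothetical script categories for this sketch (not whatlang's own types).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ScriptGuess {
    Latin,
    Cyrillic,
    Greek,
}

/// Classify one character using rough Unicode ranges.
/// Returns `None` for digits, punctuation, and scripts not covered here.
fn script_of(c: char) -> Option<ScriptGuess> {
    match c {
        'a'..='z' | 'A'..='Z' | '\u{00C0}'..='\u{024F}' => Some(ScriptGuess::Latin),
        '\u{0400}'..='\u{04FF}' => Some(ScriptGuess::Cyrillic),
        '\u{0370}'..='\u{03FF}' => Some(ScriptGuess::Greek),
        _ => None,
    }
}

/// Return the script that covers the most characters in `text`,
/// or `None` if no character belongs to a known script.
fn guess_script(text: &str) -> Option<ScriptGuess> {
    let (mut latin, mut cyrillic, mut greek) = (0usize, 0usize, 0usize);
    for c in text.chars() {
        match script_of(c) {
            Some(ScriptGuess::Latin) => latin += 1,
            Some(ScriptGuess::Cyrillic) => cyrillic += 1,
            Some(ScriptGuess::Greek) => greek += 1,
            None => {}
        }
    }
    let counts = [
        (ScriptGuess::Latin, latin),
        (ScriptGuess::Cyrillic, cyrillic),
        (ScriptGuess::Greek, greek),
    ];
    counts
        .iter()
        .copied()
        .filter(|&(_, n)| n > 0)
        .max_by_key(|&(_, n)| n)
        .map(|(script, _)| script)
}
```

Calling `guess_script` on a Cyrillic sentence yields `Some(ScriptGuess::Cyrillic)`, while a string of digits and punctuation yields `None`. Knowing the script first narrows the set of candidate languages before any finer comparison.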
Get started
The library is still in active development. Here is a short example of how to use it:
Add to your Cargo.toml:
[dependencies]
whatlang = "*"
In your program:
extern crate whatlang;
use whatlang::{Lang, Query};
Blacklist
You can blacklist undesired languages by passing a vector. In the example below, English and Spanish will be ignored:
// `text` stands for the input string to analyze
let list = vec![Lang::Eng, Lang::Spa];
let query = Query::new(text).blacklist(list);
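Conceptually, a blacklist removes some candidates before the best match is chosen. The sketch below shows that idea on a list of scored candidates. The local `Lang` enum, the `best_excluding` function, and the numeric scores are assumptions made for illustration. They do not reproduce whatlang's internals or its API.

```rust
/// A few languages for this sketch; a local stand-in, not whatlang's `Lang`.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Lang {
    Eng,
    Spa,
    Por,
}

/// Pick the best-scoring candidate whose language is not in `blacklist`.
/// `candidates` holds (language, score) pairs; a higher score is a better match.
/// Returns `None` when every candidate is blacklisted.
fn best_excluding(candidates: &[(Lang, u32)], blacklist: &[Lang]) -> Option<Lang> {
    candidates
        .iter()
        .filter(|(lang, _)| !blacklist.contains(lang))
        .max_by_key(|(_, score)| *score)
        .map(|(lang, _)| *lang)
}
```

With candidates `[(Lang::Eng, 90), (Lang::Spa, 70), (Lang::Por, 60)]` and a blacklist of English and Spanish, `best_excluding` falls through to `Some(Lang::Por)`.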
Whitelist
In a similar way, you can whitelist specific languages.
In this example, the library will recognize only Esperanto and Russian.
Note that if it detects a script other than Latin (Esperanto)
or Cyrillic (Russian), e.g. Greek, it will return None.
// `text` stands for the input string to analyze
let list = vec![Lang::Epo, Lang::Rus];
let query = Query::new(text).whitelist(list);
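The interaction between a whitelist and the detected script can be sketched as follows: only whitelisted languages written in the detected script remain candidates, and an empty candidate set corresponds to the `None` case described above. The local `Script` and `Lang` enums and the `script_of` and `allowed_for_script` functions are hypothetical stand-ins for this example, not the library's own definitions.

```rust
/// Local stand-ins for this sketch; not whatlang's own types.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Script {
    Latin,
    Cyrillic,
    Greek,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Lang {
    Epo,
    Rus,
    Ell,
}

/// The script each language in this sketch is written in.
fn script_of(lang: Lang) -> Script {
    match lang {
        Lang::Epo => Script::Latin,
        Lang::Rus => Script::Cyrillic,
        Lang::Ell => Script::Greek,
    }
}

/// Keep only whitelisted languages that use the detected script.
/// An empty result means no whitelisted language can match the text.
fn allowed_for_script(detected: Script, whitelist: &[Lang]) -> Vec<Lang> {
    whitelist
        .iter()
        .copied()
        .filter(|&lang| script_of(lang) == detected)
        .collect()
}
```

With a whitelist of Esperanto and Russian, a Latin text leaves only `Lang::Epo`, a Cyrillic text leaves only `Lang::Rus`, and a Greek text leaves nothing.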
Roadmap
- Support 100 most popular languages
- Allow specifying a blacklist for Query (done)
- Allow specifying a whitelist for Query (done)
- Improve README example
- Create demo application
- Provide some metrics about reliability in the Result struct
- Write docs for public structures and functions
- Tune performance
- Support syntax sugar:
// Get result
let result = Query::new(text).detect();
// Same with blacklist/whitelist, getting Lang directly
let lang = Query::new(text).blacklist(list).detect_lang();
Supported languages
| Language | ISO 639-3 | Enum |
|---|---|---|
| Esperanto | epo | Lang::Epo |
| English | eng | Lang::Eng |
| Russian | rus | Lang::Rus |
| Mandarin | cmn | Lang::Cmn |
| Spanish | spa | Lang::Spa |
| Portuguese | por | Lang::Por |
| Italian | ita | Lang::Ita |
| Bengali | ben | Lang::Ben |
| French | fra | Lang::Fra |
| German | deu | Lang::Deu |
| Ukrainian | ukr | Lang::Ukr |
| Georgian | kat | Lang::Kat |
| Arabic | arb | Lang::Arb |
| Hindi | hin | Lang::Hin |
| Japanese | jpn | Lang::Jpn |
| Hebrew | heb | Lang::Heb |
| Yiddish | ydd | Lang::Ydd |
| Polish | pol | Lang::Pol |
| Amharic | amh | Lang::Amh |
| Tigrinya | tir | Lang::Tir |
| Javanese | jav | Lang::Jav |
| Korean | kor | Lang::Kor |
| Bokmal | nob | Lang::Nob |
| Nynorsk | nno | Lang::Nno |
| Danish | dan | Lang::Dan |
| Swedish | swe | Lang::Swe |
| Finnish | fin | Lang::Fin |
| Turkish | tur | Lang::Tur |
| Dutch | nld | Lang::Nld |
| Hungarian | hun | Lang::Hun |
| Czech | ces | Lang::Ces |
| Greek | ell | Lang::Ell |
| Bulgarian | bul | Lang::Bul |
| Belarusian | bel | Lang::Bel |
| Marathi | mar | Lang::Mar |
| Kannada | kan | Lang::Kan |
| Romanian | ron | Lang::Ron |
| Slovene | slv | Lang::Slv |
| Croatian | hrv | Lang::Hrv |
| Serbian | srp | Lang::Srp |
| Macedonian | mkd | Lang::Mkd |
| Lithuanian | lit | Lang::Lit |
| Latvian | lav | Lang::Lav |
| Estonian | est | Lang::Est |
| Tamil | tam | Lang::Tam |
| Vietnamese | vie | Lang::Vie |
| Urdu | urd | Lang::Urd |
| Thai | tha | Lang::Tha |
| Gujarati | guj | Lang::Guj |
| Uzbek | uzb | Lang::Uzb |
| Punjabi | pan | Lang::Pan |
| Azerbaijani | azj | Lang::Azj |
| Indonesian | ind | Lang::Ind |
| Telugu | tel | Lang::Tel |
| Persian | pes | Lang::Pes |
| Malayalam | mal | Lang::Mal |
| Hausa | hau | Lang::Hau |
| Oriya | ori | Lang::Ori |
| Burmese | mya | Lang::Mya |
| Bhojpuri | bho | Lang::Bho |
| Tagalog | tgl | Lang::Tgl |
| Yoruba | yor | Lang::Yor |
| Maithili | mai | Lang::Mai |
License
MIT
Acknowledgments
- Thanks to Franc JS for the trigrams dataset.
Contributors
- greyblake (Potapov Sergey) - creator, maintainer.