/*
 * Copyright © 2020-present Peter M. Stahl pemistahl@gmail.com
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either expressed or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//! ## 1. What does this library do?
//!
//! Its task is simple: It tells you which language some text is written in.
//! This is very useful as a preprocessing step for linguistic data in natural language
//! processing applications such as text classification and spell checking.
//! Other use cases, for instance, might include routing e-mails to the right geographically
//! located customer service department, based on the e-mails' languages.
//!
//! ## 2. Why does this library exist?
//!
//! Language detection is often done as part of large machine learning frameworks or natural
//! language processing applications. In cases where you don't need the full-fledged
//! functionality of those systems or don't want to learn how to use them,
//! a small flexible library comes in handy.
//!
//! So far, other comprehensive open source libraries in the Rust ecosystem for
//! this task are [*CLD2*](https://github.com/emk/rust-cld2),
//! [*Whatlang*](https://github.com/greyblake/whatlang-rs) and
//! [*Whichlang*](https://github.com/quickwit-oss/whichlang).
//! Unfortunately, most of them have two major drawbacks:
//!
//! 1. Detection only works with quite lengthy text fragments. For very short text snippets
//!    such as Twitter messages, they do not provide adequate results.
//! 2. The more languages take part in the decision process, the less accurate the
//!    detection results are.
//!
//! *Lingua* aims at eliminating these problems. She needs almost no configuration and
//! yields pretty accurate results on both long and short text, even on single words and phrases.
//! She draws on both rule-based and statistical Naive Bayes methods but does not use neural networks
//! or any dictionaries of words. She does not need a connection to any external API or service either.
//! Once the library has been downloaded, it can be used completely offline.
//!
//! ## 3. Which languages are supported?
//!
//! Compared to other language detection libraries, *Lingua's* focus is on *quality over quantity*,
//! that is, getting detection right for a small set of languages first before adding new ones.
//! Currently, 75 languages are supported. They are listed as variants in the [Language] enum.
//!
//! ## 4. How good is it?
//!
//! *Lingua* is able to report accuracy statistics for some bundled test data available for each
//! supported language. The test data for each language is split into three parts:
//!
//! 1. a list of single words with a minimum length of 5 characters
//! 2. a list of word pairs with a minimum length of 10 characters
//! 3. a list of complete grammatical sentences of various lengths
//!
//! Both the language models and the test data have been created from separate documents of the
//! [Wortschatz corpora](https://wortschatz.uni-leipzig.de) offered by Leipzig University, Germany.
//! Data crawled from various news websites have been used for training, each corpus comprising one
//! million sentences. For testing, corpora made of arbitrarily chosen websites have been used,
//! each comprising ten thousand sentences. From each test corpus, a random unsorted subset of
//! 1000 single words, 1000 word pairs and 1000 sentences has been extracted, respectively.
//!
//! Given the generated test data, I have compared the detection results of *Lingua*, *CLD2*,
//! *Whatlang* and *Whichlang* running over the data of *Lingua's* supported 75 languages.
//! Languages that are not supported by the other classifiers are simply ignored for the
//! respective library during the detection process.
//!
//! The results of this comparison are available
//! [here](https://github.com/pemistahl/lingua-rs#4-how-accurate-is-it).
//!
//! ## 5. Why is it better than other libraries?
//!
//! Every language detector uses a probabilistic [n-gram](https://en.wikipedia.org/wiki/N-gram)
//! model trained on the character distribution in some training corpus. Most libraries only use
//! n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text
//! fragments consisting of multiple sentences. For short phrases or single words, however,
//! trigrams are not enough. The shorter the input text is, the fewer n-grams are available.
//! The probabilities estimated from so few n-grams are not reliable. This is why *Lingua* makes
//! use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the correct
//! language.
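//!
//! The effect of the n-gram size can be illustrated with a small, self-contained sketch. It is
//! not *Lingua's* implementation; the helper names `char_ngrams` and `ngram_counts` are made up
//! for this illustration and merely count how many character n-grams a text provides.
//!
//! ```rust
//! // Returns all character n-grams of length `n` contained in `text`.
//! fn char_ngrams(text: &str, n: usize) -> Vec<String> {
//!     // Work on `char`s rather than bytes so that multi-byte characters stay intact.
//!     let chars: Vec<char> = text.chars().collect();
//!     if n == 0 || chars.len() < n {
//!         return Vec::new();
//!     }
//!     chars.windows(n).map(|window| window.iter().collect()).collect()
//! }
//!
//! // Counts the n-grams that are available for every size from 1 to 5.
//! fn ngram_counts(text: &str) -> Vec<(usize, usize)> {
//!     (1..=5).map(|n| (n, char_ngrams(text, n).len())).collect()
//! }
//! ```
//!
//! For the five-letter word `hallo`, this yields only three trigrams but fifteen n-grams when
//! sizes 1 to 5 are combined.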
//!
//! A second important difference is that *Lingua* does not only use such a statistical model, but
//! also a rule-based engine. This engine first determines the alphabet of the input text and
//! searches for characters which are unique to one or more languages. If exactly one language can
//! be reliably chosen this way, the statistical model is not necessary anymore. In any case, the
//! rule-based engine filters out languages that do not satisfy the conditions of the input text.
//! Only then, in a second step, is the probabilistic n-gram model taken into consideration.
//! This makes sense because loading fewer language models means less memory consumption and better
//! runtime performance.
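//!
//! The filtering idea can be sketched in a few lines. This is a deliberately simplified
//! illustration and not the engine used by *Lingua*: the `Script` enum, the Unicode ranges and
//! the functions `detect_script` and `filter_candidates` are assumptions made for this example.
//!
//! ```rust
//! // Hypothetical script classes used only for this sketch.
//! #[derive(Debug, Clone, Copy, PartialEq)]
//! enum Script {
//!     Latin,
//!     Cyrillic,
//! }
//!
//! // Returns the script shared by all alphabetic characters of `text`, if there is exactly one.
//! fn detect_script(text: &str) -> Option<Script> {
//!     let mut detected = None;
//!     for ch in text.chars().filter(|c| c.is_alphabetic()) {
//!         let script = match ch {
//!             // Basic Latin letters plus Latin-1 Supplement and Latin Extended-A/B.
//!             'A'..='Z' | 'a'..='z' | '\u{00C0}'..='\u{024F}' => Script::Latin,
//!             // The main Cyrillic block.
//!             '\u{0400}'..='\u{04FF}' => Script::Cyrillic,
//!             _ => return None,
//!         };
//!         match detected {
//!             None => detected = Some(script),
//!             Some(previous) if previous != script => return None,
//!             _ => {}
//!         }
//!     }
//!     detected
//! }
//!
//! // Keeps only those candidate languages that are written in the detected script.
//! fn filter_candidates<'a>(text: &str, candidates: &[(&'a str, Script)]) -> Vec<&'a str> {
//!     match detect_script(text) {
//!         Some(script) => candidates
//!             .iter()
//!             .filter(|(_, s)| *s == script)
//!             .map(|(name, _)| *name)
//!             .collect(),
//!         // Unknown or mixed script: leave the decision to the statistical model.
//!         None => candidates.iter().map(|(name, _)| *name).collect(),
//!     }
//! }
//! ```
//!
//! In this sketch, a purely Cyrillic input removes all Latin-script candidates before any
//! statistical model would have to be consulted.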
//!
//! In general, it is always a good idea to restrict the set of languages to be considered in the
//! classification process using the respective API methods. If you know beforehand that certain
//! languages are never to occur in an input text, do not let those take part in the classification
//! process. The filtering mechanism of the rule-based engine is quite good; however, filtering
//! based on your own knowledge of the input text is always preferable.
//!
//! Even when taking all language models into account, the library uses only a few dozen megabytes
//! of memory during runtime. This is because the models are stored as finite-state transducers (FSTs).
//! FSTs can be searched on disk without actually reading them entirely into memory, making the
//! library suitable for low-resource environments.
//!
//! ## 6. How to add it to your project?
//!
//! Add *Lingua* to your `Cargo.toml` file like so:
//!
//! ```toml
//! [dependencies]
//! lingua = "1.8.0"
//! ```
//!
//! By default, this will download the language model dependencies for all 75 supported languages,
//! a total of approximately 300 MB. If your bandwidth or hard drive space is limited, or you simply
//! do not need all languages, you can specify a subset of the language models to be downloaded as
//! separate features in your `Cargo.toml`:
//!
//! ```toml
//! [dependencies]
//! lingua = { version = "1.8.0", default-features = false, features = ["french", "italian", "spanish"] }
//! ```
//!
//! ## 7. How to use?
//!
//! ### 7.1 Basic usage
//!
//! ```
//! use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
//! use lingua::Language::{English, French, German, Spanish};
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let detected_language: Option<Language> = detector.detect_language_of("languages are awesome");
//!
//! assert_eq!(detected_language, Some(English));
//! ```
//!
//! The entire library is thread-safe, i.e. you can use a single `LanguageDetector` instance and
//! its methods in multiple threads. Multiple instances of `LanguageDetector` share thread-safe
//! access to the language models, so every language model is loaded into memory just once, no
//! matter how many instances of `LanguageDetector` have been created.
//!
//! ### 7.2 Minimum relative distance
//!
//! By default, *Lingua* returns the most likely language for a given input text. However, there are
//! certain words that are spelled the same in more than one language. The word *prologue*, for
//! instance, is both a valid English and French word. *Lingua* would output either English or
//! French which might be wrong in the given context. For cases like that, it is possible to
//! specify a minimum relative distance that the logarithmized and summed up probabilities for
//! each possible language have to satisfy. It can be stated in the following way:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//! use lingua::Language::{English, French, German, Spanish};
//!
//! let detector = LanguageDetectorBuilder::from_languages(&[English, French, German, Spanish])
//!     .with_minimum_relative_distance(0.9)
//!     .build();
//! let detected_language = detector.detect_language_of("languages are awesome");
//!
//! assert_eq!(detected_language, None);
//! ```
//!
//! Be aware that the distance between the language probabilities is dependent on the length of the
//! input text. The longer the input text, the larger the distance between the languages. So if you
//! want to classify very short text phrases, do not set the minimum relative distance too high.
//! Otherwise [`None`](https://doc.rust-lang.org/std/option/enum.Option.html#variant.None) will be
//! returned most of the time as in the example above. This is the return value for cases where
//! language detection is not reliably possible.
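//!
//! How such a threshold acts on a result can be sketched as follows. The sketch assumes that the
//! distance is measured as the difference between the two highest confidence values; this is an
//! assumption made for illustration only, and `pick_language` is a hypothetical helper that is
//! not part of the API.
//!
//! ```rust
//! // Returns the most likely language unless the runner-up is too close to it.
//! fn pick_language<'a>(
//!     confidences: &[(&'a str, f64)],
//!     minimum_relative_distance: f64,
//! ) -> Option<&'a str> {
//!     // Sort a copy of the candidates by confidence in descending order.
//!     let mut sorted = confidences.to_vec();
//!     sorted.sort_by(|a, b| b.1.total_cmp(&a.1));
//!
//!     match sorted.as_slice() {
//!         [] => None,
//!         [(only, _)] => Some(*only),
//!         [(best, best_value), (_, second_value), ..] => {
//!             if best_value - second_value < minimum_relative_distance {
//!                 // The two best candidates are too close: no reliable decision.
//!                 None
//!             } else {
//!                 Some(*best)
//!             }
//!         }
//!     }
//! }
//! ```
//!
//! With confidence values of `0.93` and `0.04` for the two best candidates, a threshold of `0.9`
//! makes this function return `None`, whereas a threshold of `0.5` lets the best candidate through.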
//!
//! ### 7.3 Confidence values
//!
//! Knowing about the most likely language is nice but how reliable is the computed likelihood?
//! And how much less likely are the other examined languages in comparison to the most likely one?
//! These questions can be answered as well:
//!
//! ```
//! use lingua::Language::{English, French, German, Spanish};
//! use lingua::{Language, LanguageDetectorBuilder};
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let confidence_values: Vec<(Language, f64)> = detector
//!     .compute_language_confidence_values("languages are awesome")
//!     .into_iter()
//!     // Let's round the values to two decimal places for easier assertions
//!     .map(|(language, confidence)| (language, (confidence * 100.0).round() / 100.0))
//!     .collect();
//!
//! assert_eq!(
//!     confidence_values,
//!     vec![(English, 0.93), (French, 0.04), (German, 0.02), (Spanish, 0.01)]
//! );
//! ```
//!
//! In the example above, a vector of two-element tuples is returned containing all possible
//! languages sorted by their confidence value in descending order. Each value is a probability
//! between 0.0 and 1.0. The probabilities of all languages will sum to 1.0. If the language is
//! unambiguously identified by the rule engine, the value 1.0 will always be returned for this
//! language. The other languages will receive a value of 0.0.
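//!
//! One common way to turn summed log-probabilities into values that lie between 0.0 and 1.0 and
//! sum to 1.0 is a softmax-style normalization. The following sketch shows that technique under
//! the assumption that it is applicable here; it does not claim to reproduce *Lingua's* internal
//! computation, and `to_confidences` is a hypothetical helper.
//!
//! ```rust
//! // Turns summed log-probabilities into confidence values that sum to 1.0,
//! // sorted in descending order.
//! fn to_confidences<'a>(log_probs: &[(&'a str, f64)]) -> Vec<(&'a str, f64)> {
//!     // Subtract the maximum before exponentiating to avoid numerical underflow.
//!     let max = log_probs
//!         .iter()
//!         .map(|(_, lp)| *lp)
//!         .fold(f64::NEG_INFINITY, f64::max);
//!     let exps: Vec<(&str, f64)> = log_probs
//!         .iter()
//!         .map(|(language, lp)| (*language, (lp - max).exp()))
//!         .collect();
//!     let total: f64 = exps.iter().map(|(_, e)| e).sum();
//!
//!     let mut result: Vec<(&str, f64)> = exps
//!         .into_iter()
//!         .map(|(language, e)| (language, e / total))
//!         .collect();
//!     result.sort_by(|a, b| b.1.total_cmp(&a.1));
//!     result
//! }
//! ```
//!
//! The language with the largest summed log-probability ends up first, and the differences
//! between the log-probabilities determine how the remaining probability mass is distributed.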
//!
//! There is also a method for returning the confidence value for one specific language only:
//!
//! ```
//! use lingua::Language::{English, French, German, Spanish};
//! use lingua::LanguageDetectorBuilder;
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let confidence = detector.compute_language_confidence("languages are awesome", French);
//! let rounded_confidence = (confidence * 100.0).round() / 100.0;
//!
//! assert_eq!(rounded_confidence, 0.04);
//! ```
//!
//! The value that this method computes is a number between 0.0 and 1.0.
//! If the language is unambiguously identified by the rule engine, the value
//! 1.0 will always be returned. If the given language is not supported by
//! this detector instance, the value 0.0 will always be returned.
//!
//! ### 7.4 Eager loading versus lazy loading
//!
//! By default, *Lingua* uses lazy-loading to load only those language models on demand which are
//! considered relevant by the rule-based filter engine. For web services, for instance, it is
//! rather beneficial to preload all language models into memory to avoid unexpected latency while
//! waiting for the service response. If you want to enable the eager-loading mode, you can do it
//! like this:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//!
//! LanguageDetectorBuilder::from_all_languages().with_preloaded_language_models().build();
//! ```
//!
//! Multiple instances of `LanguageDetector` share the same language models in memory which are
//! accessed asynchronously by the instances.
//!
//! ### 7.5 Low accuracy mode versus high accuracy mode
//!
//! *Lingua's* high detection accuracy comes at the cost of being noticeably slower
//! than other language detectors. This trade-off might not be acceptable for systems running low
//! on resources. If you want to classify mostly long texts or need to save resources,
//! you can enable a *low accuracy mode* that loads only a small subset of the language
//! models into memory:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//!
//! LanguageDetectorBuilder::from_all_languages().with_low_accuracy_mode().build();
//! ```
//!
//! The downside of this approach is that detection accuracy for short texts consisting
//! of fewer than 120 characters will drop significantly. However, detection accuracy for
//! texts which are longer than 120 characters will remain mostly unaffected.
//!
//! An alternative for faster performance is to reduce the set
//! of languages when building the language detector. In most cases, it is not advisable to
//! build the detector from all supported languages. When you have knowledge about
//! the texts you want to classify, you can almost always rule out certain languages as impossible
//! or unlikely to occur.
//!
//! ### 7.6 Single-language mode
//!
//! If you build a `LanguageDetector` from one language only, it will operate in single-language mode.
//! This means the detector will try to find out whether a given text has been written in the given language or not.
//! If not, then `None` will be returned, otherwise the given language.
//!
//! In single-language mode, the detector decides based on a set of unique and most common n-grams which
//! have been collected beforehand for every supported language. It turns out that unique and most common
//! n-grams help to improve accuracy in low accuracy mode, so they are used for that mode as well. In high
//! accuracy mode, however, they do not make a significant difference, which is why they are left out.
//!
//! ### 7.7 Detection of multiple languages in mixed-language texts
//!
//! In contrast to most other language detectors, *Lingua* is able to detect multiple languages
//! in mixed-language texts. This feature can yield quite reasonable results, but it is still
//! in an experimental state and therefore the detection result is highly dependent on the input
//! text. It works best in high-accuracy mode with multiple long words for each language.
//! The shorter the phrases and their words are, the less accurate the results are. Reducing the
//! set of languages when building the language detector can also improve accuracy for this task
//! if the languages occurring in the text match the languages supported by the respective
//! language detector instance.
//!
//! ```
//! use lingua::DetectionResult;
//! use lingua::Language::{English, French, German};
//! use lingua::LanguageDetectorBuilder;
//!
//! let languages = vec![English, French, German];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let sentence = "Parlez-vous français? \
//!     Ich spreche Französisch nur ein bisschen. \
//!     A little bit is better than nothing.";
//!
//! let results: Vec<DetectionResult> = detector.detect_multiple_languages_of(sentence);
//!
//! if let [first, second, third] = &results[..] {
//!     assert_eq!(first.language(), French);
//!     assert_eq!(
//!         &sentence[first.start_index()..first.end_index()],
//!         "Parlez-vous français? "
//!     );
//!
//!     assert_eq!(second.language(), German);
//!     assert_eq!(
//!         &sentence[second.start_index()..second.end_index()],
//!         "Ich spreche Französisch nur ein bisschen. "
//!     );
//!
//!     assert_eq!(third.language(), English);
//!     assert_eq!(
//!         &sentence[third.start_index()..third.end_index()],
//!         "A little bit is better than nothing."
//!     );
//! }
//! ```
//!
//! In the example above, a vector of [DetectionResult] is returned. Each entry in the vector
//! describes a contiguous single-language text section, providing start and end indices of the
//! respective substring.
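//!
//! Because the example slices the `&str` directly with these indices, they are byte offsets,
//! which differ from character positions as soon as the text contains multi-byte characters
//! such as `ç`. If character positions are needed, a conversion along the following lines could
//! be used. It is a minimal sketch, and `byte_range_to_char_range` is a hypothetical helper, not
//! a function offered by the library.
//!
//! ```rust
//! // Converts a byte range of `text` into the corresponding character range.
//! fn byte_range_to_char_range(text: &str, start: usize, end: usize) -> (usize, usize) {
//!     // Slicing panics if `start` or `end` does not lie on a character boundary.
//!     let chars_before = text[..start].chars().count();
//!     let chars_inside = text[start..end].chars().count();
//!     (chars_before, chars_before + chars_inside)
//! }
//! ```
//!
//! For the text `"Parlez-vous français? Ich spreche"`, the byte range `23..34` covers
//! `"Ich spreche"` and corresponds to the character range `22..33`, because `ç` occupies two bytes.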
//!
//! ### 7.8 Single-threaded versus multi-threaded language detection
//!
//! The `LanguageDetector` methods explained above all operate in a single thread.
//! If you want to classify a very large set of texts, you will probably want to
//! use all available CPU cores efficiently in multiple threads for maximum performance.
//!
//! Every single-threaded method has a multi-threaded equivalent that accepts a list of texts
//! and returns a list of results.
//!
//! | Single-threaded                      | Multi-threaded                                   |
//! |--------------------------------------|--------------------------------------------------|
//! | `detect_language_of`                 | `detect_languages_in_parallel_of`                |
//! | `detect_multiple_languages_of`       | `detect_multiple_languages_in_parallel_of`       |
//! | `compute_language_confidence_values` | `compute_language_confidence_values_in_parallel` |
//! | `compute_language_confidence`        | `compute_language_confidence_in_parallel`        |
//!
//! ### 7.9 Methods to build the LanguageDetector
//!
//! There might be classification tasks where you know beforehand that your language data is
//! definitely not written in Latin, for instance (what a surprise :-). The detection accuracy can
//! become better in such cases if you exclude certain languages from the decision process or just
//! explicitly include relevant languages:
//!
//! ```
//! use lingua::{LanguageDetectorBuilder, Language, IsoCode639_1, IsoCode639_3};
//!
//! // Include all languages available in the library.
//! LanguageDetectorBuilder::from_all_languages();
//!
//! // Include only languages that are not yet extinct (= currently excludes Latin).
//! LanguageDetectorBuilder::from_all_spoken_languages();
//!
//! // Include only languages written with Cyrillic script.
//! LanguageDetectorBuilder::from_all_languages_with_cyrillic_script();
//!
//! // Exclude only the Spanish language from the decision algorithm.
//! LanguageDetectorBuilder::from_all_languages_without(&[Language::Spanish]);
//!
//! // Only decide between English and German.
//! LanguageDetectorBuilder::from_languages(&[Language::English, Language::German]);
//!
//! // Select languages by ISO 639-1 code.
//! LanguageDetectorBuilder::from_iso_codes_639_1(&[IsoCode639_1::EN, IsoCode639_1::DE]);
//!
//! // Select languages by ISO 639-3 code.
//! LanguageDetectorBuilder::from_iso_codes_639_3(&[IsoCode639_3::ENG, IsoCode639_3::DEU]);
//! ```
//!
//! ## 8. WebAssembly support
//!
//! This library can be compiled to [WebAssembly (WASM)](https://webassembly.org), which allows
//! *Lingua* to be used in any JavaScript-based project, be it in the browser or in the back end
//! running on [Node.js](https://nodejs.org).
//!
//! The easiest way to compile is to use [`wasm-pack`](https://rustwasm.github.io/wasm-pack).
//! After the installation, you can, for instance, build the library with the web target so that it
//! can be directly used in the browser:
//!
//! ```shell
//! wasm-pack build --target web
//! ```
//!
//! By default, all 75 supported languages are included in the compiled wasm file, which has a size
//! of approximately 288 MB. If you only need a subset of certain languages, you can tell `wasm-pack`
//! which ones to include:
//!
//! ```shell
//! wasm-pack build --target web -- --no-default-features --features "french,italian,spanish"
//! ```
//!
//! The output of `wasm-pack` will be hosted in a
//! [separate repository](https://github.com/pemistahl/lingua-js), which allows adding further
//! JavaScript-related configuration, tests and documentation. *Lingua* will then be added to the
//! [npm registry](https://www.npmjs.com) as well, allowing for an easy download and installation
//! within every JavaScript or TypeScript project.

#[macro_use]
extern crate maplit;

pub use builder::LanguageDetectorBuilder;
pub use detector::LanguageDetector;
pub use isocode::{IsoCode639_1, IsoCode639_3};
pub use language::Language;
pub use result::DetectionResult;
#[cfg(target_family = "wasm")]
pub use wasm::{
    ConfidenceValue, DetectionResult as WasmDetectionResult,
    LanguageDetectorBuilder as WasmLanguageDetectorBuilder,
};
pub use writer::{
    LanguageModelFilesWriter, MostCommonNgramsWriter, TestDataFilesWriter, UniqueNgramsWriter,
};

mod alphabet;
mod builder;
mod constant;
mod detector;
mod file;
mod isocode;
mod language;
mod model;
mod ngram;
mod result;
mod script;
mod writer;

#[cfg(feature = "python")]
mod python;

#[cfg(target_family = "wasm")]
mod wasm;

// Converts the byte-based indices of consecutive detection results into
// character-based indices. The first result is assumed to start at character 0,
// and every following result starts where the previous one ended.
#[cfg(any(target_family = "wasm", feature = "python"))]
pub(crate) fn convert_byte_indices_to_char_indices(
    results: &[DetectionResult],
    text: &str,
) -> Vec<DetectionResult> {
    let mut converted_results: Vec<DetectionResult> = Vec::with_capacity(results.len());
    // Character index at which the next converted result starts.
    let mut start_index = 0;

    for result in results {
        // Number of characters covered by the byte range of the current result.
        let chars_count = text[result.start_index..result.end_index].chars().count();
        let end_index = start_index + chars_count;
        converted_results.push(DetectionResult {
            start_index,
            end_index,
            word_count: result.word_count,
            language: result.language,
        });
        start_index = end_index;
    }

    converted_results
}