/*
 * Copyright © 2020-present Peter M. Stahl pemistahl@gmail.com
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

//! ## 1. What does this library do?
//!
//! Its task is simple: it tells you which language some text is written in.
//! This is very useful as a preprocessing step for linguistic data in natural language
//! processing applications such as text classification and spell checking.
//! Other use cases might include routing e-mails to the geographically appropriate
//! customer service department, based on the e-mails' languages.
//!
//! ## 2. Why does this library exist?
//!
//! Language detection is often done as part of large machine learning frameworks or natural
//! language processing applications. In cases where you don't need the full-fledged
//! functionality of those systems or don't want to learn the ropes of those,
//! a small flexible library comes in handy.
//!
//! So far, other comprehensive open source libraries in the Rust ecosystem for
//! this task are [*CLD2*](https://github.com/emk/rust-cld2),
//! [*Whatlang*](https://github.com/greyblake/whatlang-rs) and
//! [*Whichlang*](https://github.com/quickwit-oss/whichlang).
//! Unfortunately, most of them have two major drawbacks:
//!
//! 1. Detection only works with quite lengthy text fragments. For very short text snippets
//!    such as Twitter messages, they do not provide adequate results.
//! 2. The more languages take part in the decision process, the less accurate the detection
//!    results become.
//!
//! *Lingua* aims at eliminating these problems. She requires almost no configuration and
//! yields pretty accurate results on both long and short text, even on single words and phrases.
//! She draws on both rule-based and statistical Naive Bayes methods but does not use neural
//! networks or any dictionaries of words. She does not need a connection to any external API or
//! service either. Once the library has been downloaded, it can be used completely offline.
//!
//! ## 3. Which languages are supported?
//!
//! Compared to other language detection libraries, *Lingua's* focus is on *quality over quantity*,
//! that is, getting detection right for a small set of languages first before adding new ones.
//! Currently, 75 languages are supported. They are listed as variants in the [Language] enum.
//!
//! ## 4. How good is it?
//!
//! *Lingua* is able to report accuracy statistics for some bundled test data available for each
//! supported language. The test data for each language is split into three parts:
//!
//! 1. a list of single words with a minimum length of 5 characters
//! 2. a list of word pairs with a minimum length of 10 characters
//! 3. a list of complete grammatical sentences of various lengths
//!
//! Both the language models and the test data have been created from separate documents of the
//! [Wortschatz corpora](https://wortschatz.uni-leipzig.de) offered by Leipzig University, Germany.
//! Data crawled from various news websites have been used for training, each corpus comprising one
//! million sentences. For testing, corpora made of arbitrarily chosen websites have been used,
//! each comprising ten thousand sentences. From each test corpus, a random unsorted subset of
//! 1000 single words, 1000 word pairs and 1000 sentences has been extracted.
//!
//! Given the generated test data, I have compared the detection results of *Lingua*, *CLD2*,
//! *Whatlang* and *Whichlang* running over the data of the 75 languages supported by *Lingua*.
//! Languages that are not supported by the other classifiers are simply ignored for the
//! respective library during the detection process.
//!
//! The results of this comparison are available
//! [here](https://github.com/pemistahl/lingua-rs#4-how-accurate-is-it).
//!
//! ## 5. Why is it better than other libraries?
//!
//! Every language detector uses a probabilistic [n-gram](https://en.wikipedia.org/wiki/N-gram)
//! model trained on the character distribution in some training corpus. Most libraries only use
//! n-grams of size 3 (trigrams), which is satisfactory for detecting the language of longer text
//! fragments consisting of multiple sentences. For short phrases or single words, however,
//! trigrams are not enough. The shorter the input text is, the fewer n-grams are available.
//! The probabilities estimated from so few n-grams are not reliable. This is why *Lingua* makes
//! use of n-grams of sizes 1 up to 5, which results in much more accurate prediction of the
//! correct language.
//!
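//! For illustration, here is a standalone sketch (not part of the library API) that extracts
//! character n-grams of a given size from a word:
//!
//! ```
//! fn ngrams(text: &str, n: usize) -> Vec<String> {
//!     let chars: Vec<char> = text.chars().collect();
//!     // Each window of n consecutive characters forms one n-gram.
//!     chars.windows(n).map(|window| window.iter().collect()).collect()
//! }
//!
//! assert_eq!(ngrams("cats", 1), vec!["c", "a", "t", "s"]);
//! assert_eq!(ngrams("cats", 2), vec!["ca", "at", "ts"]);
//! assert_eq!(ngrams("cats", 3), vec!["cat", "ats"]);
//! ```
//!
//! A four-character word yields four unigrams but only two trigrams, which illustrates why
//! higher-order n-grams alone provide little evidence for short inputs.
//!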
//! A second important difference is that *Lingua* does not only use such a statistical model, but
//! also a rule-based engine. This engine first determines the alphabet of the input text and
//! searches for characters which are unique in one or more languages. If exactly one language can
//! be reliably chosen this way, the statistical model is not necessary anymore. In any case, the
//! rule-based engine filters out languages that do not satisfy the conditions of the input text.
//! Only then, in a second step, the probabilistic n-gram model is taken into consideration.
//! This makes sense because loading fewer language models means less memory consumption and better
//! runtime performance.
//!
//! In general, it is always a good idea to restrict the set of languages to be considered in the
//! classification process using the respective API methods. If you know beforehand that certain
//! languages can never occur in an input text, do not let those take part in the classification
//! process. The filtering mechanism of the rule-based engine is quite good; however, filtering
//! based on your own knowledge of the input text is always preferable.
//!
//! ## 6. How to add it to your project?
//!
//! Add *Lingua* to your `Cargo.toml` file like so:
//!
//! ```toml
//! [dependencies]
//! lingua = "1.7.1"
//! ```
//!
//! By default, this will download the language model dependencies for all 75 supported languages,
//! a total of approximately 110 MB. If your bandwidth or hard drive space is limited, or you simply
//! do not need all languages, you can specify a subset of the language models to be downloaded as
//! separate features in your `Cargo.toml`:
//!
//! ```toml
//! [dependencies]
//! lingua = { version = "1.7.1", default-features = false, features = ["french", "italian", "spanish"] }
//! ```
//!
//! ## 7. How to use?
//!
//! ### 7.1 Basic usage
//!
//! ```
//! use lingua::{Language, LanguageDetector, LanguageDetectorBuilder};
//! use lingua::Language::{English, French, German, Spanish};
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector: LanguageDetector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let detected_language: Option<Language> = detector.detect_language_of("languages are awesome");
//!
//! assert_eq!(detected_language, Some(English));
//! ```
//!
//! The entire library is thread-safe, i.e. you can use a single `LanguageDetector` instance and
//! its methods in multiple threads. Multiple instances of `LanguageDetector` share thread-safe
//! access to the language models, so every language model is loaded into memory just once, no
//! matter how many instances of `LanguageDetector` have been created.
//!
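//! For example, one instance can be shared across threads behind an [`Arc`](std::sync::Arc).
//! The following is a minimal sketch; the exact detection results depend on the loaded
//! language models:
//!
//! ```no_run
//! use std::sync::Arc;
//! use std::thread;
//! use lingua::Language::{English, German};
//! use lingua::LanguageDetectorBuilder;
//!
//! let detector = Arc::new(
//!     LanguageDetectorBuilder::from_languages(&[English, German]).build(),
//! );
//!
//! let handles: Vec<_> = ["good morning", "guten Morgen"]
//!     .into_iter()
//!     .map(|text| {
//!         let detector = Arc::clone(&detector);
//!         // All threads share the same detector instance and language models.
//!         thread::spawn(move || detector.detect_language_of(text))
//!     })
//!     .collect();
//!
//! for handle in handles {
//!     let _detected_language = handle.join().unwrap();
//! }
//! ```
//!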
//! ### 7.2 Minimum relative distance
//!
//! By default, *Lingua* returns the most likely language for a given input text. However, there are
//! certain words that are spelled the same in more than one language. The word *prologue*, for
//! instance, is both a valid English and French word. *Lingua* would output either English or
//! French which might be wrong in the given context. For cases like that, it is possible to
//! specify a minimum relative distance that the logarithmized and summed up probabilities for
//! each possible language have to satisfy. It can be stated in the following way:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//! use lingua::Language::{English, French, German, Spanish};
//!
//! let detector = LanguageDetectorBuilder::from_languages(&[English, French, German, Spanish])
//!     .with_minimum_relative_distance(0.9)
//!     .build();
//! let detected_language = detector.detect_language_of("languages are awesome");
//!
//! assert_eq!(detected_language, None);
//! ```
//!
//! Be aware that the distance between the language probabilities is dependent on the length of the
//! input text. The longer the input text, the larger the distance between the languages. So if you
//! want to classify very short text phrases, do not set the minimum relative distance too high.
//! Otherwise [`None`](https://doc.rust-lang.org/std/option/enum.Option.html#variant.None) will be
//! returned most of the time as in the example above. This is the return value for cases where
//! language detection is not reliably possible.
//!
//! ### 7.3 Confidence values
//!
//! Knowing about the most likely language is nice but how reliable is the computed likelihood?
//! And how much less likely are the other examined languages in comparison to the most likely one?
//! These questions can be answered as well:
//!
//! ```
//! use lingua::Language::{English, French, German, Spanish};
//! use lingua::{Language, LanguageDetectorBuilder};
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let confidence_values: Vec<(Language, f64)> = detector
//!     .compute_language_confidence_values("languages are awesome")
//!     .into_iter()
//!     // Let's round the values to two decimal places for easier assertions
//!     .map(|(language, confidence)| (language, (confidence * 100.0).round() / 100.0))
//!     .collect();
//!
//! assert_eq!(
//!     confidence_values,
//!     vec![(English, 0.93), (French, 0.04), (German, 0.02), (Spanish, 0.01)]
//! );
//! ```
//!
//! In the example above, a vector of two-element tuples is returned containing all possible
//! languages sorted by their confidence value in descending order. Each value is a probability
//! between 0.0 and 1.0. The probabilities of all languages sum to 1.0. If the language is
//! unambiguously identified by the rule engine, the value 1.0 will always be returned for this
//! language. The other languages will receive a value of 0.0.
//!
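//! For instance, a text written in Greek script is unambiguously resolved by the rule engine
//! when Greek is the only candidate language using that script. A minimal sketch of this case,
//! assuming the Greek and English models are available:
//!
//! ```no_run
//! use lingua::Language::{English, Greek};
//! use lingua::LanguageDetectorBuilder;
//!
//! let detector = LanguageDetectorBuilder::from_languages(&[English, Greek]).build();
//!
//! // The Greek alphabet is unique among the two candidates, so the rule engine
//! // assigns 1.0 to Greek and 0.0 to English.
//! let confidence_values = detector.compute_language_confidence_values("Καλημέρα");
//! assert_eq!(confidence_values, vec![(Greek, 1.0), (English, 0.0)]);
//! ```
//!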
//! There is also a method for returning the confidence value for one specific language only:
//!
//! ```
//! use lingua::Language::{English, French, German, Spanish};
//! use lingua::LanguageDetectorBuilder;
//!
//! let languages = vec![English, French, German, Spanish];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let confidence = detector.compute_language_confidence("languages are awesome", French);
//! let rounded_confidence = (confidence * 100.0).round() / 100.0;
//!
//! assert_eq!(rounded_confidence, 0.04);
//! ```
//!
//! The value that this method computes is a number between 0.0 and 1.0.
//! If the language is unambiguously identified by the rule engine, the value
//! 1.0 will always be returned. If the given language is not supported by
//! this detector instance, the value 0.0 will always be returned.
//!
//! ### 7.4 Eager loading versus lazy loading
//!
//! By default, *Lingua* uses lazy-loading to load only those language models on demand which are
//! considered relevant by the rule-based filter engine. For web services, for instance, it is
//! rather beneficial to preload all language models into memory to avoid unexpected latency while
//! waiting for the service response. If you want to enable the eager-loading mode, you can do it
//! like this:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//!
//! LanguageDetectorBuilder::from_all_languages().with_preloaded_language_models().build();
//! ```
//!
//! Multiple instances of `LanguageDetector` share the same language models in memory which are
//! accessed asynchronously by the instances.
//!
//! ### 7.5 Low accuracy mode versus high accuracy mode
//!
//! *Lingua's* high detection accuracy comes at the cost of being noticeably slower
//! than other language detectors. The large language models also consume significant
//! amounts of memory. These requirements might not be feasible for systems running low
//! on resources. If you want to classify mostly long texts or need to save resources,
//! you can enable a *low accuracy mode* that loads only a small subset of the language
//! models into memory:
//!
//! ```
//! use lingua::LanguageDetectorBuilder;
//!
//! LanguageDetectorBuilder::from_all_languages().with_low_accuracy_mode().build();
//! ```
//!
//! The downside of this approach is that detection accuracy for short texts consisting
//! of fewer than 120 characters will drop significantly. However, detection accuracy for
//! texts which are longer than 120 characters will remain mostly unaffected.
//!
//! In high accuracy mode (the default), the language detector consumes approximately
//! 1 GB of memory if all language models are loaded. In low accuracy mode, memory
//! consumption is reduced to approximately 100 MB. The goal is to further reduce memory
//! consumption in later releases.
//!
//! An alternative for a smaller memory footprint and faster performance is to reduce the set
//! of languages when building the language detector. In most cases, it is not advisable to
//! build the detector from all supported languages. When you have knowledge about
//! the texts you want to classify, you can almost always rule out certain languages as impossible
//! or unlikely to occur.
//!
//! ### 7.6 Single-language mode
//!
//! If you build a `LanguageDetector` from one language only, it will operate in single-language
//! mode. This means the detector will try to find out whether a given text has been written in
//! the given language or not. If not, `None` will be returned; otherwise, the given language.
//!
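//! A minimal sketch of this mode, assuming the German model is available (the exact outputs
//! depend on the bundled models):
//!
//! ```no_run
//! use lingua::Language::German;
//! use lingua::LanguageDetectorBuilder;
//!
//! let detector = LanguageDetectorBuilder::from_languages(&[German]).build();
//!
//! // Returns Some(German) if the text is classified as German, otherwise None.
//! let result = detector.detect_language_of("Guten Morgen, wie geht es dir?");
//! ```
//!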
//! In single-language mode, the detector decides based on a set of unique and most common n-grams
//! which have been collected beforehand for every supported language. It turns out that unique and
//! most common n-grams help to improve accuracy in low accuracy mode, so they are used for that
//! mode as well. In high accuracy mode, however, they do not make a significant difference, which
//! is why they are left out.
//!
//! ### 7.7 Detection of multiple languages in mixed-language texts
//!
//! In contrast to most other language detectors, *Lingua* is able to detect multiple languages
//! in mixed-language texts. This feature can yield quite reasonable results, but it is still
//! in an experimental state and therefore the detection result is highly dependent on the input
//! text. It works best in high-accuracy mode with multiple long words for each language.
//! The shorter the phrases and their words are, the less accurate the results become. Reducing the
//! set of languages when building the language detector can also improve accuracy for this task
//! if the languages occurring in the text match the languages supported by the respective
//! language detector instance.
//!
//! ```
//! use lingua::DetectionResult;
//! use lingua::Language::{English, French, German};
//! use lingua::LanguageDetectorBuilder;
//!
//! let languages = vec![English, French, German];
//! let detector = LanguageDetectorBuilder::from_languages(&languages).build();
//! let sentence = "Parlez-vous français? \
//!     Ich spreche Französisch nur ein bisschen. \
//!     A little bit is better than nothing.";
//!
//! let results: Vec<DetectionResult> = detector.detect_multiple_languages_of(sentence);
//!
//! if let [first, second, third] = &results[..] {
//!     assert_eq!(first.language(), French);
//!     assert_eq!(
//!         &sentence[first.start_index()..first.end_index()],
//!         "Parlez-vous français? "
//!     );
//!
//!     assert_eq!(second.language(), German);
//!     assert_eq!(
//!         &sentence[second.start_index()..second.end_index()],
//!         "Ich spreche Französisch nur ein bisschen. "
//!     );
//!
//!     assert_eq!(third.language(), English);
//!     assert_eq!(
//!         &sentence[third.start_index()..third.end_index()],
//!         "A little bit is better than nothing."
//!     );
//! }
//! ```
//!
//! In the example above, a vector of [DetectionResult] is returned. Each entry in the vector
//! describes a contiguous single-language text section, providing start and end indices of the
//! respective substring.
//!
//! ### 7.8 Single-threaded versus multi-threaded language detection
//!
//! The `LanguageDetector` methods explained above all operate in a single thread.
//! If you want to classify a very large set of texts, you will probably want to
//! use all available CPU cores efficiently in multiple threads for maximum performance.
//!
//! Every single-threaded method has a multi-threaded equivalent that accepts a list of texts
//! and returns a list of results.
//!
//! | Single-threaded                      | Multi-threaded                                   |
//! |--------------------------------------|--------------------------------------------------|
//! | `detect_language_of`                 | `detect_languages_in_parallel_of`                |
//! | `detect_multiple_languages_of`       | `detect_multiple_languages_in_parallel_of`       |
//! | `compute_language_confidence_values` | `compute_language_confidence_values_in_parallel` |
//! | `compute_language_confidence`        | `compute_language_confidence_in_parallel`        |
//!
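//! As a sketch, the parallel counterpart of `detect_language_of` accepts a slice of texts and
//! returns one result per text (the exact detection results depend on the loaded language models):
//!
//! ```no_run
//! use lingua::Language::{English, French, German};
//! use lingua::LanguageDetectorBuilder;
//!
//! let detector = LanguageDetectorBuilder::from_languages(&[English, French, German]).build();
//! let texts = ["what a beautiful morning", "quelle belle matinée", "was für ein schöner Morgen"];
//!
//! // One Option<Language> per input text, computed across multiple threads.
//! let detected_languages = detector.detect_languages_in_parallel_of(&texts);
//! ```
//!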
//! ### 7.9 Methods to build the LanguageDetector
//!
//! There might be classification tasks where you know beforehand that your language data is
//! definitely not written in Latin, for instance (what a surprise :-). The detection accuracy can
//! improve in such cases if you exclude certain languages from the decision process or just
//! explicitly include relevant languages:
//!
//! ```
//! use lingua::{LanguageDetectorBuilder, Language, IsoCode639_1, IsoCode639_3};
//!
//! // Include all languages available in the library.
//! LanguageDetectorBuilder::from_all_languages();
//!
//! // Include only languages that are not yet extinct (= currently excludes Latin).
//! LanguageDetectorBuilder::from_all_spoken_languages();
//!
//! // Include only languages written with Cyrillic script.
//! LanguageDetectorBuilder::from_all_languages_with_cyrillic_script();
//!
//! // Exclude only the Spanish language from the decision algorithm.
//! LanguageDetectorBuilder::from_all_languages_without(&[Language::Spanish]);
//!
//! // Only decide between English and German.
//! LanguageDetectorBuilder::from_languages(&[Language::English, Language::German]);
//!
//! // Select languages by ISO 639-1 code.
//! LanguageDetectorBuilder::from_iso_codes_639_1(&[IsoCode639_1::EN, IsoCode639_1::DE]);
//!
//! // Select languages by ISO 639-3 code.
//! LanguageDetectorBuilder::from_iso_codes_639_3(&[IsoCode639_3::ENG, IsoCode639_3::DEU]);
//! ```
//!
//! ## 8. WebAssembly support
//!
//! This library can be compiled to [WebAssembly (WASM)](https://webassembly.org) which makes it
//! possible to use *Lingua* in any JavaScript-based project, be it in the browser or in the back
//! end running on [Node.js](https://nodejs.org).
//!
//! The easiest way to compile is to use [`wasm-pack`](https://rustwasm.github.io/wasm-pack).
//! After the installation, you can, for instance, build the library with the web target so that it
//! can be directly used in the browser:
//!
//! ```shell
//! wasm-pack build --target web
//! ```
//!
//! By default, all 75 supported languages are included in the compiled wasm file which is
//! approximately 96 MB in size. If you only need a subset of certain languages, you can tell
//! `wasm-pack` which ones to include:
//!
//! ```shell
//! wasm-pack build --target web -- --no-default-features --features "french,italian,spanish"
//! ```
//!
//! The output of `wasm-pack` will be hosted in a
//! [separate repository](https://github.com/pemistahl/lingua-js) which allows adding further
//! JavaScript-related configuration, tests and documentation. *Lingua* will then be added to the
//! [npm registry](https://www.npmjs.com) as well, allowing for easy download and installation
//! within any JavaScript or TypeScript project.

#[macro_use]
extern crate maplit;

#[cfg(test)]
use regex::Regex;

pub use builder::LanguageDetectorBuilder;
pub use detector::LanguageDetector;
pub use isocode::{IsoCode639_1, IsoCode639_3};
pub use language::Language;
pub use result::DetectionResult;
#[cfg(target_family = "wasm")]
pub use wasm::{
    ConfidenceValue, DetectionResult as WasmDetectionResult,
    LanguageDetectorBuilder as WasmLanguageDetectorBuilder,
};
pub use writer::{LanguageModelFilesWriter, TestDataFilesWriter};

mod alphabet;
mod builder;
mod constant;
mod detector;
mod isocode;
mod json;
mod language;
mod model;
mod ngram;
mod result;
mod script;
mod writer;

#[cfg(feature = "python")]
mod python;

#[cfg(target_family = "wasm")]
mod wasm;

#[cfg(any(target_family = "wasm", feature = "python"))]
pub(crate) fn convert_byte_indices_to_char_indices(
    results: &[DetectionResult],
    text: &str,
) -> Vec<DetectionResult> {
    let mut converted_results: Vec<DetectionResult> = vec![];

    for i in 0..results.len() {
        let result = results[i];
        // Count the characters within the byte range of the current section.
        let chars_count = text[result.start_index..result.end_index].chars().count();
        // Sections are contiguous, so each one starts where the previous one ended.
        let start_index = if i == 0 {
            0
        } else {
            converted_results[i - 1].end_index
        };
        let end_index = start_index + chars_count;
        converted_results.push(DetectionResult {
            start_index,
            end_index,
            word_count: result.word_count,
            language: result.language,
        });
    }

    converted_results
}

#[cfg(test)]
pub(crate) fn minify(json: &str) -> String {
    // Remove all newlines together with their following indentation.
    let re = Regex::new("\n\\s*").unwrap();
    re.replace_all(json, "").to_string()
}