/*!
This crate provides a library for recognizing and parsing numbers expressed in natural language, and for transcribing them into (base 10) digits.

This top-level documentation describes how to use the library with the built-in languages and provides some simple examples.

For more details on how to add support for new natural languages (and on contributing them to the built-in set!), please see the documentation of the [`lang`] module.


# Usage

This crate is [on crates.io](https://crates.io/crates/text2num) and can be
used by adding `text2num` to your dependencies in your project's `Cargo.toml`.

```toml
[dependencies]
text2num = "1"
```

# Example: check that a string is a valid number in a given language.

For convenience, the built-in languages are encapsulated into the [`Language`] type so
you can easily switch languages at runtime.

Each built-in language supports regional varieties automatically, so you don't need to specify a region.

The language interpreters are stateless, so you can reuse and share them.

```rust
use text2num::{Language, text2digits};

let en = Language::english();

assert!(
    text2digits("one hundred fifty-seven", &en).is_ok()
);

assert!(text2digits("twenty twelve", &en).is_err());
```

Of course, you can get the base 10 digit representation too:

```rust
use text2num::{Language, text2digits};

let es = Language::spanish();
let utterance = "ochenta y cinco";

match text2digits(utterance, &es) {
    Ok(repr) => println!("'{}' means {} in Spanish", utterance, repr),
    Err(_) => println!("'{}' is not a number in Spanish", utterance)
}
```

When run, the above code should print `'ochenta y cinco' means 85 in Spanish` on the standard output.

If you don't need to dynamically switch languages, you can directly use the appropriate interpreter instead of
the `Language` type:

```rust
use text2num::lang::English;
use text2num::text2digits;

let en = English::new();

assert!(text2digits("fifty-five", &en).is_ok());
```
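
Since regional varieties are handled by the language itself, [`get_interpreter_for`] only matches bare two-letter codes such as `"en"` or `"pt"`. If your application receives full locale tags like `en-US` or `pt_BR`, you will probably want to reduce them to the language part first. The following is a minimal sketch under that assumption: `primary_language_code` is a hypothetical helper, not part of this crate, and it deliberately uses only the standard library.

```rust
/// Hypothetical helper (not provided by `text2num`): keep only the primary
/// language subtag of a locale tag and lowercase it.
fn primary_language_code(tag: &str) -> String {
    tag.trim()
        .split(|c: char| c == '-' || c == '_')
        .next()
        .unwrap_or("")
        .to_ascii_lowercase()
}

fn main() {
    assert_eq!(primary_language_code("en-US"), "en");
    assert_eq!(primary_language_code("pt_BR"), "pt");
    assert_eq!(primary_language_code("FR"), "fr");
    // A bare code is returned unchanged.
    assert_eq!(primary_language_code("de"), "de");
}
```

The resulting code can then be passed to `get_interpreter_for`, which returns `None` for anything outside the built-in set.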

# Example: find and replace numbers in a natural speech string.

Most often, you just want to rewrite a string containing natural speech so that the numbers it contains (cardinals,
ordinals, decimal numbers) appear in digit (base 10) form instead.

Because small isolated numbers are often easier to read as words, you can specify a threshold below which isolated
simple cardinals and ordinals are left untouched.

```rust
use text2num::{Language, replace_numbers_in_text};

let en = Language::english();

let sentence = "Let me show you two things: first, isolated numbers are treated differently than groups like one, two, three. And then, that decimal numbers like three point one four one five are well understood.";

// With a threshold of 10, the isolated "two" and "first" are kept as words.
assert_eq!(
    replace_numbers_in_text(sentence, &en, 10.0),
    "Let me show you two things: first, isolated numbers are treated differently than groups like 1, 2, 3. And then, that decimal numbers like 3.1415 are well understood."
);

// With a threshold of 0, every number is replaced.
assert_eq!(
    replace_numbers_in_text(sentence, &en, 0.0),
    "Let me show you 2 things: 1st, isolated numbers are treated differently than groups like 1, 2, 3. And then, that decimal numbers like 3.1415 are well understood."
);
```

# More advanced usage: operations on token streams.

Real-life applications of this library include post-processing Automatic Speech Recognition (ASR)
output and taking part in Natural Language Processing (NLP) pipelines.

In those cases, you'll probably get a stream of tokens of a certain type instead of a string.
The `text2num` library can process those streams as long as the token type implements the [`Token`](word_to_digit::Token) trait.


# Example: substitutions in a token list.

We can show a simple example with a list of `String`-backed tokens:

```rust
use text2num::{replace_numbers_in_stream, Language, Token, Replace};

let en = Language::english();

struct BareToken(String);

impl Token for &BareToken {
    fn text(&self) -> &str {
        self.0.as_ref()
    }

    // No case folding here: the number words in this input are already lowercase.
    fn text_lowercase(&self) -> &str {
        self.0.as_ref()
    }
}

impl Replace for BareToken {
    // Build the token that stands in for the replaced ones.
    fn replace<I: Iterator<Item = Self>>(_replaced: I, data: String) -> Self {
        BareToken(data)
    }
}

// Poor man's tokenizer
let token_list = "I have two hundreds and twenty dollars in my pocket"
    .split_whitespace()
    .map(|s| BareToken(s.to_string()))
    .collect();

let processed_stream = replace_numbers_in_stream(token_list, &en, 10.0);

assert_eq!(
    processed_stream.into_iter().map(|t| t.0).collect::<Vec<_>>(),
    vec!["I", "have", "220", "dollars", "in", "my", "pocket"]
);
```
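
In the example above, `text_lowercase` simply returns the original text, which is presumably fine only because the number words in the input are already lowercase. Since the method returns a borrowed `&str`, a token built from mixed-case input has to store its lowercase form somewhere. One way to do that is to compute it once at construction. The sketch below does not depend on the crate: the `CasedToken` type is hypothetical, and its two inherent methods merely mirror the signatures of `Token::text` and `Token::text_lowercase` shown above.

```rust
/// Hypothetical token type that keeps both the original and lowercased text.
struct CasedToken {
    text: String,
    lowercase: String,
}

impl CasedToken {
    fn new(text: &str) -> Self {
        CasedToken {
            text: text.to_string(),
            // Computed once, so `text_lowercase` can hand out a plain borrow.
            lowercase: text.to_lowercase(),
        }
    }

    // Same signature as `Token::text`.
    fn text(&self) -> &str {
        &self.text
    }

    // Same signature as `Token::text_lowercase`.
    fn text_lowercase(&self) -> &str {
        &self.lowercase
    }
}

fn main() {
    let token = CasedToken::new("Twenty");
    assert_eq!(token.text(), "Twenty");
    assert_eq!(token.text_lowercase(), "twenty");
}
```

In a real integration these two methods would move into an `impl Token for ...` block; the storage strategy stays the same.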

# Example: find numbers in a token stream.

In this more elaborate example, we show how to implement the `Token` trait on a typical ASR token type and
how to locate numbers (and their values) in a stream of those tokens.

```rust
use text2num::{find_numbers, Language, Token};

struct DecodedWord<'a> {
    text: &'a str,
    start: u64,  // in milliseconds
    end: u64
}

impl Token for DecodedWord<'_> {
    fn text(&self) -> &str {
        self.text
    }

    // The simulated ASR output below is already lowercase.
    fn text_lowercase(&self) -> &str {
        self.text
    }

    fn nt_separated(&self, previous: &Self) -> bool {
        // A voice pause of more than 100 ms between two words is treated like punctuation.
        // `saturating_sub` avoids an underflow if the timestamps overlap.
        self.start.saturating_sub(previous.end) > 100
    }

    fn not_a_number_part(&self) -> bool {
        false
    }
}


// Simulate real-time (online) ASR output

let stream = [
    DecodedWord{ text: "i", start: 0, end: 100},
    DecodedWord{ text: "have", start: 100, end: 200},
    DecodedWord{ text: "twenty", start: 200, end: 300},
    DecodedWord{ text: "four", start: 300, end: 400},
    DecodedWord{ text: "dollars", start: 410, end: 800},
    DecodedWord{ text: "in", start: 800, end: 900},
    DecodedWord{ text: "my", start: 900, end: 1000},
    DecodedWord{ text: "pocket", start: 1010, end: 1410},
].into_iter();

// Process

let en = Language::english();

let occurrences = find_numbers(stream, &en, 10.0);

assert_eq!(occurrences.len(), 1);

let found = &occurrences[0];

// Match position in the stream
assert_eq!(found.start, 2);
assert_eq!(found.end, 4);
// Match values
assert_eq!(found.text, "24");
assert_eq!(found.value, 24.0);
assert!(!found.is_ordinal);
```
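
The `nt_separated` implementation above encodes a simple rule: a silence longer than 100 ms between two words counts as a separator. To see that rule in isolation, here is a standalone sketch that does not use the crate. `Word`, `is_pause` and `split_on_pauses` are hypothetical names introduced for illustration only. The gap is computed with `saturating_sub`, so overlapping timestamps (a word starting before the previous one ended) give a gap of zero instead of underflowing.

```rust
/// Hypothetical ASR word with timestamps in milliseconds.
struct Word<'a> {
    text: &'a str,
    start: u64,
    end: u64,
}

/// True when the silence between `previous` and `current` exceeds `threshold_ms`.
fn is_pause(previous: &Word<'_>, current: &Word<'_>, threshold_ms: u64) -> bool {
    current.start.saturating_sub(previous.end) > threshold_ms
}

/// Group consecutive words, starting a new group after each long pause.
fn split_on_pauses<'a>(words: &[Word<'a>], threshold_ms: u64) -> Vec<Vec<&'a str>> {
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    for (i, word) in words.iter().enumerate() {
        let new_group = i == 0 || is_pause(&words[i - 1], word, threshold_ms);
        if new_group {
            groups.push(Vec::new());
        }
        // A group always exists here: the first word opens one.
        groups.last_mut().unwrap().push(word.text);
    }
    groups
}

fn main() {
    let words = [
        Word { text: "twenty", start: 0, end: 300 },
        Word { text: "four", start: 310, end: 600 },
        // 400 ms of silence before the next word
        Word { text: "five", start: 1000, end: 1300 },
    ];
    assert_eq!(
        split_on_pauses(&words, 100),
        vec![vec!["twenty", "four"], vec!["five"]]
    );
}
```

The 100 ms threshold is the one used in the example above; a suitable value for real ASR output depends on the recognizer and the speaker.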


*/

pub mod digit_string;
pub mod error;
pub mod lang;
mod tokenizer;
pub mod word_to_digit;

pub use lang::{BasicAnnotate, LangInterpreter, Language};
pub use word_to_digit::{
    Occurence, Replace, Token, find_numbers, find_numbers_iter, replace_numbers_in_stream,
    replace_numbers_in_text, text2digits,
};

/// Get an interpreter for the language identified by `language_code`, a two-letter ISO 639-1 code.
///
/// Returns `None` if the language is not one of the built-in languages.
pub fn get_interpreter_for(language_code: &str) -> Option<Language> {
    match language_code {
        "de" => Some(Language::german()),
        "en" => Some(Language::english()),
        "es" => Some(Language::spanish()),
        "fr" => Some(Language::french()),
        "it" => Some(Language::italian()),
        "nl" => Some(Language::dutch()),
        "pt" => Some(Language::portuguese()),
        _ => None,
    }
}

#[cfg(test)]
mod tests {
    use super::{Language, replace_numbers_in_text};

    #[test]
    fn test_access_fr() {
        let french = Language::french();
        assert_eq!(
            replace_numbers_in_text(
                "Pour la cinquième fois: vingt-cinq plus quarante-huit égalent soixante-treize",
                &french,
                0.0
            ),
            "Pour la 5ème fois: 25 plus 48 égalent 73"
        );
    }

    #[test]
    fn test_zeros_fr() {
        let french = Language::french();
        assert_eq!(
            replace_numbers_in_text("zéro zéro trente quatre vingt", &french, 10.),
            "0034 20"
        );
    }

    #[test]
    fn test_access_en() {
        let english = Language::english();
        assert_eq!(
            replace_numbers_in_text(
                "For the fifth time: twenty-five plus fourty-eight equal seventy-three",
                &english,
                0.0
            ),
            "For the 5th time: 25 plus 48 equal 73"
        );
    }
}