text2num/lib.rs
/*!
This crate provides a library for recognizing, parsing and transcribing into digits (base 10) numbers expressed in natural language.

This top level documentation describes the usage of the library with the builtin languages and provides some simple examples.

For more specific details on how to add support for new natural languages (and contributing to the builtin set!), please see the documentation of the [`lang`] module.


# Usage

This crate is [on crates.io](https://crates.io/crates/text2num) and can be
used by adding `text2num` to your dependencies in your project's `Cargo.toml`.

```toml
[dependencies]
text2num = "1"
```

# Example: check that a string is a valid number in a given language.

For convenience, the builtin languages are encapsulated into the [`Language`] type so
you can easily switch languages at runtime.

Each builtin language supports regional varieties automatically, so you don't need to specify a region.

The language interpreters are stateless so you can reuse and share them.

```rust
use text2num::{Language, text2digits};

let en = Language::english();

assert!(
    text2digits("one hundred fifty-seven", &en).is_ok()
);

assert!(text2digits("twenty twelve", &en).is_err());
```

Of course, you can get the base 10 digit representation too:

```rust
use text2num::{Language, text2digits};

let es = Language::spanish();
let utterance = "ochenta y cinco";

match text2digits(utterance, &es) {
    Ok(repr) => println!("'{}' means {} in Spanish", utterance, repr),
    Err(_) => println!("'{}' is not a number in Spanish", utterance)
}
```

When run, the above code should print `'ochenta y cinco' means 85 in Spanish` on the standard output.

If you don't need to dynamically switch languages, you can directly use the appropriate interpreter instead of
the `Language` type:

```rust
use text2num::lang::English;
use text2num::text2digits;

let en = English::new();

assert!(text2digits("fifty-five", &en).is_ok());
```

# Example: find and replace numbers in a natural speech string.

Most often, you just want to rewrite a string containing natural speech so that the numbers it contains (cardinals,
ordinals, decimal numbers) appear in digit (base 10) form instead.

Since small isolated numbers are often easier to read when spelled out, you can specify a threshold below which
isolated simple cardinals and ordinals are not replaced.

```rust
use text2num::{Language, replace_numbers_in_text};

let en = Language::english();

let sentence = "Let me show you two things: first, isolated numbers are treated differently than groups like one, two, three. And then, that decimal numbers like three point one four one five are well understood.";

assert_eq!(
    replace_numbers_in_text(sentence, &en, 10.0),
    "Let me show you two things: first, isolated numbers are treated differently than groups like 1, 2, 3. And then, that decimal numbers like 3.1415 are well understood."
);

assert_eq!(
    replace_numbers_in_text(sentence, &en, 0.0),
    "Let me show you 2 things: 1st, isolated numbers are treated differently than groups like 1, 2, 3. And then, that decimal numbers like 3.1415 are well understood."
);
```

# More advanced usage: operations on token streams.

Real life applications of this library include post-processing Automatic Speech Recognition (ASR)
output and taking part in a Natural Language Processing (NLP) pipeline.

In those cases, you'll probably get a stream of tokens of a certain type instead of a string.
The `text2num` library can process those streams as long as the token type implements the [`Token`](word_to_digit::Token) trait.


# Example: substitutions in a token list.

We can show a simple example with `String` streams:

```rust
use text2num::{replace_numbers_in_stream, Language, Token, Replace};

let en = Language::english();

struct BareToken(String);

impl Token for &BareToken {
    fn text(&self) -> &str {
        self.0.as_ref()
    }

    fn text_lowercase(&self) -> &str {
        self.0.as_ref()
    }
}

impl Replace for BareToken {
    fn replace<I: Iterator<Item = Self>>(_replaced: I, data: String) -> Self {
        BareToken(data)
    }
}

// Poor man's tokenizer
let token_list = "I have two hundreds and twenty dollars in my pocket".split_whitespace().map(|s| BareToken(s.to_string())).collect();

let processed_stream = replace_numbers_in_stream(token_list, &en, 10.0);

assert_eq!(
    processed_stream.into_iter().map(|t| t.0).collect::<Vec<_>>(),
    vec!["I", "have", "220", "dollars", "in", "my", "pocket"]
);
```
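
In the example above, `text_lowercase` returns the text unchanged. Because that method returns a borrowed `&str`, a token that needs a real lowercased form has to store it. The following crate-independent sketch shows one way to do so, assuming `text_lowercase` is meant to yield the lowercased form of `text`. `CasedToken` and `tokenize` are hypothetical names, and the two accessors only mirror the `Token` method signatures rather than implementing the trait.

```rust
/// Hypothetical token type that keeps the original text and a lowercased copy.
struct CasedToken {
    text: String,
    lowercase: String,
}

impl CasedToken {
    fn new(text: &str) -> Self {
        CasedToken {
            text: text.to_string(),
            // Computed once, so the accessor below can hand out a borrowed `&str`.
            lowercase: text.to_lowercase(),
        }
    }

    // Same signatures as `Token::text` and `Token::text_lowercase`.
    fn text(&self) -> &str {
        &self.text
    }

    fn text_lowercase(&self) -> &str {
        &self.lowercase
    }
}

/// Poor man's tokenizer, as in the example above, but producing `CasedToken`s.
fn tokenize(sentence: &str) -> Vec<CasedToken> {
    sentence.split_whitespace().map(CasedToken::new).collect()
}
```

With these definitions, `tokenize("Twenty Four")` yields two tokens whose `text()` values are `"Twenty"` and `"Four"` and whose `text_lowercase()` values are `"twenty"` and `"four"`.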

# Example: find numbers in a token stream.

In this more elaborate example, we show how to implement the `Token` trait on a typical ASR token type and
how to locate numbers (and their values) in a stream of those tokens.

```rust
use text2num::{find_numbers, Language, Token};

struct DecodedWord<'a> {
    text: &'a str,
    start: u64, // in milliseconds
    end: u64
}

impl Token for DecodedWord<'_> {
    fn text(&self) -> &str {
        self.text
    }

    fn text_lowercase(&self) -> &str {
        self.text
    }

    fn nt_separated(&self, previous: &Self) -> bool {
        // a voice pause of more than 100 ms between two words counts as punctuation
        self.start - previous.end > 100
    }

    fn not_a_number_part(&self) -> bool {
        false
    }
}


// Simulate real time (online) ASR output

let stream = [
    DecodedWord{ text: "i", start: 0, end: 100},
    DecodedWord{ text: "have", start: 100, end: 200},
    DecodedWord{ text: "twenty", start: 200, end: 300},
    DecodedWord{ text: "four", start: 300, end: 400},
    DecodedWord{ text: "dollars", start: 410, end: 800},
    DecodedWord{ text: "in", start: 800, end: 900},
    DecodedWord{ text: "my", start: 900, end: 1000},
    DecodedWord{ text: "pocket", start: 1010, end: 1410},
].into_iter();

// Process

let en = Language::english();

let occurences = find_numbers(stream, &en, 10.0);

assert_eq!(occurences.len(), 1);

let found = &occurences[0];

// Match position in the stream
assert_eq!(found.start, 2);
assert_eq!(found.end, 4);
// Match values
assert_eq!(found.text, "24");
assert_eq!(found.value, 24.0);
assert!(!found.is_ordinal);
```
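
The pause test in `nt_separated` above subtracts two unsigned timestamps. That is fine for the well-ordered stream shown, but it would underflow if a word started before the previous one ended. Whether that can happen depends on the upstream ASR engine, which this documentation does not specify. Below is a crate-independent sketch of an underflow-safe variant; `pause_exceeds` and its parameter names are hypothetical.

```rust
/// Returns `true` when the silence between the end of the previous word and
/// the start of the current one is strictly longer than `max_pause_ms`.
/// Overlapping words (start before previous end) count as a pause of zero.
fn pause_exceeds(previous_end_ms: u64, start_ms: u64, max_pause_ms: u64) -> bool {
    // `saturating_sub` clamps at 0 instead of underflowing.
    start_ms.saturating_sub(previous_end_ms) > max_pause_ms
}
```

With a 100 ms limit, `pause_exceeds(400, 410, 100)` is `false` (the 10 ms gap between "four" and "dollars" above), `pause_exceeds(400, 600, 100)` is `true`, and the overlapping case `pause_exceeds(500, 450, 100)` is `false` rather than a panic in a debug build. Inside `nt_separated`, the call would read `pause_exceeds(previous.end, self.start, 100)`.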


*/

pub mod digit_string;
pub mod error;
pub mod lang;
mod tokenizer;
pub mod word_to_digit;

pub use lang::{BasicAnnotate, LangInterpreter, Language};
pub use word_to_digit::{
    Occurence, Replace, Token, find_numbers, find_numbers_iter, replace_numbers_in_stream,
    replace_numbers_in_text, text2digits,
};

/// Get an interpreter for the language identified by `language_code`, a lowercase
/// two-letter ISO 639-1 code (`"de"`, `"en"`, `"es"`, `"fr"`, `"it"`, `"nl"` or `"pt"`).
///
/// Returns `None` for any other code.
pub fn get_interpreter_for(language_code: &str) -> Option<Language> {
    match language_code {
        "de" => Some(Language::german()),
        "en" => Some(Language::english()),
        "es" => Some(Language::spanish()),
        "fr" => Some(Language::french()),
        "it" => Some(Language::italian()),
        "nl" => Some(Language::dutch()),
        "pt" => Some(Language::portuguese()),
        _ => None,
    }
}

#[cfg(test)]
mod tests {
    use super::{Language, replace_numbers_in_text};

    #[test]
    fn test_access_fr() {
        let french = Language::french();
        assert_eq!(
            replace_numbers_in_text(
                "Pour la cinquième fois: vingt-cinq plus quarante-huit égalent soixante-treize",
                &french,
                0.0
            ),
            "Pour la 5ème fois: 25 plus 48 égalent 73"
        );
    }

    #[test]
    fn test_zeros_fr() {
        let french = Language::french();
        assert_eq!(
            replace_numbers_in_text("zéro zéro trente quatre vingt", &french, 10.),
            "0034 20"
        );
    }

    #[test]
    fn test_access_en() {
        let english = Language::english();
        assert_eq!(
            replace_numbers_in_text(
                "For the fifth time: twenty-five plus fourty-eight equal seventy-three",
                &english,
                0.0
            ),
            "For the 5th time: 25 plus 48 equal 73"
        );
    }
}