Word-number recognition and substitution.
This module converts English written numbers (e.g. "twenty-three",
"one thousand nine hundred eighty-four") into their digit equivalents,
allowing the tokeniser to treat them the same as numerals.
§Approach
The public entry point is replace_word_numbers, which scans an
utterance for the longest contiguous word-number span it can parse and
replaces it with the decimal representation. Multiple non-overlapping
spans are replaced left-to-right.
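The longest-span, left-to-right strategy described above can be sketched as follows. This is an illustration only, not the module's actual code: `replace_spans` and `toy_parse` are hypothetical names, and `toy_parse` stands in for the real span parser by recognising a handful of exact phrases.

```rust
/// Hypothetical stand-in for the real parser: returns Some(value) only when
/// the whole token slice forms one number it knows about.
fn toy_parse(tokens: &[&str]) -> Option<u32> {
    match tokens {
        ["three"] => Some(3),
        ["five"] => Some(5),
        ["twenty"] => Some(20),
        ["twenty", "three"] => Some(23),
        _ => None,
    }
}

/// Replaces each parsable run of tokens with its decimal form, preferring the
/// longest run at every position and moving left to right.
fn replace_spans(utterance: &str, parse: impl Fn(&[&str]) -> Option<u32>) -> String {
    let tokens: Vec<&str> = utterance.split_whitespace().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // Try the longest candidate span starting at `i` first, then shrink it.
        let hit = (i + 1..=tokens.len())
            .rev()
            .find_map(|end| parse(&tokens[i..end]).map(|v| (end, v)));
        match hit {
            Some((end, value)) => {
                out.push(value.to_string());
                i = end; // skip past the consumed span so spans never overlap
            }
            None => {
                out.push(tokens[i].to_string());
                i += 1;
            }
        }
    }
    out.join(" ")
}
```

With this toy parser, `replace_spans("on the twenty three of may", toy_parse)` yields `"on the 23 of may"`: the two-word span wins over the shorter match on "twenty" alone. Note that the sketch normalises whitespace and ignores punctuation, which the real function may well handle differently.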
Each individual word is fuzzy-matched against the canonical English
number vocabulary using crate::levenshtein::levenshtein_ratio. This
lets the module tolerate common typos, repeated characters, transpositions,
and phonetic spelling patterns from non-English speakers (see test suite).
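As a rough illustration of per-word fuzzy matching, the sketch below scores a word against a small vocabulary and keeps the best candidate above a threshold. The implementation of crate::levenshtein::levenshtein_ratio is not shown in this documentation, so `similarity_ratio` here is an assumed definition (one minus edit distance over the longer length), and `best_match`, the three-entry vocabulary, and the threshold are all hypothetical.

```rust
/// Classic dynamic-programming Levenshtein edit distance over chars.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let v = (prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1);
            cur.push(v);
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Assumed ratio: 1.0 for identical strings, falling towards 0.0 as the
/// edit distance approaches the length of the longer string.
fn similarity_ratio(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}

/// Returns the vocabulary entry with the highest ratio, provided that ratio
/// reaches `threshold`; otherwise None.
fn best_match<'a>(
    word: &str,
    vocab: &[(&'a str, u32)],
    threshold: f64,
) -> Option<(&'a str, u32)> {
    let word = word.to_lowercase();
    let mut best: Option<(&'a str, u32, f64)> = None;
    for &(canon, value) in vocab {
        let r = similarity_ratio(&word, canon);
        if r >= threshold && best.map_or(true, |(_, _, br)| r > br) {
            best = Some((canon, value, r));
        }
    }
    best.map(|(c, v, _)| (c, v))
}
```

Under these assumptions, a transposition such as "twnety" costs two edits against "twenty" (ratio about 0.67), so it matches at a threshold of 0.6 but would be rejected at 0.7. The threshold therefore trades typo tolerance against false positives.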
Ordinal forms ("first", "twenty-third", etc.) are included in the
vocabulary so they parse identically to their cardinal equivalents.
Common English stop words ("the", "of", "and", etc.) are explicitly
excluded so they cannot produce false-positive number matches.
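A minimal sketch of how ordinals and stop words might fit into the lookup: ordinal spellings share a value with their cardinals, and the stop-word check runs before any fuzzy comparison. Everything here is hypothetical, including the names `STOP_WORDS`, `VOCAB`, `lookup`, and `same_initial`; the latter is a deliberately sloppy matcher used only to show why the guard matters.

```rust
/// Illustrative stop words only; the module's actual list is not shown here.
const STOP_WORDS: &[&str] = &["the", "of", "and"];

/// A few sample entries; cardinal and ordinal spellings share one value.
const VOCAB: &[(&str, u32)] = &[
    ("one", 1),
    ("first", 1),
    ("three", 3),
    ("third", 3),
    ("twenty", 20),
    ("twentieth", 20),
];

/// Toy fuzzy matcher: same first letter and lengths within two bytes.
/// It is far too permissive on purpose ("the" would match "three").
fn same_initial(a: &str, b: &str) -> bool {
    a.chars().next() == b.chars().next() && a.len().abs_diff(b.len()) <= 2
}

/// Looks a word up in VOCAB, rejecting stop words before fuzzy matching.
fn lookup(word: &str, fuzzy: impl Fn(&str, &str) -> bool) -> Option<u32> {
    let w = word.to_lowercase();
    if STOP_WORDS.contains(&w.as_str()) {
        return None; // never let a stop word reach the fuzzy matcher
    }
    VOCAB
        .iter()
        .find(|(canon, _)| fuzzy(w.as_str(), *canon))
        .map(|&(_, value)| value)
}
```

Here `same_initial("the", "three")` is true, so without the guard "the" would be read as 3; with it, `lookup("the", same_initial)` returns `None` while `lookup("first", same_initial)` still returns `Some(1)`.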
§Supported range
1–3000, covering every value that is meaningful as a day (1–31), month (1–12), or year (1–3000) in the date-extraction context.
§Grammar
number     ::= thousands? hundreds? tens_units
thousands  ::= unit "thousand"
hundreds   ::= unit "hundred"
tens_units ::= tens unit?   (e.g. "twenty", "twenty-one", "twenty-third")
             | teen         (e.g. "fourteenth")
             | unit         (e.g. "seventh")
             | (empty)

Hyphenated compound words ("twenty-one") are split on - before
individual word matching so the hyphen is treated as a separator.
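The grammar and the hyphen splitting can be sketched as a small hand-written parser. This is a simplified illustration under several assumptions: it uses exact matching instead of fuzzy matching, handles cardinals only, groups "ten" with the teens, and applies the 1–3000 range check at the end. The names `unit`, `teen`, `tens`, and `parse_number` are hypothetical.

```rust
fn unit(w: &str) -> Option<u32> {
    const UNITS: [&str; 9] = [
        "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    ];
    UNITS.iter().position(|&u| u == w).map(|i| i as u32 + 1)
}

fn teen(w: &str) -> Option<u32> {
    // Assumption: "ten" is grouped with the teens here.
    const TEENS: [&str; 10] = [
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen",
    ];
    TEENS.iter().position(|&t| t == w).map(|i| i as u32 + 10)
}

fn tens(w: &str) -> Option<u32> {
    const TENS: [&str; 8] = [
        "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    ];
    TENS.iter().position(|&t| t == w).map(|i| (i as u32 + 2) * 10)
}

/// Parses one candidate span; returns None unless every word is consumed
/// and the value lies in 1..=3000.
fn parse_number(span: &str) -> Option<u32> {
    // Split on whitespace and on '-' so "eighty-four" becomes ["eighty", "four"].
    let lowered = span.to_lowercase();
    let words: Vec<&str> = lowered
        .split(|c: char| c.is_whitespace() || c == '-')
        .filter(|w| !w.is_empty())
        .collect();
    let mut pos = 0;
    let mut total = 0;

    // thousands ::= unit "thousand"
    if words.len() >= pos + 2 && words[pos + 1] == "thousand" {
        total += unit(words[pos])? * 1000;
        pos += 2;
    }
    // hundreds ::= unit "hundred"
    if words.len() >= pos + 2 && words[pos + 1] == "hundred" {
        total += unit(words[pos])? * 100;
        pos += 2;
    }
    // tens_units ::= tens unit? | teen | unit | (empty)
    if let Some(&w) = words.get(pos) {
        if let Some(t) = tens(w) {
            total += t;
            pos += 1;
            if let Some(u) = words.get(pos).and_then(|w| unit(w)) {
                total += u;
                pos += 1;
            }
        } else if let Some(v) = teen(w).or_else(|| unit(w)) {
            total += v;
            pos += 1;
        }
    }
    (pos == words.len() && (1..=3000).contains(&total)).then_some(total)
}
```

In this sketch, `parse_number("one thousand nine hundred eighty-four")` returns `Some(1984)`, and the empty `tens_units` alternative is what lets "two hundred" or "three thousand" parse without a trailing word. Inputs such as "four thousand" or "twenty the" return `None`, the first because it exceeds the supported range and the second because a word is left unconsumed.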
Functions§
- replace_word_numbers: Scan utterance for word-number spans and replace each with its decimal representation, returning the modified string.