Module word_numbers

Word-number recognition and substitution.

This module converts English written numbers (e.g. "twenty-three", "one thousand nine hundred eighty-four") into their digit equivalents, allowing the tokeniser to treat them the same as numerals.

§Approach

The public entry point is replace_word_numbers, which scans an utterance for the longest contiguous word-number span it can parse and replaces it with the decimal representation. Multiple non-overlapping spans are replaced left-to-right.
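The longest-span, left-to-right strategy can be sketched roughly as follows. This is an illustration only, not the module's actual code: `replace_spans` and `toy_parse` are hypothetical names, the input is assumed to be already split into lowercase words, and the toy parser recognises just a handful of spans so the example stays short.

```rust
/// Hypothetical helper: at each position, try the longest candidate span
/// first and shrink it until `parse` accepts, then continue after the match.
fn replace_spans<F>(words: &[&str], parse: F) -> Vec<String>
where
    F: Fn(&[&str]) -> Option<u32>,
{
    let mut out = Vec::new();
    let mut start = 0;
    while start < words.len() {
        // Candidate end positions, longest span first.
        let hit = ((start + 1)..=words.len())
            .rev()
            .find_map(|end| parse(&words[start..end]).map(|n| (end, n)));
        match hit {
            Some((end, n)) => {
                // Replace the whole span with its decimal form.
                out.push(n.to_string());
                start = end;
            }
            None => {
                // Not the start of a number: keep the word unchanged.
                out.push(words[start].to_string());
                start += 1;
            }
        }
    }
    out
}

/// Toy stand-in for the real span parser (assumption for this example).
fn toy_parse(span: &[&str]) -> Option<u32> {
    match span {
        ["three"] => Some(3),
        ["twenty"] => Some(20),
        ["twenty", "three"] => Some(23),
        _ => None,
    }
}
```

With these definitions, `replace_spans(&["on", "twenty", "three", "may", "three"], toy_parse)` yields `["on", "23", "may", "3"]`: "twenty three" is consumed as a single span rather than as 20 followed by 3. The sketch returns a word list; the real function returns a string and may preserve the original spacing and punctuation.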

Each individual word is fuzzy-matched against the canonical English number vocabulary using crate::levenshtein::levenshtein_ratio. This lets the module tolerate common typos, repeated characters, transpositions, and phonetic spellings typical of non-native English speakers (see test suite).
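A minimal sketch of this kind of fuzzy vocabulary lookup is shown below. Everything in it is an assumption made for illustration: the signature and exact formula of `crate::levenshtein::levenshtein_ratio` are not documented here, so the sketch defines its own `similarity` as `1 - distance / longest_length`, and the vocabulary, stop-word list, threshold, and the name `best_match` are hypothetical.

```rust
/// Two-row Levenshtein edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1).min(curr[j - 1] + 1).min(prev[j - 1] + cost);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Assumed ratio: 1.0 for identical strings, 0.0 for nothing in common.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}

// Abbreviated, hypothetical word lists for the example.
const VOCAB: &[&str] = &[
    "one", "two", "three", "four", "five", "twenty", "thirty", "hundred", "thousand",
];
const STOP_WORDS: &[&str] = &["the", "of", "and"];

/// Return the best-scoring vocabulary word at or above `threshold`,
/// rejecting stop words before any fuzzy comparison takes place.
fn best_match(word: &str, threshold: f64) -> Option<&'static str> {
    let word = word.to_lowercase();
    if STOP_WORDS.iter().any(|s| *s == word.as_str()) {
        return None;
    }
    let mut best: Option<(&'static str, f64)> = None;
    for &cand in VOCAB {
        let score = similarity(&word, cand);
        if score >= threshold && best.map_or(true, |(_, s)| score > s) {
            best = Some((cand, score));
        }
    }
    best.map(|(cand, _)| cand)
}
```

The stop-word check matters because short function words can sit close to number words: "the" is only two insertions away from "three", so a permissive threshold could otherwise accept it.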

Ordinal forms ("first", "twenty-third", etc.) are included in the vocabulary so they parse identically to their cardinal equivalents.

Common English stop words ("the", "of", "and", etc.) are explicitly excluded so they cannot produce false-positive number matches.

§Supported range

1–3000, which covers every value that is meaningful as a day (1–31), month (1–12), or year (1–3000) in the date-extraction context.

§Grammar

number     ::= thousands? hundreds? tens_units
thousands  ::= unit "thousand"
hundreds   ::= unit "hundred"
tens_units ::= tens unit?   (e.g. "twenty", "twenty-one", "twenty-third")
             | teen         (e.g. "fourteenth")
             | unit         (e.g. "seventh")
             | (empty)
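The grammar above could be implemented along the following lines. This is a sketch under stated assumptions rather than the module's parser: it uses exact matching on lowercase, already-split words (no fuzzy matching), accepts ordinal forms only in the tens_units position, and the names `lookup` and `parse_number` are hypothetical.

```rust
// (cardinal, ordinal) pairs; a word's value is derived from its table index.
const UNITS: [(&str, &str); 9] = [
    ("one", "first"), ("two", "second"), ("three", "third"),
    ("four", "fourth"), ("five", "fifth"), ("six", "sixth"),
    ("seven", "seventh"), ("eight", "eighth"), ("nine", "ninth"),
];
const TEENS: [(&str, &str); 10] = [
    ("ten", "tenth"), ("eleven", "eleventh"), ("twelve", "twelfth"),
    ("thirteen", "thirteenth"), ("fourteen", "fourteenth"),
    ("fifteen", "fifteenth"), ("sixteen", "sixteenth"),
    ("seventeen", "seventeenth"), ("eighteen", "eighteenth"),
    ("nineteen", "nineteenth"),
];
const TENS: [(&str, &str); 8] = [
    ("twenty", "twentieth"), ("thirty", "thirtieth"), ("forty", "fortieth"),
    ("fifty", "fiftieth"), ("sixty", "sixtieth"), ("seventy", "seventieth"),
    ("eighty", "eightieth"), ("ninety", "ninetieth"),
];

/// Map `word` to `first + step * index` if it appears in `table`.
fn lookup(table: &[(&str, &str)], word: &str, first: u32, step: u32, ordinal_ok: bool) -> Option<u32> {
    table
        .iter()
        .position(|&(card, ord)| word == card || (ordinal_ok && word == ord))
        .map(|idx| first + step * idx as u32)
}

/// Parse a whole span according to the grammar; `None` if any word is left
/// over or the value falls outside 1..=3000.
fn parse_number(words: &[&str]) -> Option<u32> {
    let mut i = 0;
    let mut total = 0;
    // thousands ::= unit "thousand", then hundreds ::= unit "hundred"
    for (marker, scale) in [("thousand", 1000), ("hundred", 100)] {
        if words.len() >= i + 2 && words[i + 1] == marker {
            if let Some(u) = lookup(&UNITS, words[i], 1, 1, false) {
                total += u * scale;
                i += 2;
            }
        }
    }
    // tens_units ::= tens unit? | teen | unit | (empty)
    if let Some(w) = words.get(i).copied() {
        if let Some(t) = lookup(&TENS, w, 20, 10, true) {
            total += t;
            i += 1;
            if let Some(u) = words.get(i).copied().and_then(|w| lookup(&UNITS, w, 1, 1, true)) {
                total += u;
                i += 1;
            }
        } else if let Some(t) = lookup(&TEENS, w, 10, 1, true) {
            total += t;
            i += 1;
        } else if let Some(u) = lookup(&UNITS, w, 1, 1, true) {
            total += u;
            i += 1;
        }
    }
    if i == words.len() && (1..=3000).contains(&total) {
        Some(total)
    } else {
        None
    }
}
```

For example, `parse_number(&["one", "thousand", "nine", "hundred", "eighty", "four"])` returns `Some(1984)`, and `parse_number(&["twenty", "third"])` returns `Some(23)`. The final range check also rejects the empty span, which the grammar itself would otherwise admit.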

Hyphenated compound words ("twenty-one") are split on the hyphen (-) before individual words are matched, so the hyphen acts as a word separator.
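As a rough illustration of that splitting step, the hypothetical helper below treats whitespace and hyphens alike as separators and lowercases each word. It is a sketch only: the real module may track word positions so that replacements can be written back into the original string.

```rust
/// Hypothetical tokeniser: split on whitespace and '-', drop empty pieces,
/// and lowercase each word ready for vocabulary matching.
fn split_words(utterance: &str) -> Vec<String> {
    utterance
        .split(|c: char| c.is_whitespace() || c == '-')
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}
```

Under this sketch, "Twenty-one" and "twenty one" both produce the words "twenty" and "one", so the two spellings parse identically.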

Functions§

replace_word_numbers
Scan utterance for word-number spans and replace each with its decimal representation, returning the modified string.