Module normalizer

Text normalisation for worker demographic data.

Research on worker identification (see spec.md §5) consistently finds that most accuracy gains come from standardising the input before scoring, not from cleverer similarity algorithms. This module exposes the canonical transformations the matching engine applies to names, postcodes, phone numbers, and phonetic codes.

All transformations are idempotent: f(f(x)) == f(x). They are also deterministic and allocate at most a single new String.
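The idempotence property can be checked mechanically. The following is a minimal sketch, not this module's code: `normalize_postcode_sketch` is a hypothetical stand-in that follows the "strip whitespace, uppercase" rule, and `is_idempotent_on` is an illustrative helper name.

```rust
/// Hypothetical stand-in for a postcode normaliser (assumption, not the
/// module's implementation): strip whitespace, then uppercase.
fn normalize_postcode_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|c| !c.is_whitespace())
        .flat_map(char::to_uppercase)
        .collect()
}

/// Returns true when applying `f` twice to `input` gives the same result as
/// applying it once, i.e. f(f(x)) == f(x) for this particular x.
fn is_idempotent_on(f: impl Fn(&str) -> String, input: &str) -> bool {
    let once = f(input);
    let twice = f(&once);
    once == twice
}
```

A property-style test would call `is_idempotent_on` over a corpus of representative inputs. A single passing input shows nothing about the general case, but a single failing input is enough to disprove the property.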

§Quick examples

use worker_matcher::Normalizer;

// Names: lowercase, drop diacritics, drop ASCII punctuation, collapse spaces.
assert_eq!(Normalizer::normalize_name("  O'Brien  "), "obrien");
assert_eq!(Normalizer::normalize_name("Siân"),         "sian");

// Postcodes: strip whitespace, uppercase.
assert_eq!(Normalizer::normalize_postcode("cf10 1aa"), "CF101AA");

// Phone numbers: keep digits, strip international and trunk prefixes.
assert_eq!(Normalizer::normalize_phone("+44 7700 900123"), "7700900123");
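
The four name steps listed in the comment above (lowercase, drop diacritics, drop ASCII punctuation, collapse spaces) could be composed roughly as follows. This is a sketch under stated assumptions: `normalize_name_sketch` and `fold_diacritic` are hypothetical names, the diacritic table is a small hand-written sample that only covers precomposed characters, and the sketch makes no attempt to meet the single-allocation guarantee described earlier.

```rust
/// Hypothetical, deliberately incomplete diacritic table (assumption).
/// Only precomposed lowercase letters are handled; decomposed input
/// (base letter followed by a combining mark) is not.
fn fold_diacritic(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ä' | 'ã' | 'å' => 'a',
        'ç' => 'c',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ñ' => 'n',
        'ò' | 'ó' | 'ô' | 'ö' | 'õ' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        'ŵ' => 'w',
        'ý' | 'ÿ' | 'ŷ' => 'y',
        other => other,
    }
}

/// Sketch of the name pipeline: lowercase, fold diacritics, drop ASCII
/// punctuation, then collapse runs of whitespace into single spaces.
fn normalize_name_sketch(input: &str) -> String {
    let folded: String = input
        .chars()
        .flat_map(char::to_lowercase)
        .map(fold_diacritic)
        .filter(|c| !c.is_ascii_punctuation())
        .collect();
    // split_whitespace also trims leading and trailing whitespace.
    folded.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Lowercasing first keeps the diacritic table half the size, since only lowercase letters need entries.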

§What this module deliberately does not do

  • It does not validate NHS numbers — that is delegated to the nhs-number crate at the call-site (see crate::matcher).
  • It does not normalise email addresses or middle names (see spec tasks T-11 and OQ-1 respectively).
  • It does not handle non-ASCII punctuation such as the curly apostrophe (U+2019). Upstream code should convert those to ASCII first.
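
For the last point, an upstream caller might fold typographic punctuation to ASCII before invoking the normaliser. The helper below is a hypothetical sketch (the name `ascii_fold_punctuation` and the choice of code points are assumptions, not part of this crate):

```rust
/// Hypothetical upstream helper (assumption): map a few common typographic
/// quotation marks to their ASCII equivalents and leave everything else as is.
fn ascii_fold_punctuation(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            // Left and right single quotation marks (curly apostrophes).
            '\u{2018}' | '\u{2019}' => '\'',
            // Left and right double quotation marks.
            '\u{201C}' | '\u{201D}' => '"',
            other => other,
        })
        .collect()
}
```

With this in place, `O\u{2019}Brien` becomes `O'Brien`, which the ASCII-only punctuation rule can then handle.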

§International phone numbers

Two phone normalisers are provided:

  • Normalizer::normalize_phone — UK-centric national-significant form, suitable for legacy or single-jurisdiction call-sites. Idempotent and infallible.
  • Normalizer::normalize_phone_e164 — international-aware E.164 form (+CCNNNN…) for jurisdictions in the supported country table. Returns None if the input cannot be confidently parsed.

The matching engine tries E.164 first and falls back to the legacy form when either input cannot be parsed. Existing single-country deployments therefore observe the same behaviour, while multinational deployments gain cross-country disambiguation: a French number and a UK number that share the same trunk digits no longer collide.
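
The fallback rule can be illustrated with a toy model. Everything here is an assumption made for illustration: the two-entry calling-code table, the parsing rules, and the names `e164_sketch`, `legacy_sketch`, and `phones_match` are hypothetical and do not reproduce the crate's actual behaviour.

```rust
/// Hypothetical supported-country table: calling codes only (assumption).
const SUPPORTED_CALLING_CODES: [&str; 2] = ["44", "33"];

/// Toy E.164 parser: accepts only input written with a leading '+' and a
/// supported calling code; anything else is treated as unparseable.
fn e164_sketch(input: &str) -> Option<String> {
    let rest = input.trim().strip_prefix('+')?;
    let digits: String = rest.chars().filter(|c| c.is_ascii_digit()).collect();
    for cc in SUPPORTED_CALLING_CODES {
        if let Some(national) = digits.strip_prefix(cc) {
            // Drop a trunk '0' written after the calling code, as in "+44 (0)7700".
            let national = national.strip_prefix('0').unwrap_or(national);
            if national.is_empty() {
                return None;
            }
            return Some(format!("+{}{}", cc, national));
        }
    }
    None
}

/// Toy legacy form: digits only, with any supported calling code (when the
/// input starts with '+') and leading zeros removed. The country is lost.
fn legacy_sketch(input: &str) -> String {
    let digits: String = input.chars().filter(|c| c.is_ascii_digit()).collect();
    let mut national = digits.as_str();
    if input.trim_start().starts_with('+') {
        for cc in SUPPORTED_CALLING_CODES {
            if let Some(rest) = national.strip_prefix(cc) {
                national = rest;
                break;
            }
        }
    }
    national.trim_start_matches('0').to_string()
}

/// Compare in E.164 form when both sides parse; otherwise fall back to the
/// legacy form for both sides.
fn phones_match(a: &str, b: &str) -> bool {
    match (e164_sketch(a), e164_sketch(b)) {
        (Some(x), Some(y)) => x == y,
        _ => legacy_sketch(a) == legacy_sketch(b),
    }
}
```

In this model the legacy forms of `+44 7700 900123` and `+33 7700 900123` are identical, so a legacy-only comparison would treat them as the same number, while the E.164 comparison keeps them apart. A number written without a country prefix, such as `07700 900123`, cannot be parsed by the toy E.164 routine, so the pair is compared in legacy form.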

§Structs

  • Normalizer: Stateless namespace for text normalisation routines.
  • ParsedAddressLine: Structured decomposition of a postal-address line.