Text normalisation for worker demographic data.
Research on worker identification (see spec.md §5) is unanimous: most
accuracy gains come from standardising the input before scoring, not
from cleverer similarity algorithms. This module exposes the canonical
transformations the matching engine applies to names, postcodes, phone
numbers, and phonetic codes.
All transformations are idempotent: f(f(x)) == f(x). They are also
deterministic and allocate at most a single new String.
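The idempotence property can be spot-checked mechanically. The following is a minimal sketch, not the crate's implementation: `normalize_postcode_sketch` and `is_idempotent_on` are hypothetical names introduced here, and the stand-in follows only the postcode rule described in this module (strip whitespace, uppercase).

```rust
/// Hypothetical stand-in for a postcode normaliser (NOT the crate's code):
/// strips all whitespace and uppercases, building exactly one new String.
fn normalize_postcode_sketch(input: &str) -> String {
    input
        .chars()
        .filter(|c| !c.is_whitespace())
        .flat_map(char::to_uppercase)
        .collect()
}

/// Hypothetical helper: returns true when applying `f` twice to `input`
/// yields the same value as applying it once, i.e. f(f(x)) == f(x).
fn is_idempotent_on<F>(f: F, input: &str) -> bool
where
    F: Fn(&str) -> String,
{
    let once = f(input);
    let twice = f(&once);
    once == twice
}
```

A property-based test could call `is_idempotent_on` over many generated inputs; a check on a handful of fixed strings only demonstrates the property and does not prove it.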
§Quick examples
use worker_matcher::Normalizer;
// Names: lowercase, drop diacritics, drop ASCII punctuation, collapse spaces.
assert_eq!(Normalizer::normalize_name(" O'Brien "), "obrien");
assert_eq!(Normalizer::normalize_name("Siân"), "sian");
// Postcodes: strip whitespace, uppercase.
assert_eq!(Normalizer::normalize_postcode("cf10 1aa"), "CF101AA");
// Phone numbers: keep digits, strip international and trunk prefixes.
assert_eq!(Normalizer::normalize_phone("+44 7700 900123"), "7700900123");
§What this module deliberately does not do
- It does not validate NHS numbers; that is delegated to the
nhs-number crate at the call-site (see crate::matcher).
- It does not normalise email addresses or middle names (see spec tasks T-11 and OQ-1 respectively).
- It does not handle non-ASCII punctuation such as the curly apostrophe
’ (U+2019). Upstream code should convert those to ASCII first.
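One way upstream code might perform that conversion is sketched below. `ascii_apostrophes` is a hypothetical helper, not part of this crate, and it handles only U+2019; other typographic punctuation would need its own mapping.

```rust
/// Hypothetical upstream pre-processing step (not part of this crate):
/// replaces the right single quotation mark U+2019 with an ASCII apostrophe.
/// All other characters are passed through unchanged.
fn ascii_apostrophes(input: &str) -> String {
    input.replace('\u{2019}', "'")
}
```

The result can then be passed to the name normaliser, which drops ASCII punctuation as usual.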
§International phone numbers
Two phone normalisers are provided:
- Normalizer::normalize_phone: UK-centric national-significant form, suitable for legacy or single-jurisdiction call-sites. Idempotent and infallible.
- Normalizer::normalize_phone_e164: international-aware E.164 form (+CCNNNN…) for jurisdictions in the supported country table. Returns None if the input cannot be confidently parsed.
The matching engine tries E.164 first and falls back to the legacy form when either input is unparseable. Existing single-country deployments therefore observe the same behaviour, while multinational deployments gain cross-country disambiguation: a French number and a UK number that share the same trunk digits no longer collide.
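The fallback rule can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual code: `phones_match`, `toy_e164`, and `toy_legacy` are hypothetical names, and the two toy normalisers are deliberately crude stand-ins (the toy legacy form assumes a two-digit country code after `+`).

```rust
/// Hypothetical comparison: use E.164 when BOTH inputs parse, otherwise
/// fall back to comparing the legacy national form of both inputs.
fn phones_match<E, L>(a: &str, b: &str, to_e164: E, to_legacy: L) -> bool
where
    E: Fn(&str) -> Option<String>,
    L: Fn(&str) -> String,
{
    match (to_e164(a), to_e164(b)) {
        (Some(x), Some(y)) => x == y,
        // Either input was unparseable: compare legacy forms instead.
        _ => to_legacy(a) == to_legacy(b),
    }
}

/// Toy E.164 stand-in (assumption): only inputs starting with '+' parse.
fn toy_e164(input: &str) -> Option<String> {
    let rest = input.trim().strip_prefix('+')?;
    let digits: String = rest.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() {
        None
    } else {
        Some(format!("+{}", digits))
    }
}

/// Toy legacy stand-in (assumption): keep digits, then drop either a
/// two-digit country code (input began with '+') or one leading trunk '0'.
fn toy_legacy(input: &str) -> String {
    let trimmed = input.trim();
    let digits: String = trimmed.chars().filter(|c| c.is_ascii_digit()).collect();
    if trimmed.starts_with('+') {
        digits.chars().skip(2).collect()
    } else {
        digits.strip_prefix('0').unwrap_or(&digits).to_string()
    }
}
```

With these stand-ins, "+44 7700 900123" and "+33 7700 900123" share a legacy form but differ in E.164 form, so they no longer match; "+44 7700 900123" and "07700 900123" still match through the legacy fallback.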
§Structs
- Normalizer: Stateless namespace for text normalisation routines.
- ParsedAddressLine: Structured decomposition of a postal-address line.