pub struct Normalizer;
Stateless namespace for text normalisation routines.
Normalizer is a unit type with no fields; every method is an associated function.
It is a struct rather than a module of free functions purely so the
public API has a single, discoverable entry point.
use worker_matcher::Normalizer;
let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");
Implementations§
impl Normalizer
pub fn normalize_name(name: &str) -> String
Normalise a human name for comparison.
Steps, in order:
- Decompose to Unicode NFKD form (é → e + combining acute).
- Drop combining marks (diacritics).
- Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
- Lowercase.
- Collapse consecutive whitespace to single ASCII spaces; trim ends.
The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.
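The last three steps can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `normalize_ascii_name` is a hypothetical name, and the NFKD decomposition and combining-mark removal (steps 1 and 2) are deliberately omitted because they need Unicode tables that `std` does not provide.

```rust
/// Hypothetical helper covering steps 3 to 5 only (no Unicode decomposition).
fn normalize_ascii_name(name: &str) -> String {
    // Step 3: drop ASCII punctuation. Step 4: lowercase what remains.
    let cleaned: String = name
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .flat_map(|c| c.to_lowercase())
        .collect();
    // Step 5: split_whitespace skips leading, trailing, and repeated
    // whitespace, so re-joining with one space collapses and trims at once.
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Because the output contains no punctuation, no uppercase letters, and no repeated whitespace, running the sketch on its own output changes nothing, which is the idempotence property the real function also documents.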
§Examples
Whitespace is collapsed and trimmed:
use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name(" John Smith "), "john smith");
Apostrophes and hyphens are stripped:
assert_eq!(Normalizer::normalize_name("O'Brien"), "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"), "maryjane");
Diacritics are removed:
assert_eq!(Normalizer::normalize_name("José"), "jose");
assert_eq!(Normalizer::normalize_name("Siân"), "sian");
assert_eq!(Normalizer::normalize_name("Łukasz"), "łukasz"); // ł has no decomposition
Empty and whitespace-only input normalises to the empty string:
assert_eq!(Normalizer::normalize_name(""), "");
assert_eq!(Normalizer::normalize_name(" "), "");
The function is idempotent:
let once = Normalizer::normalize_name(" José-María ");
let twice = Normalizer::normalize_name(&once);
assert_eq!(once, twice);
pub fn normalize_postcode(postcode: &str) -> String
Normalise a postcode for comparison.
Steps: drop all whitespace, then uppercase. No locale-specific validation — that is intentionally out of scope.
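As a rough sketch of those two steps, assuming nothing beyond the description above (`normalize_postcode_sketch` is a hypothetical name, not the crate's function):

```rust
/// Hypothetical sketch: remove every whitespace character, then uppercase.
fn normalize_postcode_sketch(postcode: &str) -> String {
    postcode
        .chars()
        .filter(|c| !c.is_whitespace())
        .flat_map(|c| c.to_uppercase())
        .collect()
}
```

Nothing here checks that the result looks like a postcode in any country, which matches the stated decision to leave validation out of scope.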
§Examples
UK postcodes with and without the conventional space are equivalent:
use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_postcode("CF10 1AA"), "CF101AA");
assert_eq!(Normalizer::normalize_postcode("cf101aa"), "CF101AA");
assert_eq!(Normalizer::normalize_postcode(" cf10 1aa "), "CF101AA");
Empty input is preserved:
assert_eq!(Normalizer::normalize_postcode(""), "");
Idempotent:
let once = Normalizer::normalize_postcode("sw1a 2aa");
let twice = Normalizer::normalize_postcode(&once);
assert_eq!(once, twice);
pub fn normalize_phone(phone: &str) -> String
Normalise a phone number for comparison.
Steps:
- Keep only ASCII digits (drop spaces, brackets, hyphens, +, …).
- If the result starts with 0044, drop those four characters.
- Else, if the result starts with 44 and is at least 12 digits long, drop the leading 44.
- Else, if the result starts with 0 and is longer than one digit, drop the leading 0.
This canonicalises the common UK formats into a single subscriber number with no leading prefix. Numbers from other countries are reduced to their digits and otherwise pass through unchanged, provided none of the prefix rules above happens to match.
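The prefix rules above can be sketched as a chain of early returns in which the first matching rule wins. This is an illustration of the documented steps only; `normalize_uk_phone_sketch` is a hypothetical name.

```rust
/// Hypothetical sketch of the four documented steps.
fn normalize_uk_phone_sketch(phone: &str) -> String {
    // Step 1: keep only ASCII digits.
    let digits: String = phone.chars().filter(|c| c.is_ascii_digit()).collect();

    // Step 2: international access code followed by the UK dial code.
    if let Some(rest) = digits.strip_prefix("0044") {
        return rest.to_string();
    }
    // Step 3: bare UK dial code, only when the number is long enough
    // that "44" is unlikely to be part of the subscriber number.
    if digits.starts_with("44") && digits.len() >= 12 {
        return digits[2..].to_string();
    }
    // Step 4: national trunk prefix.
    if digits.len() > 1 && digits.starts_with('0') {
        return digits[1..].to_string();
    }
    digits
}
```

The length guard in step 3 is what keeps a short subscriber number that merely begins with 44 from losing its first two digits.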
§Examples
use worker_matcher::Normalizer;
// UK mobile, in three formats:
assert_eq!(Normalizer::normalize_phone("07700 900123"), "7700900123");
assert_eq!(Normalizer::normalize_phone("+44 7700 900123"), "7700900123");
assert_eq!(Normalizer::normalize_phone("0044 7700 900123"), "7700900123");
// UK landline with brackets and spaces:
assert_eq!(Normalizer::normalize_phone("(029) 2034 5678"), "2920345678");
// Empty input is preserved (no digits to keep):
assert_eq!(Normalizer::normalize_phone(""), "");
Idempotent on canonical inputs:
let once = Normalizer::normalize_phone("07700 900123");
let twice = Normalizer::normalize_phone(&once);
assert_eq!(once, twice);
pub fn normalize_phone_e164(phone: &str, default_country: Option<&str>) -> Option<String>
Normalise a phone number to its E.164-style canonical form.
E.164 is the ITU-T standard for international telephone numbers and
has the shape +CCNNN…, where CC is the country dialling code
(1–3 digits) and the remainder is the national-significant number
(NSN) with no trunk prefix.
The function accepts a wide range of textual layouts:
- +CC… (explicit international, the canonical input form).
- 00CC… (international access code, common across Europe).
- 0… (national format, trunk prefix), interpreted relative to default_country when the country uses a national trunk 0.
- NSN… (bare national-significant number), interpreted relative to default_country.
Returns Some(canonical) if the input parses against a country in
the supported table; otherwise None. The supported countries are
the five jurisdictions for which the crate exposes a national
healthcare identifier (United Kingdom, France, Spain, Ireland, and
UK Northern Ireland, which shares the GB dial code), plus the most
common worker-mobility partners (US, CA, DE, IT, NL, BE, PT, CH,
AT, SE, NO, DK, FI, PL, AU, NZ, JP, CN, IN, BR, MX, ZA).
default_country is the ISO 3166-1 alpha-2 code (e.g. "GB", "FR", "US") of the
jurisdiction whose national format applies when the input lacks an
explicit international marker. Pass None to refuse to assume a
default; in that case only explicit +CC / 00CC inputs will parse.
The function is deterministic and idempotent: feeding a
canonical +CCNNN… string back in returns the same string.
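To make the dispatch between the four input layouts concrete, here is a minimal sketch under loud assumptions: the three-row `COUNTRIES` table, its tuple layout, and the name `to_e164_sketch` are all hypothetical, and the sketch performs no length or numbering-plan validation, which the real function may well do.

```rust
/// Hypothetical table: (ISO code, dial code, national trunk prefix).
const COUNTRIES: &[(&str, &str, Option<&str>)] = &[
    ("GB", "44", Some("0")),
    ("FR", "33", Some("0")),
    ("US", "1", None),
];

fn to_e164_sketch(phone: &str, default_country: Option<&str>) -> Option<String> {
    let trimmed = phone.trim();
    let digits: String = trimmed.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() {
        return None;
    }
    // Explicit international forms: "+CC…" or "00CC…".
    let international = if trimmed.starts_with('+') {
        Some(digits.as_str())
    } else {
        digits.strip_prefix("00")
    };
    if let Some(rest) = international {
        // Accept the number only if it begins with a dial code in the table.
        return COUNTRIES
            .iter()
            .find(|(_, cc, _)| rest.starts_with(*cc) && rest.len() > cc.len())
            .map(|_| format!("+{}", rest));
    }
    // National forms cannot be interpreted without a default country.
    let iso = default_country?;
    let (_, cc, trunk) = COUNTRIES.iter().find(|(code, _, _)| *code == iso)?;
    // Drop the trunk prefix when the country uses one and it is present;
    // otherwise treat the digits as a bare national-significant number.
    let nsn = match trunk {
        Some(t) => digits.strip_prefix(*t).unwrap_or(digits.as_str()),
        None => digits.as_str(),
    };
    Some(format!("+{}{}", cc, nsn))
}
```

An explicit international marker always takes precedence over `default_country` in this sketch, which is why a French number passed with `Some("GB")` still comes out with the +33 prefix.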
§Examples
UK mobile, three textual layouts, all canonicalise to the same E.164 form:
use worker_matcher::Normalizer;
assert_eq!(
Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")),
Some("+447700900123".to_string()),
);
assert_eq!(
Normalizer::normalize_phone_e164("0044 7700 900123", Some("GB")),
Some("+447700900123".to_string()),
);
assert_eq!(
Normalizer::normalize_phone_e164("07700 900123", Some("GB")),
Some("+447700900123".to_string()),
);
French national format vs international form:
assert_eq!(
Normalizer::normalize_phone_e164("01 23 45 67 89", Some("FR")),
Some("+33123456789".to_string()),
);
assert_eq!(
Normalizer::normalize_phone_e164("+33 1 23 45 67 89", Some("GB")),
Some("+33123456789".to_string()),
);
North American (NANP) numbers have no trunk prefix:
assert_eq!(
Normalizer::normalize_phone_e164("(415) 555-1234", Some("US")),
Some("+14155551234".to_string()),
);
assert_eq!(
Normalizer::normalize_phone_e164("+1 415 555 1234", None),
Some("+14155551234".to_string()),
);
Unparseable or ambiguous inputs return None:
// No default country and no international marker: ambiguous.
assert_eq!(Normalizer::normalize_phone_e164("07700 900123", None), None);
// Unknown dial code.
assert_eq!(Normalizer::normalize_phone_e164("+999 1234567", None), None);
// Empty input.
assert_eq!(Normalizer::normalize_phone_e164("", Some("GB")), None);
Idempotent on canonical inputs:
let once = Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")).unwrap();
let twice = Normalizer::normalize_phone_e164(&once, Some("GB")).unwrap();
assert_eq!(once, twice);
pub fn expand_street_abbreviations(line: &str) -> String
Expand common postal address abbreviations as whole tokens.
The input is tokenised on whitespace and each token is matched
case-insensitively (after stripping a single trailing . or ,)
against a fixed table of street-type and directional abbreviations.
Recognised tokens are replaced with their long form, lowercased;
unrecognised tokens are passed through verbatim. Tokens are then
re-joined by single spaces.
This function is intentionally simple: it does not apply any
position-aware heuristics. The well-known ambiguous case "St" —
which can mean Street or Saint — is always expanded to
Street. In practice this remains useful for fuzzy matching
because the canonical form is consistent on both sides of a
comparison; pre-process upstream if you need finer disambiguation.
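A minimal sketch of the token-by-token expansion described above. The five-entry table in `long_form` and the names `long_form` and `expand_sketch` are hypothetical; the crate's real table is not shown in this page and is presumably larger.

```rust
/// Hypothetical lookup table, matched case-insensitively.
fn long_form(token: &str) -> Option<&'static str> {
    match token.to_ascii_lowercase().as_str() {
        "st" => Some("street"),
        "rd" => Some("road"),
        "ave" => Some("avenue"),
        "blvd" => Some("boulevard"),
        "n" => Some("north"),
        _ => None,
    }
}

fn expand_sketch(line: &str) -> String {
    line.split_whitespace()
        .map(|token| {
            // Strip at most one trailing '.' or ',' before the lookup.
            let core = token
                .strip_suffix('.')
                .or_else(|| token.strip_suffix(','))
                .unwrap_or(token);
            // Recognised tokens are replaced by the lowercase long form;
            // everything else passes through verbatim, punctuation included.
            long_form(core).unwrap_or(token)
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because the long forms are not themselves keys in the table, a second pass leaves the output untouched.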
§Examples
use worker_matcher::Normalizer;
assert_eq!(
Normalizer::expand_street_abbreviations("123 High St"),
"123 High street",
);
assert_eq!(
Normalizer::expand_street_abbreviations("45 N. Park Ave."),
"45 north Park avenue",
);
assert_eq!(
Normalizer::expand_street_abbreviations("12 Sunset Blvd"),
"12 Sunset boulevard",
);
Idempotent on already-expanded inputs (long forms are not re-expanded):
let once = Normalizer::expand_street_abbreviations("10 Downing St");
let twice = Normalizer::expand_street_abbreviations(&once);
assert_eq!(once, twice);
pub fn normalize_address_line(line: &str) -> String
Normalise an address line for comparison.
Pipeline:
- Expand street-type and directional abbreviations via Normalizer::expand_street_abbreviations (so "St" → "street", "Rd" → "road", "N" → "north").
- Apply the name-normalisation pipeline (Normalizer::normalize_name): NFKD-decompose, drop combining marks, drop ASCII punctuation, lowercase, collapse whitespace.
The result is idempotent and suitable for direct equality or similarity comparison.
§Examples
Abbreviated and full forms canonicalise identically:
use worker_matcher::Normalizer;
assert_eq!(
Normalizer::normalize_address_line("123 High St"),
Normalizer::normalize_address_line("123 High Street"),
);
assert_eq!(
Normalizer::normalize_address_line("45 N Park Ave"),
Normalizer::normalize_address_line("45 North Park Avenue"),
);
Punctuation and case are normalised:
assert_eq!(
Normalizer::normalize_address_line("10, DOWNING Street."),
"10 downing street",
);
pub fn parse_address_line(line: &str) -> ParsedAddressLine
Parse an address line into its structured components.
The function performs a best-effort structural decomposition of a single-line postal address into:
- house_number: the leading run of digits (with an optional single alphabetic suffix, e.g. "10A"), uppercased. None if no leading number is present.
- unit: a recognised sub-unit prefix (Flat, Apt, Apartment, Unit, Suite, Ste) and its identifier, lowercased and space-joined (e.g. "flat 2a"). None if no recognised prefix is present.
- street: the remaining text after unit and house_number are removed, run through Normalizer::normalize_address_line.
Parsing is deterministic and format-only — no postal
reference is consulted. Inputs that do not match the simple
regular structure (e.g. a postcode-only string, a city name)
degrade gracefully: house_number and unit are None, and
street carries the normalised input.
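The decomposition can be sketched as a left-to-right scan over whitespace tokens. Everything named here (`ParsedSketch`, `UNIT_PREFIXES`, `is_house_number`, `parse_sketch`) is hypothetical, and the sketch simplifies one documented step: it only lowercases the street text rather than running the full address-line normalisation, so abbreviations such as "St" are not expanded.

```rust
#[derive(Debug, PartialEq)]
struct ParsedSketch {
    house_number: Option<String>,
    unit: Option<String>,
    street: String,
}

const UNIT_PREFIXES: [&str; 6] = ["flat", "apt", "apartment", "unit", "suite", "ste"];

/// True for a run of digits with an optional single alphabetic suffix.
fn is_house_number(token: &str) -> bool {
    let digits = token.trim_end_matches(|c: char| c.is_ascii_alphabetic());
    !digits.is_empty()
        && digits.chars().all(|c| c.is_ascii_digit())
        && token.len() - digits.len() <= 1
}

fn parse_sketch(line: &str) -> ParsedSketch {
    // Tokenise on whitespace and drop trailing commas and full stops.
    let mut tokens: Vec<&str> = line
        .split_whitespace()
        .map(|t| t.trim_end_matches(|c: char| c == ',' || c == '.'))
        .filter(|t| !t.is_empty())
        .collect();

    // A recognised prefix consumes itself and the identifier that follows.
    let mut unit = None;
    if tokens.len() >= 2 && UNIT_PREFIXES.iter().any(|p| p.eq_ignore_ascii_case(tokens[0])) {
        unit = Some(format!("{} {}", tokens[0], tokens[1]).to_lowercase());
        tokens.drain(..2);
    }

    // The next token is the house number only if it has the expected shape.
    let mut house_number = None;
    if let Some(first) = tokens.first().copied() {
        if is_house_number(first) {
            house_number = Some(first.to_uppercase());
            tokens.remove(0);
        }
    }

    // Simplification: lowercase only, no abbreviation expansion.
    ParsedSketch { house_number, unit, street: tokens.join(" ").to_lowercase() }
}
```

Inputs with neither a unit prefix nor a leading number fall through both checks, so the whole line ends up in `street`, mirroring the graceful degradation described above.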
§Examples
Typical UK / US single-line addresses:
use worker_matcher::Normalizer;
let p = Normalizer::parse_address_line("123 High Street");
assert_eq!(p.house_number.as_deref(), Some("123"));
assert_eq!(p.unit, None);
assert_eq!(p.street, "high street");
let p = Normalizer::parse_address_line("10A Downing St");
assert_eq!(p.house_number.as_deref(), Some("10A"));
assert_eq!(p.street, "downing street");
let p = Normalizer::parse_address_line("Flat 2A, 10 Downing Street");
assert_eq!(p.unit.as_deref(), Some("flat 2a"));
assert_eq!(p.house_number.as_deref(), Some("10"));
assert_eq!(p.street, "downing street");
let p = Normalizer::parse_address_line("Apt 5, 1600 Pennsylvania Ave");
assert_eq!(p.unit.as_deref(), Some("apt 5"));
assert_eq!(p.house_number.as_deref(), Some("1600"));
assert_eq!(p.street, "pennsylvania avenue");
Inputs without a leading number still parse:
let p = Normalizer::parse_address_line("Buckingham Palace");
assert_eq!(p.house_number, None);
assert_eq!(p.unit, None);
assert_eq!(p.street, "buckingham palace");
pub fn phonetic_code(name: &str) -> String
Compute a phonetic (Soundex) code for a name.
Internally, the input is first normalised via
Normalizer::normalize_name and then encoded with the American
Soundex algorithm. Names that sound alike map to the same code, which
lets the matcher catch spelling variants such as “Smith” / “Smyth” or
“Stephen” / “Steven”.
The implementation is suitable for English-language names. Non-English
phonemes may be lost. T-9 (spec §21.4) decided to keep Soundex as the
default and expose an opt-in MatchConfig::phonetic_encoder enum
(Double Metaphone, Daitch-Mokotoff) gated behind a Cargo feature flag
once an empirical multinational worker corpus is available;
implementation is tracked as T-9.1.
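For readers unfamiliar with the encoding, here is a self-contained sketch of American Soundex. It is not the crate's code: `soundex_digit` and `soundex_sketch` are hypothetical names, and the sketch simply skips non-ASCII letters instead of first running the name-normalisation pipeline.

```rust
/// Digit class for a lowercase ASCII letter; None for vowels, h, w, and y.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

fn soundex_sketch(name: &str) -> String {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase());
    // Empty input yields an empty code rather than a padded default.
    let first = match letters.next() {
        Some(c) => c,
        None => return String::new(),
    };
    let mut code = first.to_ascii_uppercase().to_string();
    // The first letter's digit still suppresses an identical following digit.
    let mut last = soundex_digit(first);
    for c in letters {
        if code.len() == 4 {
            break;
        }
        match soundex_digit(c) {
            Some(d) => {
                if last != Some(d) {
                    code.push(d);
                }
                last = Some(d);
            }
            // h and w do not separate letters that share a digit.
            None if c == 'h' || c == 'w' => {}
            // Vowels and y end the run, so a repeated digit is emitted again.
            None => last = None,
        }
    }
    // Pad short codes with zeros to the fixed length of four.
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

The distinction between the two `None` arms is what separates American Soundex from simpler variants: "Ashcraft" encodes as A261 because the h does not break the s/c run, while "Tymczak" encodes as T522 because the vowel between z and k does.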
§Examples
Similar-sounding spellings share a code:
use worker_matcher::Normalizer;
assert_eq!(Normalizer::phonetic_code("Smith"), Normalizer::phonetic_code("Smyth"));
assert_eq!(Normalizer::phonetic_code("Stephen"), Normalizer::phonetic_code("Steven"));
Different families produce different codes:
assert_ne!(Normalizer::phonetic_code("Jones"), Normalizer::phonetic_code("Smith"));
Empty input returns an empty string, not a default Soundex value:
assert_eq!(Normalizer::phonetic_code(""), "");
assert_eq!(Normalizer::phonetic_code(" "), "");
pub fn normalize_email(email: &str, gmail_dot_folding: bool) -> Option<String>
Normalise an email address for comparison.
Steps:
- Trim surrounding whitespace.
- Lowercase the entire address (RFC 5321 makes the domain case-insensitive and most real-world deployments treat the localpart case-insensitively too; case-sensitive localparts are technically legal but vanishingly rare in healthcare data).
- Reject inputs that lack exactly one @ or that have an empty localpart or domain by returning None.
- If gmail_dot_folding is true and the domain is gmail.com or googlemail.com, strip every . from the localpart and drop any +tag suffix. Both transformations preserve the delivery target for Gmail addresses under Google’s documented routing rules: j.smith@gmail.com, js.mith@gmail.com, and jsmith+work@gmail.com all deliver to the same mailbox as jsmith@gmail.com.
The function is deterministic and idempotent on successful outputs.
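The steps above can be sketched as follows. `normalize_email_sketch` is a hypothetical name, and the final guard against a localpart that becomes empty after folding is an assumption of this sketch rather than documented behaviour.

```rust
/// Hypothetical sketch of the documented steps.
fn normalize_email_sketch(email: &str, gmail_dot_folding: bool) -> Option<String> {
    // Steps 1 and 2: trim, then lowercase the whole address.
    let lowered = email.trim().to_lowercase();
    // Step 3: exactly one '@' with non-empty text on both sides.
    let (local, domain) = lowered.split_once('@')?;
    if local.is_empty() || domain.is_empty() || domain.contains('@') {
        return None;
    }
    // Step 4: optional folding for the two Gmail domains only.
    if gmail_dot_folding && (domain == "gmail.com" || domain == "googlemail.com") {
        // Drop the "+tag" suffix first, then remove every dot.
        let base = local.split('+').next().unwrap_or(local);
        let folded = base.replace('.', "");
        // Assumption of this sketch: refuse an address whose localpart
        // folds away to nothing (for example "+tag@gmail.com").
        if folded.is_empty() {
            return None;
        }
        return Some(format!("{}@{}", folded, domain));
    }
    Some(format!("{}@{}", local, domain))
}
```

Note that the domain is left as written: a googlemail.com address is folded but not rewritten to gmail.com, matching the documented example.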
§Examples
Common case-and-whitespace normalisation:
use worker_matcher::Normalizer;
assert_eq!(
Normalizer::normalize_email(" Alice@Example.ORG ", false),
Some("alice@example.org".to_string()),
);
Malformed inputs return None:
assert_eq!(Normalizer::normalize_email("no-at-sign", false), None);
assert_eq!(Normalizer::normalize_email("@example.org", false), None);
assert_eq!(Normalizer::normalize_email("alice@", false), None);
assert_eq!(Normalizer::normalize_email("a@b@c", false), None);
assert_eq!(Normalizer::normalize_email("", false), None);
Optional Gmail dot-folding:
assert_eq!(
Normalizer::normalize_email("j.smith@gmail.com", true),
Some("jsmith@gmail.com".to_string()),
);
assert_eq!(
Normalizer::normalize_email("jsmith+work@googlemail.com", true),
Some("jsmith@googlemail.com".to_string()),
);
// Dot-folding does not touch non-Gmail addresses.
assert_eq!(
Normalizer::normalize_email("j.smith@example.org", true),
Some("j.smith@example.org".to_string()),
);
Idempotent on canonical inputs:
let once = Normalizer::normalize_email("Alice@Example.ORG", false).unwrap();
let twice = Normalizer::normalize_email(&once, false).unwrap();
assert_eq!(once, twice);
Auto Trait Implementations§
impl Freeze for Normalizer
impl RefUnwindSafe for Normalizer
impl Send for Normalizer
impl Sync for Normalizer
impl Unpin for Normalizer
impl UnsafeUnpin for Normalizer
impl UnwindSafe for Normalizer
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.