Struct Normalizer 

pub struct Normalizer;

Stateless namespace for text normalisation routines.

Normalizer is a unit struct with no fields, and every method is an associated function. It is a struct rather than a module of free functions purely so that the public API has a single, discoverable entry point.

use worker_matcher::Normalizer;

let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");

Implementations§

impl Normalizer

pub fn normalize_name(name: &str) -> String

Normalise a human name for comparison.

Steps, in order:

  1. Decompose to Unicode NFKD form (é → e + combining acute).
  2. Drop combining marks (diacritics).
  3. Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
  4. Lowercase.
  5. Collapse consecutive whitespace to single ASCII spaces; trim ends.

The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.
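
The later steps can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: it assumes the input has already been NFKD-decomposed (step 1 needs Unicode tables that the standard library does not ship), the name normalize_decomposed_sketch is hypothetical, and the combining-mark test covers only the U+0300..=U+036F block, which is a simplification.

```rust
/// Hypothetical sketch of steps 2-5. Assumes `name` is already in NFKD form.
fn normalize_decomposed_sketch(name: &str) -> String {
    let cleaned: String = name
        .chars()
        // Step 2 (simplified): drop marks in the Combining Diacritical Marks block.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        // Step 3: drop ASCII punctuation such as apostrophes and hyphens.
        .filter(|c| !c.is_ascii_punctuation())
        // Step 4: lowercase (one char may lowercase to several chars).
        .flat_map(char::to_lowercase)
        .collect();
    // Step 5: split_whitespace skips leading, trailing, and repeated whitespace.
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

For instance, passing "Jose\u{0301}" (an already-decomposed "José") drops the combining acute and leaves the base letters.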

§Examples

Whitespace is collapsed and trimmed:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("  John  Smith  "), "john smith");

Apostrophes and hyphens are stripped:

assert_eq!(Normalizer::normalize_name("O'Brien"),    "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"),  "maryjane");

Diacritics are removed:

assert_eq!(Normalizer::normalize_name("José"),  "jose");
assert_eq!(Normalizer::normalize_name("Siân"),  "sian");
assert_eq!(Normalizer::normalize_name("Łukasz"), "łukasz");  // ł has no decomposition

Empty and whitespace-only inputs both normalise to the empty string:

assert_eq!(Normalizer::normalize_name(""),       "");
assert_eq!(Normalizer::normalize_name("    "),   "");

The function is idempotent:

let once = Normalizer::normalize_name("  José-María  ");
let twice = Normalizer::normalize_name(&once);
assert_eq!(once, twice);

pub fn normalize_postcode(postcode: &str) -> String

Normalise a postcode for comparison.

Steps: drop all whitespace, then uppercase. No locale-specific validation — that is intentionally out of scope.
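
The two steps map directly onto iterator adapters. A minimal sketch, with the hypothetical name normalize_postcode_sketch; whether the crate uses Unicode or ASCII-only uppercasing is not stated here, so the Unicode variant below is an assumption.

```rust
/// Hypothetical sketch: drop all whitespace, then uppercase.
fn normalize_postcode_sketch(postcode: &str) -> String {
    postcode
        .chars()
        // Drop every whitespace character, not only leading or trailing ones.
        .filter(|c| !c.is_whitespace())
        // Uppercase what remains (assumption: Unicode-aware uppercasing).
        .flat_map(char::to_uppercase)
        .collect()
}
```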

§Examples

UK postcodes with and without the conventional space are equivalent:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_postcode("CF10 1AA"),    "CF101AA");
assert_eq!(Normalizer::normalize_postcode("cf101aa"),     "CF101AA");
assert_eq!(Normalizer::normalize_postcode("  cf10 1aa "), "CF101AA");

Empty input is preserved:

assert_eq!(Normalizer::normalize_postcode(""), "");

Idempotent:

let once = Normalizer::normalize_postcode("sw1a 2aa");
let twice = Normalizer::normalize_postcode(&once);
assert_eq!(once, twice);

pub fn normalize_phone(phone: &str) -> String

Normalise a phone number for comparison.

Steps:

  1. Keep only ASCII digits (drop spaces, brackets, hyphens, +, …).
  2. If the result starts with 0044, drop those four characters.
  3. Else, if the result starts with 44 and is at least 12 digits long, drop the leading 44.
  4. Else, if the result starts with 0 and is longer than one digit, drop the leading 0.

This canonicalises the common UK formats into a single subscriber number with no leading prefix. Numbers from other countries get no country-specific handling: they are reduced to their digits and are subject only to the rules above.
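
The four rules above can be sketched as a single if/else chain. This is an illustration under the stated rules only; the name normalize_phone_sketch is hypothetical and the crate's actual code may be structured differently.

```rust
/// Hypothetical sketch of the four prefix rules.
fn normalize_phone_sketch(phone: &str) -> String {
    // Rule 1: keep only ASCII digits.
    let digits: String = phone.chars().filter(char::is_ascii_digit).collect();
    if let Some(rest) = digits.strip_prefix("0044") {
        // Rule 2: international access code followed by the UK dial code.
        rest.to_string()
    } else if digits.len() >= 12 && digits.starts_with("44") {
        // Rule 3: UK dial code without an access code.
        digits[2..].to_string()
    } else if digits.len() > 1 && digits.starts_with('0') {
        // Rule 4: national trunk prefix.
        digits[1..].to_string()
    } else {
        digits
    }
}
```

The length guard in rule 3 keeps shorter numbers that merely begin with 44 from losing their first two digits.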

§Examples
use worker_matcher::Normalizer;

// UK mobile, in three formats:
assert_eq!(Normalizer::normalize_phone("07700 900123"),    "7700900123");
assert_eq!(Normalizer::normalize_phone("+44 7700 900123"), "7700900123");
assert_eq!(Normalizer::normalize_phone("0044 7700 900123"), "7700900123");

// UK landline with brackets and spaces:
assert_eq!(Normalizer::normalize_phone("(029) 2034 5678"), "2920345678");

// Empty input is preserved (no digits to keep):
assert_eq!(Normalizer::normalize_phone(""), "");

Idempotent on canonical inputs:

let once = Normalizer::normalize_phone("07700 900123");
let twice = Normalizer::normalize_phone(&once);
assert_eq!(once, twice);

pub fn normalize_phone_e164(phone: &str, default_country: Option<&str>) -> Option<String>

Normalise a phone number to its E.164-style canonical form.

E.164 is the ITU-T standard for international telephone numbers and has the shape +CCNNN…, where CC is the country dialling code (1–3 digits) and the remainder is the national-significant number (NSN) with no trunk prefix.

The function accepts a wide range of textual layouts:

  • +CC… (explicit international, the canonical input form).
  • 00CC… (international access code, common across Europe).
  • 0… (national format, trunk-prefix) — interpreted relative to default_country when the country uses a national trunk 0.
  • NSN… (bare national-significant number) — interpreted relative to default_country.

Returns Some(canonical) if the input parses against a country in the supported table; otherwise None. The supported countries are the five jurisdictions for which the crate exposes a national healthcare identifier (United Kingdom, France, Spain, Ireland, and UK Northern Ireland, which shares the GB dial code), plus the most common worker-mobility partners (US, CA, DE, IT, NL, BE, PT, CH, AT, SE, NO, DK, FI, PL, AU, NZ, JP, CN, IN, BR, MX, ZA).

default_country is the ISO 3166-1 alpha-2 code (e.g. "GB", "FR", "US") of the jurisdiction whose national format applies when the input lacks an explicit international marker. Pass None to refuse to assume a default; in that case only explicit +CC / 00CC inputs will parse.

The function is deterministic and idempotent: feeding a canonical +CCNNN… string back in returns the same string.
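
The dispatch between the four input layouts can be sketched as follows. Everything here is an assumption made for illustration: the Country struct, the three-row COUNTRIES table, and the function names are hypothetical, and the sketch performs no length validation of the national-significant number, which a real implementation would likely need.

```rust
/// Hypothetical table row: ISO code, dial code, optional national trunk prefix.
struct Country {
    iso: &'static str,
    dial: &'static str,
    trunk: Option<&'static str>,
}

// Three illustrative rows only; the supported table described above is larger.
const COUNTRIES: &[Country] = &[
    Country { iso: "GB", dial: "44", trunk: Some("0") },
    Country { iso: "FR", dial: "33", trunk: Some("0") },
    Country { iso: "US", dial: "1", trunk: None },
];

/// Accept digits that begin with a known dial code and carry at least one more digit.
fn with_known_dial_code(digits: &str) -> Option<String> {
    COUNTRIES
        .iter()
        .find(|c| digits.len() > c.dial.len() && digits.starts_with(c.dial))
        .map(|_| format!("+{}", digits))
}

fn e164_sketch(phone: &str, default_country: Option<&str>) -> Option<String> {
    let explicit_plus = phone.trim_start().starts_with('+');
    let digits: String = phone.chars().filter(char::is_ascii_digit).collect();
    if explicit_plus {
        // +CC form: the digits already start with the dial code.
        return with_known_dial_code(&digits);
    }
    if let Some(rest) = digits.strip_prefix("00") {
        // 00CC form: drop the international access code.
        return with_known_dial_code(rest);
    }
    // National or bare form: only meaningful relative to a default country.
    let country = COUNTRIES.iter().find(|c| Some(c.iso) == default_country)?;
    let nsn = match country.trunk {
        Some(t) => digits.strip_prefix(t).unwrap_or(digits.as_str()),
        None => digits.as_str(),
    };
    if nsn.is_empty() {
        None
    } else {
        Some(format!("+{}{}", country.dial, nsn))
    }
}
```

Because a canonical string starts with +, it takes the first branch on a second pass and comes back unchanged, which is where the idempotence comes from in this sketch.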

§Examples

UK mobile, three textual layouts, all canonicalise to the same E.164 form:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("0044 7700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("07700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);

French national format vs international form:

assert_eq!(
    Normalizer::normalize_phone_e164("01 23 45 67 89", Some("FR")),
    Some("+33123456789".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("+33 1 23 45 67 89", Some("GB")),
    Some("+33123456789".to_string()),
);

North American (NANP) numbers have no trunk prefix:

assert_eq!(
    Normalizer::normalize_phone_e164("(415) 555-1234", Some("US")),
    Some("+14155551234".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("+1 415 555 1234", None),
    Some("+14155551234".to_string()),
);

Unparseable or ambiguous inputs return None:

// No default country and no international marker: ambiguous.
assert_eq!(Normalizer::normalize_phone_e164("07700 900123", None), None);
// Unknown dial code.
assert_eq!(Normalizer::normalize_phone_e164("+999 1234567", None), None);
// Empty input.
assert_eq!(Normalizer::normalize_phone_e164("", Some("GB")), None);

Idempotent on canonical inputs:

let once = Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")).unwrap();
let twice = Normalizer::normalize_phone_e164(&once, Some("GB")).unwrap();
assert_eq!(once, twice);

pub fn expand_street_abbreviations(line: &str) -> String

Expand common postal address abbreviations as whole tokens.

The input is tokenised on whitespace and each token is matched case-insensitively (after stripping a single trailing . or ,) against a fixed table of street-type and directional abbreviations. Recognised tokens are replaced with their long form, lowercased; unrecognised tokens are passed through verbatim. Tokens are then re-joined by single spaces.

This function is intentionally simple: it does not apply any position-aware heuristics. The well-known ambiguous case "St" — which can mean Street or Saint — is always expanded to Street. In practice this remains useful for fuzzy matching because the canonical form is consistent on both sides of a comparison; pre-process upstream if you need finer disambiguation.
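
A minimal sketch of the whole-token lookup described above. The function names and the five-entry table are hypothetical; the crate's actual table is larger and is not reproduced here.

```rust
/// Hypothetical lookup covering a handful of abbreviations.
fn long_form(token: &str) -> Option<&'static str> {
    match token.to_ascii_lowercase().as_str() {
        "st" => Some("street"),
        "rd" => Some("road"),
        "ave" => Some("avenue"),
        "blvd" => Some("boulevard"),
        "n" => Some("north"),
        _ => None,
    }
}

fn expand_sketch(line: &str) -> String {
    line.split_whitespace()
        .map(|token| {
            // Strip at most one trailing '.' or ',' before the lookup.
            let bare = token
                .strip_suffix('.')
                .or_else(|| token.strip_suffix(','))
                .unwrap_or(token);
            match long_form(bare) {
                // Recognised: emit the lowercased long form.
                Some(expanded) => expanded.to_string(),
                // Unrecognised: pass the original token through verbatim.
                None => token.to_string(),
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Matching whole tokens rather than substrings is what keeps a word such as "Stanley" from being rewritten because it begins with "St".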

§Examples
use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::expand_street_abbreviations("123 High St"),
    "123 High street",
);
assert_eq!(
    Normalizer::expand_street_abbreviations("45 N. Park Ave."),
    "45 north Park avenue",
);
assert_eq!(
    Normalizer::expand_street_abbreviations("12 Sunset Blvd"),
    "12 Sunset boulevard",
);

Idempotent on already-expanded inputs (long forms are not re-expanded):

let once = Normalizer::expand_street_abbreviations("10 Downing St");
let twice = Normalizer::expand_street_abbreviations(&once);
assert_eq!(once, twice);

pub fn normalize_address_line(line: &str) -> String

Normalise an address line for comparison.

Pipeline:

  1. Expand street-type and directional abbreviations via Normalizer::expand_street_abbreviations (so "St" → "street", "Rd" → "road", "N" → "north").
  2. Apply the name-normalisation pipeline (Normalizer::normalize_name): NFKD-decompose, drop combining marks, drop ASCII punctuation, lowercase, collapse whitespace.

The function is idempotent, and its result is suitable for direct equality or similarity comparison.

§Examples

Abbreviated and full forms canonicalise identically:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_address_line("123 High St"),
    Normalizer::normalize_address_line("123 High Street"),
);
assert_eq!(
    Normalizer::normalize_address_line("45 N Park Ave"),
    Normalizer::normalize_address_line("45 North Park Avenue"),
);

Punctuation and case are normalised:

assert_eq!(
    Normalizer::normalize_address_line("10, DOWNING Street."),
    "10 downing street",
);

pub fn parse_address_line(line: &str) -> ParsedAddressLine

Parse an address line into its structured components.

The function performs a best-effort structural decomposition of a single-line postal address into:

  • house_number — the leading run of digits (with an optional single alphabetic suffix, e.g. "10A"), uppercased. None if no leading number is present.
  • unit — a recognised sub-unit prefix (Flat, Apt, Apartment, Unit, Suite, Ste) and its identifier, lowercased and space-joined (e.g. "flat 2a"). None if no recognised prefix is present.
  • street — the remaining text after unit and house_number are removed, run through Normalizer::normalize_address_line.

Parsing is deterministic and format-only — no postal reference is consulted. Inputs that do not match the simple regular structure (e.g. a postcode-only string, a city name) degrade gracefully: house_number and unit are None, and street carries the normalised input.
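
The decomposition can be sketched as two optional prefix checks over the token list. This is a simplified illustration: ParsedSketch, UNIT_PREFIXES, and the function names are hypothetical stand-ins, and the street is only lowercased here rather than run through Normalizer::normalize_address_line, so abbreviations are not expanded.

```rust
#[derive(Debug, PartialEq)]
struct ParsedSketch {
    house_number: Option<String>,
    unit: Option<String>,
    street: String,
}

const UNIT_PREFIXES: [&str; 6] = ["flat", "apt", "apartment", "unit", "suite", "ste"];

/// A run of digits followed by at most one ASCII letter, e.g. "10" or "10a".
fn is_house_number(token: &str) -> bool {
    let digits = token.trim_end_matches(|c: char| c.is_ascii_alphabetic());
    !digits.is_empty()
        && digits.chars().all(|c| c.is_ascii_digit())
        && token.len() - digits.len() <= 1
}

fn parse_sketch(line: &str) -> ParsedSketch {
    // Lowercase each token and trim surrounding commas and full stops.
    let tokens: Vec<String> = line
        .split_whitespace()
        .map(|t| t.trim_matches(|c: char| c == ',' || c == '.').to_lowercase())
        .filter(|t| !t.is_empty())
        .collect();
    let mut rest = tokens.as_slice();

    // Optional unit: a recognised prefix plus the identifier that follows it.
    let mut unit = None;
    if rest.len() >= 2 && UNIT_PREFIXES.contains(&rest[0].as_str()) {
        unit = Some(format!("{} {}", rest[0], rest[1]));
        rest = &rest[2..];
    }

    // Optional house number: the next token, uppercased.
    let mut house_number = None;
    if let Some(first) = rest.first().filter(|t| is_house_number(t.as_str())) {
        house_number = Some(first.to_uppercase());
        rest = &rest[1..];
    }

    ParsedSketch { house_number, unit, street: rest.join(" ") }
}
```

Inputs with neither a unit prefix nor a leading number fall through both checks, which matches the graceful degradation described above.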

§Examples

Typical UK / US single-line addresses:

use worker_matcher::Normalizer;

let p = Normalizer::parse_address_line("123 High Street");
assert_eq!(p.house_number.as_deref(), Some("123"));
assert_eq!(p.unit, None);
assert_eq!(p.street, "high street");

let p = Normalizer::parse_address_line("10A Downing St");
assert_eq!(p.house_number.as_deref(), Some("10A"));
assert_eq!(p.street, "downing street");

let p = Normalizer::parse_address_line("Flat 2A, 10 Downing Street");
assert_eq!(p.unit.as_deref(), Some("flat 2a"));
assert_eq!(p.house_number.as_deref(), Some("10"));
assert_eq!(p.street, "downing street");

let p = Normalizer::parse_address_line("Apt 5, 1600 Pennsylvania Ave");
assert_eq!(p.unit.as_deref(), Some("apt 5"));
assert_eq!(p.house_number.as_deref(), Some("1600"));
assert_eq!(p.street, "pennsylvania avenue");

Inputs without a leading number still parse:

let p = Normalizer::parse_address_line("Buckingham Palace");
assert_eq!(p.house_number, None);
assert_eq!(p.unit, None);
assert_eq!(p.street, "buckingham palace");

pub fn phonetic_code(name: &str) -> String

Compute a phonetic (Soundex) code for a name.

Internally, the input is first normalised via Normalizer::normalize_name and then encoded with the American Soundex algorithm. Names that sound alike map to the same code, which lets the matcher catch spelling variants such as “Smith” / “Smyth” or “Stephen” / “Steven”.

The implementation is suitable for English-language names; non-English phonemes may be lost. Per T-9 (spec §21.4), Soundex remains the default. An opt-in MatchConfig::phonetic_encoder enum (Double Metaphone, Daitch-Mokotoff), gated behind a Cargo feature flag, is to be exposed once an empirical multinational worker corpus is available; implementation is tracked as T-9.1.
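
For readers unfamiliar with American Soundex, a compact sketch of the encoding step follows. It assumes ASCII input (the crate normalises first via Normalizer::normalize_name); the names soundex_digit and soundex_sketch are hypothetical and this is not the crate's code.

```rust
/// Soundex digit for a lowercase ASCII consonant; None for vowels, h, w, y.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

fn soundex_sketch(name: &str) -> String {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase());
    // Empty input yields an empty code rather than a padded default.
    let first = match letters.next() {
        Some(c) => c,
        None => return String::new(),
    };
    let mut code = String::from(first.to_ascii_uppercase());
    // The first letter's own digit takes part in duplicate suppression.
    let mut previous = soundex_digit(first);
    for c in letters {
        if code.len() == 4 {
            break;
        }
        match soundex_digit(c) {
            Some(d) => {
                if previous != Some(d) {
                    code.push(d);
                }
                previous = Some(d);
            }
            // 'h' and 'w' are transparent: they do not reset the previous digit.
            None if c == 'h' || c == 'w' => {}
            // Vowels (and 'y') separate repeated digits.
            None => previous = None,
        }
    }
    // Pad to one letter plus three digits.
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

Under this sketch "Smith" and "Smyth" both encode to S530, and "Stephen" and "Steven" both encode to S315.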

§Examples

Similar-sounding spellings share a code:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::phonetic_code("Smith"), Normalizer::phonetic_code("Smyth"));
assert_eq!(Normalizer::phonetic_code("Stephen"), Normalizer::phonetic_code("Steven"));

Different families produce different codes:

assert_ne!(Normalizer::phonetic_code("Jones"), Normalizer::phonetic_code("Smith"));

Empty input returns an empty string, not a default Soundex value:

assert_eq!(Normalizer::phonetic_code(""),       "");
assert_eq!(Normalizer::phonetic_code("   "),    "");

pub fn normalize_email(email: &str, gmail_dot_folding: bool) -> Option<String>

Normalise an email address for comparison.

Steps:

  1. Trim surrounding whitespace.
  2. Lowercase the entire address (RFC 5321 makes the domain case-insensitive and most real-world deployments treat the localpart case-insensitively too; case-sensitive localparts are technically legal but vanishingly rare in healthcare data).
  3. Reject inputs that lack exactly one @ or that have an empty localpart or domain by returning None.
  4. If gmail_dot_folding is true and the domain is gmail.com or googlemail.com, strip every . from the localpart and drop any +tag suffix. Both transformations are safe for Gmail addresses under Google’s documented routing rules: j.smith@gmail.com, js.mith@gmail.com, and jsmith+work@gmail.com all deliver to the same mailbox as jsmith@gmail.com.

The function is deterministic and idempotent on successful outputs.
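
The steps above can be sketched with str::split_once. The name normalize_email_sketch is hypothetical, and rejecting a Gmail localpart that folds to nothing (for example "+tag@gmail.com") is an assumption of this sketch, not documented behaviour.

```rust
/// Hypothetical sketch of the four steps.
fn normalize_email_sketch(email: &str, gmail_dot_folding: bool) -> Option<String> {
    // Steps 1 and 2: trim, then lowercase the whole address.
    let lowered = email.trim().to_lowercase();
    // Step 3: exactly one '@', with non-empty text on both sides.
    let (local, domain) = lowered.split_once('@')?;
    if local.is_empty() || domain.is_empty() || domain.contains('@') {
        return None;
    }
    // Step 4: optional Gmail folding.
    let is_gmail = domain == "gmail.com" || domain == "googlemail.com";
    if gmail_dot_folding && is_gmail {
        // Drop the "+tag" suffix first, then every '.' in what remains.
        let base = local.split('+').next().unwrap_or(local);
        let folded: String = base.chars().filter(|&c| c != '.').collect();
        // Assumption: a localpart that folds to nothing is treated as malformed.
        if folded.is_empty() {
            return None;
        }
        return Some(format!("{}@{}", folded, domain));
    }
    Some(format!("{}@{}", local, domain))
}
```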

§Examples

Common case-and-whitespace normalisation:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_email("  Alice@Example.ORG  ", false),
    Some("alice@example.org".to_string()),
);

Malformed inputs return None:

assert_eq!(Normalizer::normalize_email("no-at-sign", false), None);
assert_eq!(Normalizer::normalize_email("@example.org", false), None);
assert_eq!(Normalizer::normalize_email("alice@", false), None);
assert_eq!(Normalizer::normalize_email("a@b@c", false), None);
assert_eq!(Normalizer::normalize_email("", false), None);

Optional Gmail dot-folding:

assert_eq!(
    Normalizer::normalize_email("j.smith@gmail.com", true),
    Some("jsmith@gmail.com".to_string()),
);
assert_eq!(
    Normalizer::normalize_email("jsmith+work@googlemail.com", true),
    Some("jsmith@googlemail.com".to_string()),
);
// Dot-folding does not touch non-Gmail addresses.
assert_eq!(
    Normalizer::normalize_email("j.smith@example.org", true),
    Some("j.smith@example.org".to_string()),
);

Idempotent on canonical inputs:

let once = Normalizer::normalize_email("Alice@Example.ORG", false).unwrap();
let twice = Normalizer::normalize_email(&once, false).unwrap();
assert_eq!(once, twice);
