Struct Normalizer

pub struct Normalizer;

Stateless namespace for text normalisation routines.

Normalizer is a unit type with no fields; every method is an associated function. It is a struct rather than a module of free functions purely so that the public API has a single, discoverable entry point.

use worker_matcher::Normalizer;

let canonical = Normalizer::normalize_name("José-María");
assert_eq!(canonical, "josemaria");

Implementations§

impl Normalizer

pub fn normalize_name(name: &str) -> String

Normalise a human name for comparison.

Steps, in order:

Decompose to Unicode NFKD form (é → e + combining acute).
Drop combining marks (diacritics).
Drop ASCII punctuation (apostrophes, hyphens, full stops, …).
Lowercase.
Collapse consecutive whitespace to single ASCII spaces; trim ends.

The result is suitable for direct equality comparison or for feeding into a string-similarity scorer.
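
As a rough illustration of steps 3 to 5, the sketch below uses only the standard library. It deliberately omits the NFKD decomposition and combining-mark removal of steps 1 and 2, which need Unicode tables that std does not provide, so it only matches the description above for input that is already free of diacritics. The name normalize_ascii_steps is hypothetical and is not part of worker_matcher.

```rust
/// Hypothetical helper covering steps 3-5 only (no NFKD, no mark removal).
fn normalize_ascii_steps(name: &str) -> String {
    let cleaned: String = name
        .chars()
        // Step 3: drop ASCII punctuation such as apostrophes and hyphens.
        .filter(|c| !c.is_ascii_punctuation())
        // Step 4: lowercase (one char may lowercase to several chars).
        .flat_map(char::to_lowercase)
        .collect();
    // Step 5: split on any whitespace run and re-join with single spaces,
    // which also trims both ends.
    cleaned.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Under these assumptions, normalize_ascii_steps("  Mary-Jane  O'Brien ") yields "maryjane obrien".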

§Examples

Whitespace is collapsed and trimmed:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_name("  John  Smith  "), "john smith");

Apostrophes and hyphens are stripped:

assert_eq!(Normalizer::normalize_name("O'Brien"),    "obrien");
assert_eq!(Normalizer::normalize_name("MARY-JANE"),  "maryjane");

Diacritics are removed:

assert_eq!(Normalizer::normalize_name("José"),  "jose");
assert_eq!(Normalizer::normalize_name("Siân"),  "sian");
assert_eq!(Normalizer::normalize_name("Łukasz"), "łukasz");  // ł has no decomposition

Empty and whitespace-only inputs both normalise to the empty string:

assert_eq!(Normalizer::normalize_name(""),       "");
assert_eq!(Normalizer::normalize_name("    "),   "");

The function is idempotent:

let once = Normalizer::normalize_name("  José-María  ");
let twice = Normalizer::normalize_name(&once);
assert_eq!(once, twice);

pub fn normalize_postcode(postcode: &str) -> String

Normalise a postcode for comparison.

Steps: drop all whitespace, then uppercase. No locale-specific validation — that is intentionally out of scope.
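
The two steps can be sketched with the standard library alone. The name normalize_postcode_sketch is hypothetical, and the use of Unicode-aware whitespace detection and uppercasing is an assumption; the crate may restrict itself to ASCII.

```rust
/// Hypothetical sketch: drop all whitespace, then uppercase.
fn normalize_postcode_sketch(postcode: &str) -> String {
    postcode
        .chars()
        // Assumption: any Unicode whitespace counts, not only ASCII spaces.
        .filter(|c| !c.is_whitespace())
        .flat_map(char::to_uppercase)
        .collect()
}
```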

§Examples

UK postcodes with and without the conventional space are equivalent:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::normalize_postcode("CF10 1AA"),    "CF101AA");
assert_eq!(Normalizer::normalize_postcode("cf101aa"),     "CF101AA");
assert_eq!(Normalizer::normalize_postcode("  cf10 1aa "), "CF101AA");

Empty input is preserved:

assert_eq!(Normalizer::normalize_postcode(""), "");

Idempotent:

let once = Normalizer::normalize_postcode("sw1a 2aa");
let twice = Normalizer::normalize_postcode(&once);
assert_eq!(once, twice);

pub fn normalize_phone(phone: &str) -> String

Normalise a phone number for comparison.

Steps:

Keep only ASCII digits (drop spaces, brackets, hyphens, +, …).
If the result starts with 0044, drop those four characters.
Else, if the result starts with 44 and is at least 12 digits long, drop the leading 44.
Else, if the result starts with 0 and is longer than one digit, drop the leading 0.

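This canonicalises the common UK formats into a single subscriber number with no leading prefix. Numbers from other countries get no country-specific handling: they are reduced to their digits and are otherwise left unchanged unless they happen to match one of the prefix rules above.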
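
The prefix rules above can be sketched as a single if/else chain over the digit string. This is a minimal illustration written from the listed steps, not the crate's source; normalize_phone_sketch is a hypothetical name.

```rust
/// Hypothetical sketch of the four documented steps.
fn normalize_phone_sketch(phone: &str) -> String {
    // Step 1: keep only ASCII digits.
    let digits: String = phone.chars().filter(char::is_ascii_digit).collect();

    let rest: &str = if let Some(stripped) = digits.strip_prefix("0044") {
        // Step 2: international access code followed by the UK dial code.
        stripped
    } else if digits.len() >= 12 && digits.starts_with("44") {
        // Step 3: bare UK dial code, only when the number is long enough.
        &digits[2..]
    } else if digits.len() > 1 && digits.starts_with('0') {
        // Step 4: national trunk prefix.
        &digits[1..]
    } else {
        &digits
    };
    rest.to_string()
}
```

With this sketch a non-UK number such as "+1 415 555 1234" comes back as "14155551234": only the formatting characters are removed, because none of the prefix rules apply.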

§Examples

use worker_matcher::Normalizer;

// UK mobile, in three formats:
assert_eq!(Normalizer::normalize_phone("07700 900123"),    "7700900123");
assert_eq!(Normalizer::normalize_phone("+44 7700 900123"), "7700900123");
assert_eq!(Normalizer::normalize_phone("0044 7700 900123"), "7700900123");

// UK landline with brackets and spaces:
assert_eq!(Normalizer::normalize_phone("(029) 2034 5678"), "2920345678");

// Empty input is preserved (no digits to keep):
assert_eq!(Normalizer::normalize_phone(""), "");

Idempotent on canonical inputs:

let once = Normalizer::normalize_phone("07700 900123");
let twice = Normalizer::normalize_phone(&once);
assert_eq!(once, twice);

pub fn normalize_phone_e164(phone: &str, default_country: Option<&str>) -> Option<String>

Normalise a phone number to its E.164-style canonical form.

E.164 is the ITU-T standard for international telephone numbers and has the shape +CCNNN…, where CC is the country dialling code (1–3 digits) and the remainder is the national-significant number (NSN) with no trunk prefix.

The function accepts a wide range of textual layouts:

+CC… (explicit international, the canonical input form).
00CC… (international access code, common across Europe).
0… (national format, trunk-prefix) — interpreted relative to default_country when the country uses a national trunk 0.
NSN… (bare national-significant number) — interpreted relative to default_country.

Returns Some(canonical) if the input parses against a country in the supported table, and None otherwise. The supported countries are the five jurisdictions for which the crate exposes a national healthcare identifier (United Kingdom, France, Spain, Ireland, and UK Northern Ireland, which shares the GB dial code), plus the most common worker-mobility partners (US, CA, DE, IT, NL, BE, PT, CH, AT, SE, NO, DK, FI, PL, AU, NZ, JP, CN, IN, BR, MX, ZA).

default_country is the ISO 3166-1 alpha-2 code (e.g. "GB", "FR", "US") of the jurisdiction whose national format applies when the input lacks an explicit international marker. Pass None to refuse to assume a default; in that case only explicit +CC / 00CC inputs will parse.

The function is deterministic and idempotent: feeding a canonical +CCNNN… string back in returns the same string.
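
To make the interplay between the four input layouts and default_country concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the three-row COUNTRIES table, the name to_e164_sketch, the decision to treat a leading 00 as an international access code regardless of country, and the absence of any length validation. The crate's real table and rules are more extensive.

```rust
/// Hypothetical table: (ISO 3166-1 alpha-2 code, dial code, trunk prefix).
const COUNTRIES: &[(&str, &str, Option<&str>)] = &[
    ("GB", "44", Some("0")),
    ("FR", "33", Some("0")),
    ("US", "1", None),
];

/// Hypothetical sketch of the layout handling described above.
fn to_e164_sketch(phone: &str, default_country: Option<&str>) -> Option<String> {
    let explicit_plus = phone.trim_start().starts_with('+');
    let digits: String = phone.chars().filter(char::is_ascii_digit).collect();
    if digits.is_empty() {
        return None;
    }

    // Build the digit string that should follow the '+'.
    let international = if explicit_plus {
        digits
    } else if let Some(rest) = digits.strip_prefix("00") {
        rest.to_string()
    } else {
        // National layout: a default country is required from here on.
        let iso = default_country?;
        let &(_, dial, trunk) = COUNTRIES.iter().find(|c| c.0 == iso)?;
        let nsn = match trunk {
            Some(prefix) => digits.strip_prefix(prefix).unwrap_or(digits.as_str()),
            None => digits.as_str(),
        };
        format!("{}{}", dial, nsn)
    };

    // Accept only numbers whose leading digits match a dial code in the table.
    COUNTRIES
        .iter()
        .any(|c| international.starts_with(c.1))
        .then(|| format!("+{}", international))
}
```

Because a canonical output always starts with +, feeding it back in takes the first branch and reproduces the same string, which is the idempotence property stated above.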

§Examples

UK mobile, three textual layouts, all canonicalise to the same E.164 form:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("0044 7700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("07700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);

French national format vs international form:

assert_eq!(
    Normalizer::normalize_phone_e164("01 23 45 67 89", Some("FR")),
    Some("+33123456789".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("+33 1 23 45 67 89", Some("GB")),
    Some("+33123456789".to_string()),
);

North American (NANP) numbers have no trunk prefix:

assert_eq!(
    Normalizer::normalize_phone_e164("(415) 555-1234", Some("US")),
    Some("+14155551234".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("+1 415 555 1234", None),
    Some("+14155551234".to_string()),
);

Unparseable or ambiguous inputs return None:

// No default country and no international marker: ambiguous.
assert_eq!(Normalizer::normalize_phone_e164("07700 900123", None), None);
// Unknown dial code.
assert_eq!(Normalizer::normalize_phone_e164("+999 1234567", None), None);
// Empty input.
assert_eq!(Normalizer::normalize_phone_e164("", Some("GB")), None);

Idempotent on canonical inputs:

let once = Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")).unwrap();
let twice = Normalizer::normalize_phone_e164(&once, Some("GB")).unwrap();
assert_eq!(once, twice);

pub fn expand_street_abbreviations(line: &str) -> String

Expand common postal address abbreviations as whole tokens.

The input is tokenised on whitespace and each token is matched case-insensitively (after stripping a single trailing . or ,) against a fixed table of street-type and directional abbreviations. Recognised tokens are replaced with their long form, lowercased; unrecognised tokens are passed through verbatim. Tokens are then re-joined by single spaces.

This function is intentionally simple: it does not apply any position-aware heuristics. The well-known ambiguous case "St", which can mean Street or Saint, is always expanded to "street". In practice this remains useful for fuzzy matching because the canonical form is consistent on both sides of a comparison; pre-process upstream if you need finer disambiguation.
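
The token-by-token lookup can be sketched as follows. The five-entry match is a stand-in for the crate's fixed table, whose full contents are not listed here, and expand_sketch is a hypothetical name.

```rust
/// Hypothetical sketch with a five-entry abbreviation table.
fn expand_sketch(line: &str) -> String {
    line.split_whitespace()
        .map(|token| {
            // Strip at most one trailing '.' or ',' before the lookup.
            let stem = token
                .strip_suffix(|c: char| c == '.' || c == ',')
                .unwrap_or(token);
            let key = stem.to_ascii_lowercase();
            let expanded = match key.as_str() {
                "st" => "street",
                "rd" => "road",
                "ave" => "avenue",
                "blvd" => "boulevard",
                "n" => "north",
                // Unrecognised tokens pass through verbatim, punctuation included.
                _ => token,
            };
            expanded
        })
        .collect::<Vec<&str>>()
        .join(" ")
}
```

Long forms such as "street" are not keys in the table, so a second pass leaves the output untouched.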

§Examples

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::expand_street_abbreviations("123 High St"),
    "123 High street",
);
assert_eq!(
    Normalizer::expand_street_abbreviations("45 N. Park Ave."),
    "45 north Park avenue",
);
assert_eq!(
    Normalizer::expand_street_abbreviations("12 Sunset Blvd"),
    "12 Sunset boulevard",
);

Idempotent on already-expanded inputs (long forms are not re-expanded):

let once = Normalizer::expand_street_abbreviations("10 Downing St");
let twice = Normalizer::expand_street_abbreviations(&once);
assert_eq!(once, twice);

pub fn normalize_address_line(line: &str) -> String

Normalise an address line for comparison.

Pipeline:

Expand street-type and directional abbreviations via Normalizer::expand_street_abbreviations (so "St" → "street", "Rd" → "road", "N" → "north").
Apply the name-normalisation pipeline (Normalizer::normalize_name): NFKD-decompose, drop combining marks, drop ASCII punctuation, lowercase, collapse whitespace.

The function is idempotent, and its output is suitable for direct equality or similarity comparison.

§Examples

Abbreviated and full forms canonicalise identically:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_address_line("123 High St"),
    Normalizer::normalize_address_line("123 High Street"),
);
assert_eq!(
    Normalizer::normalize_address_line("45 N Park Ave"),
    Normalizer::normalize_address_line("45 North Park Avenue"),
);

Punctuation and case are normalised:

assert_eq!(
    Normalizer::normalize_address_line("10, DOWNING Street."),
    "10 downing street",
);

pub fn parse_address_line(line: &str) -> ParsedAddressLine

Parse an address line into its structured components.

The function performs a best-effort structural decomposition of a single-line postal address into:

house_number — the leading run of digits (with an optional single alphabetic suffix, e.g. "10A"), uppercased. None if no leading number is present.
unit — a recognised sub-unit prefix (Flat, Apt, Apartment, Unit, Suite, Ste) and its identifier, lowercased and space-joined (e.g. "flat 2a"). None if no recognised prefix is present.
street — the remaining text after unit and house_number are removed, run through Normalizer::normalize_address_line.

Parsing is deterministic and format-only — no postal reference is consulted. Inputs that do not match the simple regular structure (e.g. a postcode-only string, a city name) degrade gracefully: house_number and unit are None, and street carries the normalised input.
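
One way to picture the decomposition is as two optional prefixes peeled off a token list. The sketch below is an approximation under several assumptions: ParsedSketch and parse_sketch are hypothetical names, trailing commas and full stops are trimmed from every token, and the street is merely lowercased rather than run through Normalizer::normalize_address_line, so abbreviations are not expanded.

```rust
/// Hypothetical stand-in for ParsedAddressLine.
#[derive(Debug, PartialEq)]
struct ParsedSketch {
    house_number: Option<String>,
    unit: Option<String>,
    street: String,
}

const UNIT_PREFIXES: &[&str] = &["flat", "apt", "apartment", "unit", "suite", "ste"];

/// True for a run of digits with at most one trailing ASCII letter, e.g. "10A".
fn is_house_number(token: &str) -> bool {
    let digits = token.trim_end_matches(|c: char| c.is_ascii_alphabetic());
    !digits.is_empty()
        && digits.chars().all(|c| c.is_ascii_digit())
        && token.len() - digits.len() <= 1
}

fn parse_sketch(line: &str) -> ParsedSketch {
    let cleaned: Vec<&str> = line
        .split_whitespace()
        .map(|t| t.trim_end_matches(|c: char| c == ',' || c == '.'))
        .filter(|t| !t.is_empty())
        .collect();
    let mut tokens = cleaned.as_slice();

    // Optional leading unit: a recognised prefix plus its identifier.
    let mut unit = None;
    if let [prefix, id, rest @ ..] = tokens {
        if UNIT_PREFIXES.iter().any(|p| prefix.eq_ignore_ascii_case(p)) {
            unit = Some(format!("{} {}", prefix, id).to_ascii_lowercase());
            tokens = rest;
        }
    }

    // Optional house number, uppercased.
    let mut house_number = None;
    if let [first, rest @ ..] = tokens {
        if is_house_number(first) {
            house_number = Some(first.to_ascii_uppercase());
            tokens = rest;
        }
    }

    // Assumption: plain lowercasing instead of full address-line normalisation.
    ParsedSketch { house_number, unit, street: tokens.join(" ").to_lowercase() }
}
```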

§Examples

Typical UK / US single-line addresses:

use worker_matcher::Normalizer;

let p = Normalizer::parse_address_line("123 High Street");
assert_eq!(p.house_number.as_deref(), Some("123"));
assert_eq!(p.unit, None);
assert_eq!(p.street, "high street");

let p = Normalizer::parse_address_line("10A Downing St");
assert_eq!(p.house_number.as_deref(), Some("10A"));
assert_eq!(p.street, "downing street");

let p = Normalizer::parse_address_line("Flat 2A, 10 Downing Street");
assert_eq!(p.unit.as_deref(), Some("flat 2a"));
assert_eq!(p.house_number.as_deref(), Some("10"));
assert_eq!(p.street, "downing street");

let p = Normalizer::parse_address_line("Apt 5, 1600 Pennsylvania Ave");
assert_eq!(p.unit.as_deref(), Some("apt 5"));
assert_eq!(p.house_number.as_deref(), Some("1600"));
assert_eq!(p.street, "pennsylvania avenue");

Inputs without a leading number still parse:

let p = Normalizer::parse_address_line("Buckingham Palace");
assert_eq!(p.house_number, None);
assert_eq!(p.unit, None);
assert_eq!(p.street, "buckingham palace");

pub fn phonetic_code(name: &str) -> String

Compute a phonetic (Soundex) code for a name.

Internally, the input is first normalised via Normalizer::normalize_name and then encoded with the American Soundex algorithm. Names that sound alike map to the same code, which lets the matcher catch spelling variants such as “Smith” / “Smyth” or “Stephen” / “Steven”.

The implementation is suitable for English-language names; non-English phonemes may be lost. Decision T-9 (spec §21.4) keeps Soundex as the default and will expose an opt-in MatchConfig::phonetic_encoder enum (Double Metaphone, Daitch-Mokotoff), gated behind a Cargo feature flag, once an empirical multinational worker corpus is available; implementation is tracked as T-9.1.
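
For readers unfamiliar with American Soundex, the sketch below shows the encoding step on its own. It is an independent illustration, not the crate's code: soundex_sketch and soundex_digit are hypothetical names, the Normalizer::normalize_name pre-pass is replaced by simply ignoring non-ASCII-alphabetic characters, and the conventional output shape (uppercase letter followed by three digits) is assumed.

```rust
/// Soundex digit for a lowercase ASCII consonant; None for vowels, h, w, y.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'b' | 'f' | 'p' | 'v' => Some('1'),
        'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
        'd' | 't' => Some('3'),
        'l' => Some('4'),
        'm' | 'n' => Some('5'),
        'r' => Some('6'),
        _ => None,
    }
}

/// Hypothetical sketch of American Soundex on ASCII letters.
fn soundex_sketch(name: &str) -> String {
    let mut letters = name
        .chars()
        .filter(char::is_ascii_alphabetic)
        .map(|c| c.to_ascii_lowercase());
    let first = match letters.next() {
        Some(c) => c,
        None => return String::new(), // empty input gives an empty code
    };
    let mut code = String::from(first.to_ascii_uppercase());
    let mut previous = soundex_digit(first);
    for c in letters {
        if code.len() == 4 {
            break;
        }
        // 'h' and 'w' are transparent: they do not separate equal digits.
        if c == 'h' || c == 'w' {
            continue;
        }
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are coded once.
            if digit != previous {
                code.push(d);
            }
        }
        // Vowels reset `previous` to None, so a repeated digit after a vowel counts.
        previous = digit;
    }
    while code.len() < 4 {
        code.push('0');
    }
    code
}
```

With this sketch "Smith" and "Smyth" both encode to "S530", and "Stephen" and "Steven" both encode to "S315".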

§Examples

Similar-sounding spellings share a code:

use worker_matcher::Normalizer;
assert_eq!(Normalizer::phonetic_code("Smith"), Normalizer::phonetic_code("Smyth"));
assert_eq!(Normalizer::phonetic_code("Stephen"), Normalizer::phonetic_code("Steven"));

Different families produce different codes:

assert_ne!(Normalizer::phonetic_code("Jones"), Normalizer::phonetic_code("Smith"));

Empty input returns an empty string, not a default Soundex value:

assert_eq!(Normalizer::phonetic_code(""),       "");
assert_eq!(Normalizer::phonetic_code("   "),    "");

pub fn normalize_email(email: &str, gmail_dot_folding: bool) -> Option<String>

Normalise an email address for comparison.

Steps:

Trim surrounding whitespace.
Lowercase the entire address (RFC 5321 makes the domain case-insensitive and most real-world deployments treat the localpart case-insensitively too; case-sensitive localparts are technically legal but vanishingly rare in healthcare data).
Return None for inputs that do not contain exactly one @, or whose localpart or domain is empty.
If gmail_dot_folding is true and the domain is gmail.com or googlemail.com, strip every . from the localpart and drop any +tag suffix. Neither transformation changes where the message is delivered under Google’s documented routing rules: j.smith@gmail.com, js.mith@gmail.com, and jsmith+work@gmail.com all deliver to the same mailbox as jsmith@gmail.com.

The function is deterministic and idempotent on successful outputs.
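
The steps above can be sketched with the standard library. normalize_email_sketch is a hypothetical name, and one behaviour is an assumption not stated in the steps: an address whose localpart becomes empty after folding (for example "+tag@gmail.com") is rejected, which keeps the sketch idempotent on its successful outputs.

```rust
/// Hypothetical sketch of the documented email steps.
fn normalize_email_sketch(email: &str, gmail_dot_folding: bool) -> Option<String> {
    // Steps 1-2: trim, then lowercase the whole address.
    let lowered = email.trim().to_lowercase();

    // Step 3: exactly one '@', with non-empty text on both sides.
    let (local, domain) = lowered.split_once('@')?;
    if local.is_empty() || domain.is_empty() || domain.contains('@') {
        return None;
    }

    // Step 4: optional Gmail folding; the domain itself is left as written.
    if gmail_dot_folding && (domain == "gmail.com" || domain == "googlemail.com") {
        let base = local.split('+').next().unwrap_or(local);
        let folded = base.replace('.', "");
        if folded.is_empty() {
            return None; // assumption: nothing left to address
        }
        return Some(format!("{}@{}", folded, domain));
    }
    Some(format!("{}@{}", local, domain))
}
```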

§Examples

Common case-and-whitespace normalisation:

use worker_matcher::Normalizer;
assert_eq!(
    Normalizer::normalize_email("  Alice@Example.ORG  ", false),
    Some("alice@example.org".to_string()),
);

Malformed inputs return None:

assert_eq!(Normalizer::normalize_email("no-at-sign", false), None);
assert_eq!(Normalizer::normalize_email("@example.org", false), None);
assert_eq!(Normalizer::normalize_email("alice@", false), None);
assert_eq!(Normalizer::normalize_email("a@b@c", false), None);
assert_eq!(Normalizer::normalize_email("", false), None);

Optional Gmail dot-folding:

assert_eq!(
    Normalizer::normalize_email("j.smith@gmail.com", true),
    Some("jsmith@gmail.com".to_string()),
);
assert_eq!(
    Normalizer::normalize_email("jsmith+work@googlemail.com", true),
    Some("jsmith@googlemail.com".to_string()),
);
// Dot-folding does not touch non-Gmail addresses.
assert_eq!(
    Normalizer::normalize_email("j.smith@example.org", true),
    Some("j.smith@example.org".to_string()),
);

Idempotent on canonical inputs:

let once = Normalizer::normalize_email("Alice@Example.ORG", false).unwrap();
let twice = Normalizer::normalize_email(&once, false).unwrap();
assert_eq!(once, twice);

Auto Trait Implementations§

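impl Freeze for Normalizer

impl RefUnwindSafe for Normalizer

impl Send for Normalizer

impl Sync for Normalizer

impl Unpin for Normalizer

impl UnsafeUnpin for Normalizer

impl UnwindSafe for Normalizer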

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
