worker-matcher 0.3.0

Worker matcher for healthcare information exchange: deterministic and probabilistic matching with multinational national identifiers (UK NHS / FR NIR / ES TSI / IE IHI / UK NI H&C / US SSN), E.164 phone normalisation, address parsing, nickname dictionary, email scoring, and explainable per-field breakdowns.

Worker matcher Rust Crate

A comprehensive Rust library for matching worker records in healthcare information exchanges.

Documentation index: index.md is the top-level map of every doc in this repo (spec, AGENTS guides, CHANGELOG, examples). Start there if you're new.

Overview

This crate implements both deterministic and probabilistic worker-matching algorithms, based on the research listed under References.

Features

  • Deterministic Matching: Exact matches on NHS numbers and key demographics
  • Probabilistic Matching: Fuzzy matching with configurable scoring thresholds and a Confidence band (High/Medium/Low) derived from the score for triage UIs
  • Batch API: match_one_to_many scores one query against many candidates (output parallel to input); rank_one_to_many returns the same scores sorted by descending score with a deterministic tiebreak — the building block for screening against a master worker index
  • String Similarity Algorithms: Jaro-Winkler and Levenshtein distance
  • NHS-Format Identifier Support: Validation and normalization via the nhs-number crate
  • Multinational National Identifiers (42 schemes): UK NHS Number, France NIR, Spain TSI, Ireland IHI, UK Northern Ireland H&C Number, United States SSN, Australia IHI, Germany KVNR, Italy Codice Fiscale, Netherlands BSN, Sweden Personnummer, UK Scotland CHI Number, Belgium National Number, Bulgaria EGN, Czech Rodné číslo, Denmark CPR, Estonia Isikukood, Spain DNI/NIE, Finland HETU, Croatia OIB, Iceland Kennitala, Lithuania Asmens kodas, Latvia Personas kods, Malta National ID, Norway Fødselsnummer, Poland PESEL, Romania CNP, Slovenia EMŠO, Slovakia Rodné číslo, UK NINO, Greece DSS, Liechtenstein National ID, Netherlands National ID, Poland NIP, Portugal NIF, Brazil CPF, China Resident Identity Card, India Aadhaar, Japan My Number, Mexico CURP, New Zealand NHI, South Africa ID. Each is scheme-local, with its own parser, weight, and breakdown score. Plus 9 per-country passport-format validators (CY, CZ, LI, LT, MT, NL, PT, RO, SK) that feed the multi-country PassportBook model.
  • Passport Books: Vec<PassportBook> on Worker carries one entry per book with explicit country provenance — supports dual / multi-citizenship, accumulates historical book numbers as passports are renewed, and treats any shared (country, number) pair as a deterministic match (cross-country with same digits never matches)
  • Phonetic Matching: Soundex-like algorithm for names (handles "Stephen" vs "Steven")
  • Blood Type Matching: BloodType enum (8 ABO+RhD variants) with a lenient parser accepting canonical (A+), word (A positive), and zero-to-O (0+) variants. Blood type is stable for life, so disagreement is a strong negative signal even though agreement alone is weak.
  • Place of Birth Matching: Worker::birth_place reuses the existing Address type (FHIR Patient.birthPlace parity); dedicated city + country sub-score (0.7 × Jaro-Winkler(city) + 0.3 × exact(country) blend) is diacritic-tolerant and ignores street / postcode fields that aren't meaningful for a place of birth.
  • Multiple-Birth (Twin) Disambiguation: Worker::multiple_birth carries FHIR Patient.multipleBirth (1-indexed birth order), the canonical fix for identical-twin records that share name, DOB, address, and demographic data and would otherwise be wrongly merged.
  • Date of Death and Place of Death: Worker::death_date (FHIR Patient.deceasedDateTime) and Worker::death_place for deceased-worker records. Death date uses the same DD/MM ↔ MM/DD transposition heuristic as date of birth; place of death shares the 0.7 × city + 0.3 × country blend with place of birth via a shared score_named_place helper.
  • Nickname Matching: Opt-in NicknameTable lifts the given-name score for known nicknames (Michael ↔ Mike, Elizabeth ↔ Liz, Robert ↔ Bob, …); built-in English dictionary plus user-extensible classes
  • Diacritic Handling: Unicode normalisation so accented names match their unaccented form (Siân → Sian, José → Jose)
  • Address Normalization: Postcode and street address comparison
  • Sophisticated Address Parsing: Normalizer::parse_address_line extracts house number, unit (Flat/Apt/Suite/…), and street; Normalizer::expand_street_abbreviations unifies St/Street, Rd/Road, N/North, etc. so abbreviated and full forms canonicalise identically
  • Email Matching: Normalizer::normalize_email lowercases the address and trims surrounding whitespace; optional Gmail dot-folding (j.smith@gmail.com → jsmith@gmail.com) and +tag stripping behind a config flag
  • Phone Number Normalization: International / trunk-prefix stripping (+44, 0044, leading 0)
  • International Phone Numbers (E.164): Normalizer::normalize_phone_e164 converts inputs to +CCNNN… form across 39 supported countries — every jurisdiction the crate parses a national identifier for (UK, FR, ES, IE, DE, IT, NL, US/CA, AU, JP, BR, BG, CZ, EE, GR, HR, IS, LI, LT, LV, MT, RO, SI, SK, …); the matcher prefers the E.164 form so a French and a UK number with overlapping digits don't collide; Lithuania's non-0 (8) national trunk prefix is handled correctly
  • Configurable Weights: Customize importance of each field
  • Serialization Support: JSON import/export via serde for all data types and for MatchConfig itself — load tuning parameters from a file without recompiling
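
To illustrate the lenient blood-type parsing described in the feature list, here is a minimal standalone sketch. It does not use the crate: the variant names and the parse_blood_type function are assumptions made for this example, and the crate's own BloodType parser may accept more forms than shown here.

```rust
/// Hypothetical stand-in for the crate's `BloodType`; variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BloodType {
    APos,
    ANeg,
    BPos,
    BNeg,
    AbPos,
    AbNeg,
    OPos,
    ONeg,
}

/// Leniently parse forms such as "A+", "ab negative", or "0+" (zero for O).
pub fn parse_blood_type(input: &str) -> Option<BloodType> {
    // Uppercase and drop whitespace, so "a positive" becomes "APOSITIVE".
    let s: String = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_ascii_uppercase())
        .collect();

    // Split the RhD suffix from the ABO group.
    let (group, positive) = if let Some(g) = s.strip_suffix("POSITIVE") {
        (g, true)
    } else if let Some(g) = s.strip_suffix("NEGATIVE") {
        (g, false)
    } else if let Some(g) = s.strip_suffix('+') {
        (g, true)
    } else if let Some(g) = s.strip_suffix('-') {
        (g, false)
    } else {
        return None;
    };

    match (group, positive) {
        ("A", true) => Some(BloodType::APos),
        ("A", false) => Some(BloodType::ANeg),
        ("B", true) => Some(BloodType::BPos),
        ("B", false) => Some(BloodType::BNeg),
        ("AB", true) => Some(BloodType::AbPos),
        ("AB", false) => Some(BloodType::AbNeg),
        // Accept the digit zero as a common mis-keying of the letter O.
        ("O" | "0", true) => Some(BloodType::OPos),
        ("O" | "0", false) => Some(BloodType::ONeg),
        _ => None,
    }
}
```

In this sketch "A+", "a positive", and "A POSITIVE" parse to the same variant, "0+" is read as O positive, and anything without a recognisable RhD suffix is rejected.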

Installation

Add to your Cargo.toml:

[dependencies]
worker-matcher = "0.3.0"

Usage

Basic Example

use worker_matcher::{MatchingEngine, Worker};
use chrono::NaiveDate;

fn main() {
    // Create two worker records
    let worker1 = Worker::builder()
        .given_name("John")
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    let worker2 = Worker::builder()
        .given_name("Jon")  // Typo
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    // Create matching engine with default config
    let engine = MatchingEngine::default_config();

    // Match workers
    let result = engine.match_workers(&worker1, &worker2);

    println!("Match score: {:.2}", result.score);
    println!("Is match: {}", result.is_match);
    println!("Confidence: {:?}", result.confidence); // High / Medium / Low
}

Configurable Matching

use worker_matcher::{MatchConfig, MatchingEngine};

// Strict matching (exact matches required)
let strict_engine = MatchingEngine::new(MatchConfig::strict());

// Lenient matching (more forgiving for typos)
let lenient_engine = MatchingEngine::new(MatchConfig::lenient());

// Custom configuration
let custom_config = MatchConfig {
    match_threshold: 0.90,
    nhs_number_weight: 0.40,  // Increase NHS number importance
    given_name_weight: 0.15,
    family_name_weight: 0.20,
    date_of_birth_weight: 0.15,
    use_phonetic_matching: true,
    ..Default::default()
};

let engine = MatchingEngine::new(custom_config);

Deterministic Matching

// Check for exact matches only
let is_deterministic_match = engine.deterministic_match(&worker1, &worker2);

if is_deterministic_match {
    println!("Exact match on NHS number or all key demographics");
}

Detailed Match Breakdown

let result = engine.match_workers(&worker1, &worker2);

println!("Overall score: {:.2}", result.score);
println!("NHS number score: {:?}", result.breakdown.nhs_number_score);
println!("Given name score: {:?}", result.breakdown.given_name_score);
println!("Family name score: {:?}", result.breakdown.family_name_score);
println!("Date of birth score: {:?}", result.breakdown.date_of_birth_score);
println!("Address score: {:?}", result.breakdown.address_score);
println!("Phonetic name score: {:?}", result.breakdown.phonetic_name_score);

Worker Data Model

The Worker struct supports:

  • NHS Number: NHS-format 10-digit healthcare identifier with Modulus-11 check digit
  • Name Fields: Given, middle, and family names
  • Date of Birth: Birth date for age verification
  • Gender: Male, Female, Other, Unknown
  • Address: Multi-line address with postcode
  • Contact: Phone, mobile, email
  • Local ID: Hospital/practice-specific identifier
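
The Modulus-11 check mentioned in the list above can be sketched without the crate. This is a minimal illustration of the standard NHS number check-digit rule; the function name is_valid_nhs_number is hypothetical, and the crate itself delegates validation to the nhs-number crate.

```rust
/// Returns true when `input` is a 10-digit number whose last digit is a valid
/// Modulus-11 check digit. Spaces and hyphens are ignored.
/// Hypothetical helper for illustration; not part of the crate's API.
pub fn is_valid_nhs_number(input: &str) -> bool {
    // Any non-digit character (other than separators) makes the input invalid.
    let digits: Vec<u32> = input
        .chars()
        .filter(|&c| c != ' ' && c != '-')
        .map(|c| c.to_digit(10))
        .collect::<Option<Vec<u32>>>()
        .unwrap_or_default();
    if digits.len() != 10 {
        return false;
    }

    // Weight the first nine digits 10, 9, ..., 2 and sum the products.
    let sum: u32 = digits[..9]
        .iter()
        .zip((2u32..=10).rev())
        .map(|(d, w)| d * w)
        .sum();

    // Check digit is 11 minus the remainder; 11 maps to 0, and 10 is never valid.
    let check = match 11 - (sum % 11) {
        11 => 0,
        10 => return false,
        c => c,
    };
    check == digits[9]
}
```

A number whose computed check value is 10 can never be valid, so such inputs are rejected regardless of their final digit.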

Matching Algorithm

The matching engine uses a weighted scoring system:

Field            Default Weight   Purpose
NHS Number       30%              Strongest identifier when available
Family Name      20%              Critical demographic
Date of Birth    20%              Age verification
Given Name       15%              Important but subject to nicknames
Address          5%               Supporting evidence
Gender           5%               Supporting evidence
Phone            5%               Supporting evidence
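
One way to combine per-field scores with these weights is a weighted mean, sketched below in standalone Rust. The FieldScore type and weighted_score function are hypothetical, and the handling of missing fields (drop their weight and renormalise over the rest) is an assumption made for this example; the crate's actual combination rule is defined in its spec and may differ.

```rust
/// One compared field: its configured weight and, when both records carried a
/// value, a similarity in 0.0..=1.0. Hypothetical type for illustration.
pub struct FieldScore {
    pub weight: f64,
    pub score: Option<f64>,
}

/// Weighted mean over the fields that were actually compared.
/// Assumption: fields with no score contribute neither weight nor score.
pub fn weighted_score(fields: &[FieldScore]) -> Option<f64> {
    let mut total = 0.0;
    let mut weight_sum = 0.0;
    for field in fields {
        if let Some(score) = field.score {
            total += field.weight * score;
            weight_sum += field.weight;
        }
    }
    // With nothing to compare there is no meaningful score.
    if weight_sum > 0.0 {
        Some(total / weight_sum)
    } else {
        None
    }
}
```

Renormalising keeps a record pair with sparse data on the same 0 to 1 scale as a fully populated pair, at the cost of basing the score on less evidence.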

Phonetic Matching provides bonus points when names sound similar (e.g., "Stephen" vs "Steven").
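
For readers unfamiliar with phonetic codes, the sketch below implements classic American Soundex in standalone Rust. The crate describes its own algorithm only as Soundex-like, so treat this as a stand-in for the general technique rather than the crate's implementation; the function names are hypothetical.

```rust
/// Soundex digit for a consonant, or None for vowels, H, W, and Y.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Classic four-character Soundex code, e.g. "S315".
/// Returns None when the input contains no ASCII letters.
pub fn soundex(name: &str) -> Option<String> {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase());

    let first = letters.next()?;
    let mut out = String::from(first);
    // The first letter's own digit suppresses an immediately repeated code.
    let mut prev = soundex_digit(first);

    for c in letters {
        if out.len() == 4 {
            break;
        }
        match soundex_digit(c) {
            Some(d) => {
                if prev != Some(d) {
                    out.push(d);
                }
                prev = Some(d);
            }
            // H and W are transparent: they do not separate repeated codes.
            None if c == 'H' || c == 'W' => {}
            // Vowels (and Y) separate codes, so the same digit may repeat.
            None => prev = None,
        }
    }
    // Pad short codes with zeros.
    while out.len() < 4 {
        out.push('0');
    }
    Some(out)
}
```

Under this sketch "Stephen" and "Steven" produce the same code, which is the kind of agreement a phonetic bonus rewards.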

Research Basis

Key Findings Applied

  1. No 100% Accuracy: Research shows even the best algorithms achieve 90-98% accuracy. This crate aims for transparency with confidence scores.

  2. Standardization Critical: All inputs are normalized:

    • Names: lowercase, remove diacritics, trim spaces
    • Postcodes: uppercase, remove spaces
    • Phone numbers: remove formatting, handle country codes
    • NHS numbers: digits only
  3. Multi-Factor Approach: Following research recommendations, matching uses multiple demographic fields rather than relying on a single identifier.

  4. Weighted Probabilistic Matching: Combines multiple weak identifiers into a strong match signal, following best practices from health information exchanges.
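
The normalisation rules in item 2 can be sketched in a few standalone functions. These are illustrative only and are not the crate's Normalizer: the function names are hypothetical, and the diacritic folding covers just a handful of precomposed Latin-1 letters, whereas a production implementation would normally rely on Unicode decomposition.

```rust
/// Postcodes: uppercase with all whitespace removed.
pub fn normalize_postcode(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_ascii_uppercase())
        .collect()
}

/// NHS numbers: keep digits only.
pub fn normalize_nhs_number(s: &str) -> String {
    s.chars().filter(|c| c.is_ascii_digit()).collect()
}

/// Names: lowercase, fold a few accented letters, trim and collapse spaces.
pub fn normalize_name(s: &str) -> String {
    let folded: String = s
        .chars()
        .flat_map(|c| c.to_lowercase())
        .map(|c| match c {
            '\u{e0}'..='\u{e5}' => 'a', // a with grave, acute, circumflex, tilde, diaeresis, ring
            '\u{e7}' => 'c',            // c with cedilla
            '\u{e8}'..='\u{eb}' => 'e', // e with grave, acute, circumflex, diaeresis
            '\u{ec}'..='\u{ef}' => 'i', // i with grave, acute, circumflex, diaeresis
            '\u{f1}' => 'n',            // n with tilde
            '\u{f2}'..='\u{f6}' => 'o', // o with grave, acute, circumflex, tilde, diaeresis
            '\u{f9}'..='\u{fc}' => 'u', // u with grave, acute, circumflex, diaeresis
            other => other,
        })
        .collect();
    // split_whitespace trims both ends and collapses internal runs of spaces.
    folded.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Normalising both sides before comparison means the similarity functions only see differences that matter, such as genuine spelling variation, and not differences in case or spacing.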

Testing

Run the test suite:

# Unit tests
cargo test

# Integration tests
cargo test --test integration_tests

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_fuzzy_name_match

# Property tests (1000 proptest cases per property)
cargo test --test property_tests

Benchmarks

Criterion benchmarks live in benches/match_pair.rs and exercise the hot paths a downstream MPI integrator will care about:

# Run all benches (HTML reports → target/criterion/)
cargo bench

# Smoke run (fast, lower statistical power)
cargo bench -- --quick

# A single bench by name
cargo bench --bench match_pair -- match_pair/fuzzy_near_match

Indicative numbers on a 2024 Apple Silicon machine: single-pair fuzzy match ~4 µs, deterministic identifier hit ~160 ns, batch ranking ~3 µs per candidate — well under the spec §17 budget of < 50 µs per pair.

Test Coverage

  • ✅ Perfect matches (100% score)
  • ✅ Fuzzy name matching (typos, alternate spellings)
  • ✅ Names with diacritics
  • ✅ Phonetic name matching
  • ✅ Phone number normalization
  • ✅ Address comparison
  • ✅ NHS number validation
  • ✅ Deterministic matching
  • ✅ Strict vs lenient modes
  • ✅ Missing field handling
  • ✅ Serialization/deserialization

Example: Running the Demo

cargo run

This runs example scenarios including:

  1. Perfect match
  2. Fuzzy name match (Stephen vs Steven)
  3. Names with diacritics (Siân vs Sian)
  4. Address matching
  5. Complete mismatch
  6. Strict vs lenient comparison

Performance Considerations

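  • Time Complexity: O(1) per identifier comparison for deterministic matching; string similarity scales with input length (Levenshtein distance is O(n × m) for strings of lengths n and m)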
  • Memory: Minimal allocation, uses borrowed references where possible
  • Concurrency: Thread-safe, all operations are immutable
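
To make the cost of string similarity concrete, here is a standalone sketch of the textbook dynamic-programming Levenshtein distance, which visits every pair of character positions once and keeps only two rows in memory. It is an illustration of the technique, not the crate's implementation, and the similarity normalisation (one minus distance over the longer length) is an assumption for this example.

```rust
/// Edit distance between `a` and `b`, counted in Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i chars of `a` and the first j of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        // Turning i + 1 chars of `a` into an empty prefix costs i + 1 deletions.
        let mut curr = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in 0.0..=1.0; two empty strings count as identical.
pub fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

Because names and address lines are short, the quadratic cost is small per pair; it is the number of candidate pairs that dominates in large screening runs.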

Limitations

  1. No Machine Learning: This is a rule-based system, not ML/AI
  2. Fixed Identifier Coverage: National identifiers are validated only for the schemes listed under Features; identifiers from other schemes are not validated
  3. No Persistent Storage: In-memory matching only
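  4. No Built-in Candidate Blocking: The batch API scores every candidate it is given; pre-filtering large candidate sets is left to the application layer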

International Phone Numbers

The crate exposes two phone normalisers:

  • Normalizer::normalize_phone(phone) -> String — legacy UK-centric national-significant form. Infallible.
  • Normalizer::normalize_phone_e164(phone, default_country) -> Option<String> — international E.164 form (+CCNNN…). Returns None if the input cannot be confidently parsed against the supported country table.

MatchingEngine::match_workers uses the E.164 form first and falls back to the legacy form when either input cannot be parsed. The default country is configured via MatchConfig::phone_default_country (defaults to Some("GB")):

use worker_matcher::{MatchConfig, MatchingEngine, Normalizer, Worker};

// Direct call:
assert_eq!(
    Normalizer::normalize_phone_e164("+44 7700 900123", Some("GB")),
    Some("+447700900123".to_string()),
);
assert_eq!(
    Normalizer::normalize_phone_e164("01 23 45 67 89", Some("FR")),
    Some("+33123456789".to_string()),
);

// Via the matcher, with a non-UK default:
let cfg = MatchConfig {
    phone_default_country: Some("FR".into()),
    ..MatchConfig::default()
};
let engine = MatchingEngine::new(cfg);
let p1 = Worker::builder()
    .given_name("Jean").family_name("Dupont")
    .phone("01 23 45 67 89").build();
let p2 = Worker::builder()
    .given_name("Jean").family_name("Dupont")
    .phone("+33 1 23 45 67 89").build();
assert_eq!(engine.match_workers(&p1, &p2).breakdown.phone_score, Some(1.0));

Supported countries include: UK, France, Spain, Ireland, UK Northern Ireland (via GB dial code), Germany, Italy, Netherlands, Belgium, Portugal, Switzerland, Austria, Sweden, Norway, Denmark, Finland, Poland, Australia, New Zealand, US, Canada, Japan, China, India, Brazil, Mexico, South Africa. See spec.md §14.3.2 for the full table.
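
The general shape of E.164 normalisation (strip formatting, detect an international prefix, otherwise drop the national trunk prefix and prepend the dial code) can be sketched in standalone Rust. This is not the crate's normalize_phone_e164: the to_e164 and dial_info names are hypothetical, the table covers only three countries, and no number-length validation is attempted.

```rust
/// (dial code, national trunk prefix) for a few countries.
/// Hypothetical three-entry table; the crate's own table is far larger.
fn dial_info(country: &str) -> Option<(&'static str, &'static str)> {
    match country {
        "GB" => Some(("44", "0")),
        "FR" => Some(("33", "0")),
        "LT" => Some(("370", "8")), // Lithuania uses 8, not 0, as trunk prefix
        _ => None,
    }
}

/// Convert `input` to +CCNNN form, or None when it cannot be interpreted.
pub fn to_e164(input: &str, default_country: Option<&str>) -> Option<String> {
    let trimmed = input.trim();
    let digits: String = trimmed.chars().filter(|c| c.is_ascii_digit()).collect();
    if digits.is_empty() {
        return None;
    }
    // Already international: "+44 7700 900123".
    if trimmed.starts_with('+') {
        return Some(format!("+{digits}"));
    }
    // International call prefix: "0044 7700 900123".
    if let Some(rest) = digits.strip_prefix("00") {
        return Some(format!("+{rest}"));
    }
    // National form: needs a default country to supply the dial code.
    let (dial, trunk) = dial_info(default_country?)?;
    let national = digits.strip_prefix(trunk).unwrap_or(&digits);
    Some(format!("+{dial}{national}"))
}
```

Comparing the E.164 forms, and not the bare national digits, is what keeps a French and a UK number with overlapping digits from colliding.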

Passport Books

Passport book numbers don't fit the per-scheme Option<String> national-identifier pattern: a worker may hold passports from several countries, each book has its own number, and book numbers change with each renewal. The crate models this directly with a Vec<PassportBook> on Worker:

use chrono::NaiveDate;
use worker_matcher::{MatchingEngine, PassportBook, Worker};

let alice = Worker::builder()
    .given_name("Alice")
    .family_name("Anderson")
    // Current UK passport
    .add_passport_book(
        PassportBook::new("GB", "123456789").unwrap()
            .with_issued(NaiveDate::from_ymd_opt(2024, 6, 1).unwrap())
            .with_expires(NaiveDate::from_ymd_opt(2034, 6, 1).unwrap()),
    )
    // Dual citizen: also carries a US passport
    .add_passport_book(PassportBook::new("US", "AB1234567").unwrap())
    // Historical UK book, kept for cross-system matching
    .add_passport_book(PassportBook::new("GB", "ORIGINAL000").unwrap())
    .build();

// Other system has only the historical UK book recorded.
let same_alice = Worker::builder()
    .given_name("Alice")
    .family_name("Anderson")
    .add_passport_book(PassportBook::new("GB", "original000").unwrap())
    .build();

let engine = MatchingEngine::default_config();
assert!(engine.deterministic_match(&alice, &same_alice));

Matching semantics:

  • The country is part of the comparison key — a UK AB123456 is a different identifier from a US AB123456, and they never cross-match.
  • Any shared (country, number) pair across the two workers' lists is sufficient for a deterministic match. A multi-country worker matches another record that carries any one of their books.
  • Historical and current books mix freely in the same Vec. A renewal that produces a new book number doesn't invalidate the old one for matching purposes — keep both.
  • issued / expires dates are metadata for downstream display and audit; they are NOT used in matching.
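
These semantics amount to a set intersection over (country, number) keys, sketched below in standalone Rust. The Book type and share_passport_book function are hypothetical stand-ins for PassportBook and the crate's internal check; case-insensitive comparison is an assumption inferred from the ORIGINAL000 / original000 example above.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the crate's `PassportBook`, without dates.
pub struct Book {
    pub country: String,
    pub number: String,
}

impl Book {
    pub fn new(country: &str, number: &str) -> Self {
        Book { country: country.to_string(), number: number.to_string() }
    }
}

/// Comparison key: country and number together, case-folded (assumed).
fn key(book: &Book) -> (String, String) {
    (book.country.to_ascii_uppercase(), book.number.to_ascii_uppercase())
}

/// True when the two lists share at least one (country, number) pair.
pub fn share_passport_book(a: &[Book], b: &[Book]) -> bool {
    let seen: HashSet<(String, String)> = a.iter().map(key).collect();
    b.iter().any(|book| seen.contains(&key(book)))
}
```

Because the country is part of the key, identical digits issued by two different countries never collide, and empty lists never match.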

Batch Scoring

For master-worker-index workflows, screen one query against many candidates:

use worker_matcher::{MatchingEngine, Worker};

let engine = MatchingEngine::default_config();
let query = Worker::builder().given_name("Ada").family_name("Lovelace").build();
let candidates: Vec<Worker> = vec![
    Worker::builder().given_name("Grace").family_name("Hopper").build(),
    Worker::builder().given_name("Ada").family_name("Lovelace").build(),   // best match
    Worker::builder().given_name("Alan").family_name("Turing").build(),
];

// Results are parallel to the input, so enumerate() recovers each candidate's index:
let results = engine.match_one_to_many(&query, &candidates);
for (i, r) in results.iter().enumerate() {
    println!("candidate[{i}]: score={:.2}, match={}", r.score, r.is_match);
}

// Ranked results — best first, tied scores ordered by original index:
let ranked = engine.rank_one_to_many(&query, &candidates);
let (idx, top) = &ranked[0];
println!("best match is candidate[{idx}] with score {:.2}", top.score);

The engine is Send + Sync, so wrap calls in rayon::par_iter or any other parallelism primitive without changes to this crate. Candidate-blocking (Soundex prefix, postcode outward code, date-of-birth year, …) is intentionally not baked into the API — pre-filter the candidate slice in your application layer.
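
As a rough illustration of application-layer blocking and of a deterministic ranking tiebreak, the standalone sketch below filters candidates by birth year and then orders scores best-first with ties broken by original index. The Candidate type and both functions are hypothetical; they only mirror the behaviour described above and do not call the crate.

```rust
/// Hypothetical candidate record carrying only the blocking key.
pub struct Candidate {
    pub birth_year: Option<i32>,
}

/// Indices of candidates worth scoring: same birth year as the query,
/// or birth year unknown (kept, so missing data cannot hide a true match).
pub fn block_by_birth_year(query_year: i32, candidates: &[Candidate]) -> Vec<usize> {
    candidates
        .iter()
        .enumerate()
        .filter(|(_, c)| c.birth_year.map_or(true, |y| y == query_year))
        .map(|(i, _)| i)
        .collect()
}

/// Pair each score with its index and sort by descending score;
/// equal scores keep the lower original index first.
pub fn rank(scores: &[f64]) -> Vec<(usize, f64)> {
    let mut ranked: Vec<(usize, f64)> = scores.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
}
```

Blocking trades recall for speed: any candidate the filter drops is never scored, so the blocking key should be a field that is rarely wrong or is compared tolerantly.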

Loading Config from JSON

MatchConfig, SimilarityAlgorithm, and NicknameTable all derive serde::Serialize and serde::Deserialize, so tuning parameters can live in a config file:

use worker_matcher::{MatchConfig, MatchingEngine};

let json = r#"{
    "match_threshold": 0.80,
    "phone_default_country": "US",
    "gmail_dot_folding": true
}"#;
// `#[serde(default)]` on MatchConfig: omitted fields inherit from MatchConfig::default().
let cfg: MatchConfig = serde_json::from_str(json).unwrap();
let engine = MatchingEngine::new(cfg);

Email Matching

Email addresses are normalised (trim + lowercase, structural validation) and compared for exact equality. The matcher writes Some(1.0)/Some(0.0) to MatchBreakdown::email_score, or None when either side is missing or malformed:

use worker_matcher::{MatchingEngine, Worker};

let a = Worker::builder().given_name("Alice").family_name("Anderson")
    .email("  Alice@Example.ORG  ").build();
let b = Worker::builder().given_name("Alice").family_name("Anderson")
    .email("alice@example.org").build();
let r = MatchingEngine::default_config().match_workers(&a, &b);
assert_eq!(r.breakdown.email_score, Some(1.0));

Gmail's documented routing rules treat j.smith@gmail.com, jsmith@gmail.com, and jsmith+work@gmail.com as the same mailbox. Opt in via MatchConfig::gmail_dot_folding:

use worker_matcher::{MatchConfig, MatchingEngine, Worker};
let cfg = MatchConfig { gmail_dot_folding: true, ..MatchConfig::default() };
let engine = MatchingEngine::new(cfg);

let a = Worker::builder().given_name("X").family_name("Y")
    .email("j.smith@gmail.com").build();
let b = Worker::builder().given_name("X").family_name("Y")
    .email("jsmith+work@gmail.com").build();
assert_eq!(engine.match_workers(&a, &b).breakdown.email_score, Some(1.0));
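
The folding rule itself is small enough to sketch in standalone Rust. This is an illustration, not the crate's normalize_email: the function name and signature are hypothetical, the structural validation is deliberately minimal, and only the gmail.com domain is folded.

```rust
/// Trim and lowercase an address; optionally apply Gmail folding.
/// Returns None for inputs without exactly one '@' or with an empty part.
pub fn normalize_email(input: &str, gmail_dot_folding: bool) -> Option<String> {
    let lowered = input.trim().to_lowercase();
    let (local, domain) = lowered.split_once('@')?;
    if local.is_empty() || domain.is_empty() || domain.contains('@') {
        return None;
    }
    if gmail_dot_folding && domain == "gmail.com" {
        // Drop everything from the first '+', then remove dots in the local part.
        let base = local.split('+').next().unwrap_or(local);
        let folded: String = base.chars().filter(|c| *c != '.').collect();
        if folded.is_empty() {
            return None;
        }
        return Some(format!("{folded}@{domain}"));
    }
    Some(lowered)
}
```

Keeping the folding behind a flag matters because dots and +tags are significant on many other mail providers, where folding would merge distinct mailboxes.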

Note that local_id is not scored: different organisations may issue colliding values (MRNs issued to different workers by hospitals A and B can be byte-equal), so comparing them without knowing the issuer would produce false positives.

Nickname Matching

Nicknames are an opt-in feature. Enable the built-in English dictionary on MatchConfig::nickname_table and the matcher will lift the per-name score to ≥ 0.9 for any pair the table considers equivalent:

use worker_matcher::{MatchConfig, MatchingEngine, NicknameTable, Worker};

let cfg = MatchConfig {
    nickname_table: NicknameTable::english(),
    ..MatchConfig::default()
};
let engine = MatchingEngine::new(cfg);

let a = Worker::builder().given_name("Michael").family_name("Jones").build();
let b = Worker::builder().given_name("Mike").family_name("Jones").build();
let r = engine.match_workers(&a, &b);
assert!(r.breakdown.given_name_score.unwrap() >= 0.9);

// Extend the dictionary with your own classes:
let cfg = MatchConfig {
    nickname_table: NicknameTable::english().with_class(["Reginald", "Reggie"]),
    ..MatchConfig::default()
};

The default table is empty (NicknameTable::empty()) so existing callers see no behaviour change. The exact contents of NicknameTable::english() are NOT part of the stable contract: entries may be added in minor releases. If you need deterministic behaviour across upgrades, build your own table from NicknameTable::empty() with with_class and do not rely on the built-in dictionary.
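
Conceptually, a nickname table is a set of equivalence classes plus a floor on the score. The standalone sketch below shows that idea; NicknameClasses and lifted_score are hypothetical names, the 0.9 floor is taken from the description above, and the crate's NicknameTable may be implemented quite differently.

```rust
use std::collections::HashMap;

/// Maps each lowercased name to the id of its equivalence class.
/// Simplification: a name added to a second class moves to that class.
#[derive(Default)]
pub struct NicknameClasses {
    class_of: HashMap<String, usize>,
    next_id: usize,
}

impl NicknameClasses {
    /// Add one class of mutually equivalent names.
    pub fn with_class<'a>(mut self, names: impl IntoIterator<Item = &'a str>) -> Self {
        let id = self.next_id;
        self.next_id += 1;
        for name in names {
            self.class_of.insert(name.to_lowercase(), id);
        }
        self
    }

    /// True when both names are known and belong to the same class.
    pub fn equivalent(&self, a: &str, b: &str) -> bool {
        match (
            self.class_of.get(&a.to_lowercase()),
            self.class_of.get(&b.to_lowercase()),
        ) {
            (Some(x), Some(y)) => x == y,
            _ => false,
        }
    }
}

/// Raise `base` to at least 0.9 for equivalent names; otherwise leave it alone.
pub fn lifted_score(table: &NicknameClasses, a: &str, b: &str, base: f64) -> f64 {
    if table.equivalent(a, b) {
        base.max(0.9)
    } else {
        base
    }
}
```

Using a floor and not a fixed score means a pair that already scores above 0.9 on string similarity is never pulled down by the table.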

Address Parsing

Normalizer::parse_address_line decomposes a single-line postal address into its components:

use worker_matcher::{Normalizer, ParsedAddressLine};

let p: ParsedAddressLine = Normalizer::parse_address_line("Flat 2A, 10 Downing Street");
assert_eq!(p.unit.as_deref(),         Some("flat 2a"));
assert_eq!(p.house_number.as_deref(), Some("10"));
assert_eq!(p.street,                  "downing street");

// `normalize_address_line` expands abbreviations and applies the name pipeline:
assert_eq!(
    Normalizer::normalize_address_line("123 High St"),
    Normalizer::normalize_address_line("123 High Street"),
);
assert_eq!(
    Normalizer::normalize_address_line("45 N Park Ave"),
    "45 north park avenue",
);

The matcher uses this internally so "123 High St" and "123 High Street" no longer suffer a Jaro-Winkler penalty for the abbreviation. Mismatching house numbers (e.g. "10 Downing St" vs "20 Downing St") penalise the address sub-score even when the street name is identical. See spec.md §12.4.1 / §14.4a for the algorithm.
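
The abbreviation-expansion step can be pictured as a token-by-token lookup, as in this standalone sketch. It is not the crate's normalize_address_line: the function names are hypothetical, the table holds only a few entries, and it makes no attempt to resolve ambiguous tokens such as "St" for "Saint".

```rust
/// Expand one lowercased token if it is a known abbreviation.
/// Hypothetical, deliberately tiny table.
fn expand_token(token: &str) -> &str {
    match token {
        "st" => "street",
        "rd" => "road",
        "ave" => "avenue",
        "n" => "north",
        "s" => "south",
        "e" => "east",
        "w" => "west",
        other => other,
    }
}

/// Lowercase, split on whitespace, commas, and full stops, expand each token,
/// and rejoin with single spaces.
pub fn expand_address_line(line: &str) -> String {
    line.to_lowercase()
        .split(|c: char| c.is_whitespace() || c == ',' || c == '.')
        .filter(|token| !token.is_empty())
        .map(expand_token)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Expanding before comparison means abbreviated and full forms reach the similarity function as identical strings, so only genuine differences such as the house number affect the score.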

Future Enhancements

  • Support for further national identifier schemes
  • Machine learning integration
  • Broader phone-number country coverage and mobile-vs-landline validation

License

MIT or BSD or Apache-2.0 or GPL-2.0 or GPL-3.0 or contact us for more.

Contributing

Contributions welcome! Please ensure:

  • All tests pass (cargo test)
  • Code is formatted (cargo fmt)
  • No clippy warnings (cargo clippy)

References

  1. Grannis SJ, et al. "Worker matcher within a Health Information Exchange." AMIA Annu Symp Proc. 2014.
  2. Reisman M. "Patient Identification Techniques – Approaches, Implications, and Findings." NCVHS. 2020.

Contact

For questions, issues, or contributions, contact Joel Henderson at joel@joelparkerhenderson.com, or open an issue on the project repository.