Worker matcher Rust Crate
A comprehensive Rust library for matching worker records in healthcare information exchanges.
Documentation index:
index.md is the top-level map of every doc in this repo (spec, AGENTS guides, CHANGELOG, examples). Start there if you're new.
Overview
This crate implements both deterministic and probabilistic worker-matching algorithms, based on research from:
- Worker matcher within a Health Information Exchange
- Patient Identification Techniques – Approaches, Implications, and Findings
Features
- ✅ Deterministic Matching: Exact matches on NHS numbers and key demographics
- ✅ Probabilistic Matching: Fuzzy matching with configurable scoring thresholds and a Confidence band (High/Medium/Low) derived from the score for triage UIs
- ✅ Batch API: match_one_to_many scores one query against many candidates (output parallel to input); rank_one_to_many returns the same scores sorted by descending score with a deterministic tiebreak. Together they are the building block for screening against a master worker index
- ✅ String Similarity Algorithms: Jaro-Winkler and Levenshtein distance
- ✅ NHS-Format Identifier Support: Validation and normalization via the nhs-number crate
- ✅ Multinational National Identifiers (42 schemes): UK NHS Number, France NIR, España TSI, Éire IHI, UK Northern Ireland H&C Number, United States SSN, Australia IHI, Germany KVNR, Italy Codice Fiscale, Netherlands BSN, Sweden Personnummer, UK Scotland CHI Number, Belgium National Number, Bulgaria EGN, Czech Rodné číslo, Denmark CPR, Estonia Isikukood, Spain DNI/NIE, Finland HETU, Croatia OIB, Iceland Kennitala, Lithuania Asmens kodas, Latvia Personas kods, Malta National ID, Norway Fødselsnummer, Poland PESEL, Romania CNP, Slovenia EMŠO, Slovakia Rodné číslo, UK NINO, Greece DSS, Liechtenstein National ID, Netherlands National ID, Poland NIP, Portugal NIF, Brazil CPF, China Resident Identity Card, India Aadhaar, Japan My Number, Mexico CURP, New Zealand NHI, South Africa ID. Each is scheme-local with its own parser, weight, and breakdown score. Plus 9 per-country passport-format validators (CY, CZ, LI, LT, MT, NL, PT, RO, SK) that feed the multi-country PassportBook model.
- ✅ Passport Books: Vec<PassportBook> on Worker carries one entry per book with explicit country provenance. It supports dual / multi-citizenship, accumulates historical book numbers as passports are renewed, and treats any shared (country, number) pair as a deterministic match (cross-country with same digits never matches)
- ✅ Phonetic Matching: Soundex-like algorithm for names (handles "Stephen" vs "Steven")
- ✅ Blood Type Matching: BloodType enum (8 ABO+RhD variants) with a lenient parser accepting canonical (A+), word (A positive), and zero-to-O (0+) variants. Blood type is stable for life, so disagreement is a strong negative signal even though agreement alone is weak.
- ✅ Place of Birth Matching: Worker::birth_place reuses the existing Address type (FHIR Patient.birthPlace parity); dedicated city + country sub-score (0.7 × Jaro-Winkler(city) + 0.3 × exact(country) blend) is diacritic-tolerant and ignores street / postcode fields that aren't meaningful for a place of birth.
- ✅ Multiple-Birth (Twin) Disambiguation: Worker::multiple_birth carries FHIR Patient.multipleBirth (1-indexed birth order), the canonical fix for identical-twin records that share name, DOB, address, and demographic data and would otherwise be wrongly merged.
- ✅ Date of Death and Place of Death: Worker::death_date (FHIR Patient.deceasedDateTime) and Worker::death_place for deceased-worker records. Death date uses the same DD/MM ↔ MM/DD transposition heuristic as date of birth; place of death shares the 0.7 × city + 0.3 × country blend with place of birth via a shared score_named_place helper.
- ✅ Nickname Matching: Opt-in NicknameTable lifts the given-name score for known nicknames (Michael ↔ Mike, Elizabeth ↔ Liz, Robert ↔ Bob, …); built-in English dictionary plus user-extensible classes
- ✅ Diacritic Handling: Unicode normalisation so accented names match their unaccented form (Siân → Sian, José → Jose)
- ✅ Address Normalization: Postcode and street address comparison
- ✅ Sophisticated Address Parsing: Normalizer::parse_address_line extracts house number, unit (Flat/Apt/Suite/…), and street; Normalizer::expand_street_abbreviations unifies St/Street, Rd/Road, N/North, etc. so abbreviated and full forms canonicalise identically
- ✅ Email Matching: Normalizer::normalize_email canonicalises lowercase + whitespace; optional Gmail dot-folding (j.smith@gmail.com ≡ jsmith@gmail.com) and +tag stripping behind a config flag
- ✅ Phone Number Normalization: International / trunk-prefix stripping (+44, 0044, leading 0)
- ✅ International Phone Numbers (E.164): Normalizer::normalize_phone_e164 converts inputs to +CCNNN… form across 39 supported countries, covering every jurisdiction the crate parses a national identifier for (UK, FR, ES, IE, DE, IT, NL, US/CA, AU, JP, BR, BG, CZ, EE, GR, HR, IS, LI, LT, LV, MT, RO, SI, SK, …); the matcher prefers the E.164 form so a French and a UK number with overlapping digits don't collide; Lithuania's non-0 (8) national trunk prefix is handled correctly
- ✅ Configurable Weights: Customize importance of each field
- ✅ Serialization Support: JSON import/export via serde for all data types and for MatchConfig itself, so tuning parameters can be loaded from a file without recompiling
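To make the lenient blood-type parsing described in the feature list concrete, here is a minimal standalone sketch. The enum variant names and the function parse_blood_type are hypothetical and are not taken from the crate; only the three accepted input styles (canonical, word, zero-for-O) come from the text above.

```rust
/// Hypothetical stand-in for the crate's eight ABO+RhD variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BloodType {
    APos,
    ANeg,
    BPos,
    BNeg,
    AbPos,
    AbNeg,
    OPos,
    ONeg,
}

/// Lenient parse: accepts "A+", "a positive", and "0+" (digit zero for O).
pub fn parse_blood_type(input: &str) -> Option<BloodType> {
    let s = input.trim().to_ascii_uppercase();
    // Split the input into an ABO group and an RhD sign.
    let (group, rh_positive) = if let Some(g) = s.strip_suffix('+') {
        (g.trim_end(), true)
    } else if let Some(g) = s.strip_suffix('-') {
        (g.trim_end(), false)
    } else {
        // Word form: exactly two tokens, e.g. "AB NEGATIVE".
        let mut parts = s.split_whitespace();
        let g = parts.next()?;
        let rh = match parts.next()? {
            "POSITIVE" => true,
            "NEGATIVE" => false,
            _ => return None,
        };
        if parts.next().is_some() {
            return None;
        }
        (g, rh)
    };
    use BloodType::*;
    match (group, rh_positive) {
        ("A", true) => Some(APos),
        ("A", false) => Some(ANeg),
        ("B", true) => Some(BPos),
        ("B", false) => Some(BNeg),
        ("AB", true) => Some(AbPos),
        ("AB", false) => Some(AbNeg),
        // A typed zero is treated as the letter O.
        ("O", true) | ("0", true) => Some(OPos),
        ("O", false) | ("0", false) => Some(ONeg),
        _ => None,
    }
}
```

With this sketch, "A+", " a positive " and "0+" parse to APos, APos and OPos respectively, while "C+" and a bare "A" are rejected.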
Installation
Add to your Cargo.toml:
[]
= "0.1.0"
Usage
Basic Example
use ;
use NaiveDate;
Configurable Matching
use ;
// Strict matching (exact matches required)
let strict_engine = new;
// Lenient matching (more forgiving for typos)
let lenient_engine = new;
// Custom configuration
let custom_config = MatchConfig ;
let engine = new;
Deterministic Matching
// Check for exact matches only
let is_deterministic_match = engine.deterministic_match;
if is_deterministic_match
Detailed Match Breakdown
let result = engine.match_workers;
println!;
println!;
println!;
println!;
println!;
println!;
println!;
Worker Data Model
The Worker struct supports:
- NHS Number: NHS-format 10-digit healthcare identifier with Modulus-11 check digit
- Name Fields: First, middle, and family names
- Date of Birth: Birth date for age verification
- Gender: Male, Female, Other, Unknown
- Address: Multi-line address with postcode
- Contact: Phone, mobile, email
- Local ID: Hospital/practice-specific identifier
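The Modulus-11 check digit mentioned for the NHS Number field can be illustrated with a short standalone function. This is a sketch of the published check-digit rule, not the code of the nhs-number crate that this library delegates to; the function name is_valid_nhs_number is hypothetical.

```rust
/// Returns true when `input` holds ten digits whose last digit is the
/// Modulus-11 check digit of the first nine. Spaces and hyphens are ignored.
pub fn is_valid_nhs_number(input: &str) -> bool {
    let digits: Vec<u32> = input
        .chars()
        .filter(|c| !c.is_whitespace() && *c != '-')
        .map(|c| c.to_digit(10))
        .collect::<Option<Vec<u32>>>()
        .unwrap_or_default(); // any non-digit character yields an empty Vec
    if digits.len() != 10 {
        return false;
    }
    // Multiply the first nine digits by weights 10 down to 2 and sum.
    let sum: u32 = digits[..9]
        .iter()
        .zip((2..=10).rev())
        .map(|(d, w)| d * w)
        .sum();
    let check = match 11 - (sum % 11) {
        11 => 0,
        10 => return false, // no valid check digit exists for this prefix
        c => c,
    };
    check == digits[9]
}
```

For example, the commonly quoted test number 943 476 5919 passes (weighted sum 299, remainder 2, check digit 9), while changing its last digit makes it fail.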
Matching Algorithm
The matching engine uses a weighted scoring system:
| Field | Default Weight | Purpose |
|---|---|---|
| NHS Number | 30% | Strongest identifier when available |
| Family Name | 20% | Critical demographic |
| Date of Birth | 20% | Age verification |
| Given Name | 15% | Important but subject to nicknames |
| Address | 5% | Supporting evidence |
| Gender | 5% | Supporting evidence |
| Phone | 5% | Supporting evidence |
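As a rough illustration of how the default weights in the table combine, the sketch below computes a weighted mean of per-field similarities. The types and functions (FieldScore, weighted_score, default_fields) are hypothetical, and the handling of missing fields (skip them and renormalise the remaining weights) is an assumption made for this example, not a statement about the crate's behaviour.

```rust
/// One scored field: its weight and an optional similarity in [0, 1].
pub struct FieldScore {
    pub name: &'static str,
    pub weight: f64,
    pub score: Option<f64>,
}

/// Weighted mean over the fields that were actually compared.
/// ASSUMPTION: `None` fields are skipped and the remaining weights renormalised.
pub fn weighted_score(fields: &[FieldScore]) -> f64 {
    let (num, den) = fields.iter().fold((0.0, 0.0), |(n, d), f| match f.score {
        Some(s) => (n + f.weight * s.clamp(0.0, 1.0), d + f.weight),
        None => (n, d),
    });
    if den > 0.0 {
        num / den
    } else {
        0.0
    }
}

/// Pairs the default weights from the table with seven per-field scores,
/// in table order (NHS number first, phone last).
pub fn default_fields(scores: [Option<f64>; 7]) -> Vec<FieldScore> {
    let table = [
        ("nhs_number", 0.30),
        ("family_name", 0.20),
        ("date_of_birth", 0.20),
        ("given_name", 0.15),
        ("address", 0.05),
        ("gender", 0.05),
        ("phone", 0.05),
    ];
    table
        .iter()
        .zip(scores)
        .map(|(&(name, weight), score)| FieldScore { name, weight, score })
        .collect()
}
```

Under the renormalisation assumption, a pair with no NHS number, address, or phone, exact family name, date of birth and gender, and a given-name similarity of 0.8 scores 0.57 / 0.60 = 0.95.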
Phonetic Matching provides bonus points when names sound similar (e.g., "Stephen" vs "Steven").
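The README describes the phonetic step only as "Soundex-like", so the following is a sketch of classic American Soundex rather than the crate's own algorithm. It shows why "Stephen" and "Steven" are treated as sounding alike: both reduce to the same four-character code. The function names are hypothetical.

```rust
/// Soundex digit for a consonant; vowels and H/W/Y have no digit.
fn soundex_digit(c: char) -> Option<char> {
    match c {
        'B' | 'F' | 'P' | 'V' => Some('1'),
        'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
        'D' | 'T' => Some('3'),
        'L' => Some('4'),
        'M' | 'N' => Some('5'),
        'R' => Some('6'),
        _ => None,
    }
}

/// Classic four-character Soundex code, or None if the name has no ASCII letters.
pub fn soundex(name: &str) -> Option<String> {
    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase());
    let first = letters.next()?;
    let mut code = String::from(first);
    let mut last = soundex_digit(first);
    for c in letters {
        if code.len() == 4 {
            break;
        }
        let digit = soundex_digit(c);
        if let Some(d) = digit {
            // Adjacent letters with the same digit are coded once.
            if digit != last {
                code.push(d);
            }
        }
        // H and W do not separate duplicate codes; vowels do.
        if c != 'H' && c != 'W' {
            last = digit;
        }
    }
    while code.len() < 4 {
        code.push('0');
    }
    Some(code)
}
```

Both soundex("Stephen") and soundex("Steven") return S315, so a matcher could award a phonetic bonus whenever the two codes are equal.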
Research Basis
Key Findings Applied
- No 100% Accuracy: Research shows even the best algorithms achieve 90-98% accuracy. This crate aims for transparency with confidence scores.
- Standardization Critical: All inputs are normalized:
  - Names: lowercase, remove diacritics, trim spaces
  - Postcodes: uppercase, remove spaces
  - Phone numbers: remove formatting, handle country codes
  - NHS numbers: digits only
- Multi-Factor Approach: Following research recommendations, matching uses multiple demographic fields rather than relying on a single identifier.
- Weighted Probabilistic Matching: Combines multiple weak identifiers into a strong match signal, following best practices from health information exchanges.
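The standardisation rules listed under "Standardization Critical" can be sketched with the standard library alone. The function names below are hypothetical, and the diacritic folding uses a small hand-written table as a stand-in for full Unicode normalisation, which would normally come from a dedicated crate.

```rust
/// Folds a handful of precomposed Latin letters to ASCII.
/// ASSUMPTION: a tiny table standing in for real Unicode normalisation.
fn fold_diacritic(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ä' | 'ã' | 'å' => 'a',
        'é' | 'è' | 'ê' | 'ë' => 'e',
        'í' | 'ì' | 'î' | 'ï' => 'i',
        'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
        'ú' | 'ù' | 'û' | 'ü' => 'u',
        'ñ' => 'n',
        'ç' => 'c',
        other => other,
    }
}

/// Names: lowercase, fold diacritics, trim and collapse whitespace.
pub fn normalize_name(name: &str) -> String {
    name.split_whitespace()
        .map(|w| w.to_lowercase().chars().map(fold_diacritic).collect::<String>())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Postcodes: uppercase with all whitespace removed.
pub fn normalize_postcode(postcode: &str) -> String {
    postcode
        .chars()
        .filter(|c| !c.is_whitespace())
        .flat_map(|c| c.to_uppercase())
        .collect()
}

/// NHS numbers (and raw phone input): keep ASCII digits only.
pub fn digits_only(input: &str) -> String {
    input.chars().filter(|c| c.is_ascii_digit()).collect()
}
```

With these helpers, "  Siân " becomes "sian", "sw1a 2aa" becomes "SW1A2AA", and "943 476 5919" becomes "9434765919", so differently formatted inputs compare equal before any scoring happens.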
Testing
Run the test suite:
# Unit tests
# Integration tests
# Run with output
# Run specific test
# Property tests (1000 proptest cases per property)
Benchmarks
Criterion benchmarks live in benches/match_pair.rs and exercise the hot paths a downstream MPI integrator will care about:
# Run all benches (HTML reports → target/criterion/)
# Smoke run (fast, lower statistical power)
# A single bench by name
Indicative numbers on a 2024 Apple Silicon machine: single-pair fuzzy match ~4 µs, deterministic identifier hit ~160 ns, batch ranking ~3 µs per candidate — well under the spec §17 budget of < 50 µs per pair.
Test Coverage
- ✅ Perfect matches (100% score)
- ✅ Fuzzy name matching (typos, alternate spellings)
- ✅ Names with diacritics
- ✅ Phonetic name matching
- ✅ Phone number normalization
- ✅ Address comparison
- ✅ NHS number validation
- ✅ Deterministic matching
- ✅ Strict vs lenient modes
- ✅ Missing field handling
- ✅ Serialization/deserialization
Example: Running the Demo
This runs example scenarios including:
- Perfect match
- Fuzzy name match (Stephen vs Steven)
- Names with diacritics (Siân vs Sian)
- Address matching
- Complete mismatch
- Strict vs lenient comparison
Performance Considerations
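- Time Complexity: O(1) per identifier comparison for deterministic matching; string similarity grows with field length (the textbook Levenshtein dynamic program is O(m × n) in the two string lengths)
- Memory: Minimal allocation, uses borrowed references where possible
- Concurrency: Thread-safe; the engine is Send + Sync and matching does not mutate shared state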
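To show where the cost of string similarity comes from, here is a standalone sketch of the textbook Levenshtein dynamic program using two rolling rows. It is an illustration of the general technique, not the crate's implementation; the similarity scaling (1 minus distance over the longer length) is one common convention and is an assumption here.

```rust
/// Edit distance (insert, delete, substitute) between two strings,
/// computed over Unicode scalar values with two rolling rows.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitute or match
                .min(prev[j + 1] + 1) // delete from `a`
                .min(curr[j] + 1); // insert into `a`
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// ASSUMPTION: similarity is scaled as 1 - distance / longer length.
pub fn levenshtein_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}
```

The nested loop visits every pair of characters once, which is why the work is proportional to the product of the two lengths. For short fields such as names this stays small: "stephen" vs "steven" is distance 2, giving a similarity of 5/7.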
Limitations
- No Machine Learning: This is a rule-based system, not ML/AI
- Single Identifier Scheme: Optimised for NHS-format check-digit identifiers; other national identifier schemes are not currently validated
- No Persistent Storage: In-memory matching only
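- No Built-in Candidate Blocking: match_one_to_many and rank_one_to_many score every candidate they are given; pre-filtering large candidate sets is left to the application layer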
International Phone Numbers
The crate exposes two phone normalisers:
- Normalizer::normalize_phone(phone) -> String: legacy UK-centric national-significant form. Infallible.
- Normalizer::normalize_phone_e164(phone, default_country) -> Option<String>: international E.164 form (+CCNNN…). Returns None if the input cannot be confidently parsed against the supported country table.
MatchingEngine::match_workers uses the E.164 form first and falls back to the legacy form when either input cannot be parsed. The default country is configured via MatchConfig::phone_default_country (defaults to Some("GB")):
use ;
// Direct call:
assert_eq!;
assert_eq!;
// Via the matcher, with a non-UK default:
let cfg = MatchConfig ;
let engine = new;
let p1 = builder
.given_name.family_name
.phone.build;
let p2 = builder
.given_name.family_name
.phone.build;
assert_eq!;
Supported countries: UK, France, Spain, Ireland, UK Northern Ireland (via GB dial code), Germany, Italy, Netherlands, Belgium, Portugal, Switzerland, Austria, Sweden, Norway, Denmark, Finland, Poland, Australia, New Zealand, US, Canada, Japan, China, India, Brazil, Mexico, South Africa. See spec.md §14.3.2 for the full table.
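The idea behind E.164 normalisation with a default country can be sketched as follows. This is not the crate's normalize_phone_e164: the three-row country table, the function to_e164_sketch, and the length bounds are all assumptions chosen to keep the example short.

```rust
/// (ISO country, dial code, national trunk prefix).
/// ASSUMPTION: a deliberately tiny table for illustration only.
const COUNTRIES: &[(&str, &str, &str)] =
    &[("GB", "44", "0"), ("FR", "33", "0"), ("US", "1", "")];

fn assemble(dial_code: &str, national: &str) -> Option<String> {
    // E.164 allows at most 15 digits in total; the lower bound is arbitrary.
    let total = dial_code.len() + national.len();
    if national.len() >= 6 && total <= 15 {
        Some(format!("+{}{}", dial_code, national))
    } else {
        None
    }
}

pub fn to_e164_sketch(phone: &str, default_country: &str) -> Option<String> {
    let trimmed = phone.trim();
    let digits: String = trimmed.chars().filter(|c| c.is_ascii_digit()).collect();
    // International input is written "+CC..." or "00CC...".
    let international = if trimmed.starts_with('+') {
        Some(digits.as_str())
    } else {
        digits.strip_prefix("00")
    };
    let (dial_code, trunk, rest) = match international {
        Some(rest) => {
            let &(_, dial, trunk) =
                COUNTRIES.iter().find(|entry| rest.starts_with(entry.1))?;
            (dial, trunk, &rest[dial.len()..])
        }
        None => {
            // National input: interpret it in the caller's default country.
            let &(_, dial, trunk) =
                COUNTRIES.iter().find(|entry| entry.0 == default_country)?;
            (dial, trunk, digits.as_str())
        }
    };
    // Drop the trunk prefix, including a stray one as in "+44 (0)20 ...".
    let national = if trunk.is_empty() {
        rest
    } else {
        rest.strip_prefix(trunk).unwrap_or(rest)
    };
    assemble(dial_code, national)
}
```

In this sketch the national input "01 42 68 53 00" becomes +33142685300 with an FR default but +44142685300 with a GB default, which is how carrying the country code keeps numbers with overlapping digits from colliding.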
Passport Books
Passport book numbers don't fit the per-scheme Option<String> national-identifier pattern: a worker may hold passports from several countries, each book has its own number, and book numbers change with each renewal. The crate models this directly with a Vec<PassportBook> on Worker:
use NaiveDate;
use ;
let alice = builder
.given_name
.family_name
// Current UK passport
.add_passport_book
// Dual citizen: also carries a US passport
.add_passport_book
// Historical UK book, kept for cross-system matching
.add_passport_book
.build;
// Other system has only the historical UK book recorded.
let same_alice = builder
.given_name
.family_name
.add_passport_book
.build;
let engine = default_config;
assert!;
Matching semantics:
- The country is part of the comparison key: a UK AB123456 is a different identifier from a US AB123456, and they never cross-match.
- Any shared (country, number) pair across the two workers' lists is sufficient for a deterministic match. A multi-country worker matches another record that carries any one of their books.
- Historical and current books mix freely in the same Vec. A renewal that produces a new book number doesn't invalidate the old one for matching purposes, so keep both.
- issued/expires dates are metadata for downstream display and audit; they are NOT used in matching.
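The matching semantics above amount to a set intersection on (country, number) keys. The sketch below uses its own simplified PassportBook struct with hypothetical field names and omits the issued/expires metadata; the case and whitespace normalisation in key is an assumption, not documented behaviour.

```rust
use std::collections::HashSet;

/// Simplified, hypothetical stand-in for the crate's PassportBook.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PassportBook {
    pub country: String,
    pub number: String,
}

impl PassportBook {
    pub fn new(country: &str, number: &str) -> Self {
        Self { country: country.to_string(), number: number.to_string() }
    }

    /// Comparison key: the country is part of it, so equal numbers issued
    /// by different countries never collide.
    /// ASSUMPTION: keys are upper-cased and stripped of whitespace.
    fn key(&self) -> (String, String) {
        let number = self
            .number
            .chars()
            .filter(|c| !c.is_whitespace())
            .collect::<String>()
            .to_uppercase();
        (self.country.trim().to_uppercase(), number)
    }
}

/// True when the two lists share at least one (country, number) pair.
pub fn shares_passport_book(a: &[PassportBook], b: &[PassportBook]) -> bool {
    let keys: HashSet<(String, String)> = a.iter().map(PassportBook::key).collect();
    b.iter().any(|book| keys.contains(&book.key()))
}
```

A worker holding a current GB book, a US book, and a historical GB book matches a record that carries only the historical GB book, while a GB AB123456 and a US AB123456 do not match each other.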
Batch Scoring
For master-worker-index workflows, screen one query against many candidates:
use ;
let engine = default_config;
let query = builder.given_name.family_name.build;
let candidates: = vec!;
// Parallel-to-input results — keep the original index by zipping:
let results = engine.match_one_to_many;
for in results.iter.enumerate
// Ranked results — best first, tied scores ordered by original index:
let ranked = engine.rank_one_to_many;
let = &ranked;
println!;
The engine is Send + Sync, so wrap calls in rayon::par_iter or any other parallelism primitive without changes to this crate. Candidate-blocking (Soundex prefix, postcode outward code, date-of-birth year, …) is intentionally not baked into the API — pre-filter the candidate slice in your application layer.
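The ordering described for ranked results (best first, ties broken by original index) can be sketched over plain scores. The function rank_by_score is hypothetical and works on f64 values rather than the crate's match results.

```rust
/// Returns (original index, score) pairs sorted by descending score,
/// with equal scores kept in ascending index order.
pub fn rank_by_score(scores: &[f64]) -> Vec<(usize, f64)> {
    let mut ranked: Vec<(usize, f64)> = scores.iter().copied().enumerate().collect();
    // total_cmp gives a total order on f64, so the sort never panics on NaN.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
}
```

For scores [0.42, 0.91, 0.91, 0.10] the result is indices 1, 2, 0, 3: the two tied candidates keep their input order, which makes the ranking reproducible across runs.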
Loading Config from JSON
MatchConfig, SimilarityAlgorithm, and NicknameTable all derive serde::Serialize and serde::Deserialize, so tuning parameters can live in a config file:
use ;
let json = r#"{
"match_threshold": 0.80,
"phone_default_country": "US",
"gmail_dot_folding": true
}"#;
// `#[serde(default)]` on MatchConfig: omitted fields inherit from MatchConfig::default().
let cfg: MatchConfig = from_str.unwrap;
let engine = new;
# let _ = engine;
Email Matching
Email addresses are normalised (trim + lowercase, structural validation) and compared for exact equality. The matcher writes Some(1.0)/Some(0.0) to MatchBreakdown::email_score, or None when either side is missing or malformed:
use ;
let a = builder.given_name.family_name
.email.build;
let b = builder.given_name.family_name
.email.build;
let r = default_config.match_workers;
assert_eq!;
Gmail's documented routing rules treat j.smith@gmail.com, jsmith@gmail.com, and jsmith+work@gmail.com as the same mailbox. Opt in via MatchConfig::gmail_dot_folding:
# use ;
let cfg = MatchConfig ;
let engine = new;
let a = builder.given_name.family_name
.email.build;
let b = builder.given_name.family_name
.email.build;
assert_eq!;
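For readers who want to see the folding rules spelled out, here is a standalone sketch of email normalisation with optional Gmail dot-folding and +tag stripping. The function normalize_email_sketch and its boolean flag are hypothetical, and the structural check is intentionally minimal; the crate's own validation may be stricter.

```rust
/// Trim + lowercase, then a minimal structural check; optionally fold
/// Gmail local parts by stripping "+tag" and removing dots.
pub fn normalize_email_sketch(email: &str, gmail_folding: bool) -> Option<String> {
    let lowered = email.trim().to_lowercase();
    if lowered.chars().any(char::is_whitespace) {
        return None;
    }
    let (local, domain) = lowered.split_once('@')?;
    let domain_ok = !domain.contains('@')
        && domain.contains('.')
        && !domain.starts_with('.')
        && !domain.ends_with('.');
    if local.is_empty() || !domain_ok {
        return None;
    }
    if gmail_folding && domain == "gmail.com" {
        // Everything after the first '+' is a tag; dots are ignored by Gmail.
        let base = local.split('+').next().unwrap_or(local);
        let folded: String = base.chars().filter(|c| *c != '.').collect();
        if folded.is_empty() {
            return None;
        }
        return Some(format!("{}@{}", folded, domain));
    }
    Some(format!("{}@{}", local, domain))
}
```

With folding enabled, j.smith+work@gmail.com and jsmith@gmail.com both normalise to jsmith@gmail.com; addresses on other domains are left untouched, since dots and tags are only known to be insignificant for Gmail.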
Note that local_id is not scored: different organisations may issue colliding values (different workers' MRNs from hospitals A and B can be byte-equal), so positional matching would produce false positives.
Nickname Matching
Nicknames are an opt-in feature. Enable the built-in English dictionary on MatchConfig::nickname_table and the matcher will lift the per-name score to ≥ 0.9 for any pair the table considers equivalent:
use ;
let cfg = MatchConfig ;
let engine = new;
let a = builder.given_name.family_name.build;
let b = builder.given_name.family_name.build;
let r = engine.match_workers;
assert!;
// Extend the dictionary with your own classes:
let cfg = MatchConfig ;
The default table is empty (NicknameTable::empty()) so existing callers see no behaviour change. The exact contents of NicknameTable::english() are NOT part of the stable contract — entries may be added in minor releases. Pin a custom table via with_class if you need deterministic behaviour across upgrades.
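The "lift to at least 0.9" behaviour can be pictured as a lookup in a set of equivalence classes followed by a max. The NicknameClasses type below is a hypothetical miniature written for this example; it borrows the with_class name from the text above, but its signature and internals are assumptions rather than the crate's API.

```rust
/// Hypothetical miniature of a nickname table: each class is a set of
/// given names considered equivalent.
#[derive(Debug, Default)]
pub struct NicknameClasses {
    classes: Vec<Vec<String>>,
}

impl NicknameClasses {
    pub fn empty() -> Self {
        Self::default()
    }

    /// Adds one equivalence class, stored lowercase.
    pub fn with_class(mut self, names: &[&str]) -> Self {
        self.classes.push(names.iter().map(|n| n.to_lowercase()).collect());
        self
    }

    /// True when some class contains both names (case-insensitive).
    pub fn equivalent(&self, a: &str, b: &str) -> bool {
        let (a, b) = (a.trim().to_lowercase(), b.trim().to_lowercase());
        self.classes.iter().any(|class| class.contains(&a) && class.contains(&b))
    }
}

/// Raises the base similarity to at least 0.9 for known nickname pairs.
pub fn lift_given_name_score(base: f64, a: &str, b: &str, table: &NicknameClasses) -> f64 {
    if table.equivalent(a, b) {
        base.max(0.9)
    } else {
        base
    }
}
```

With a class containing michael, mike and mick, a base similarity of 0.55 between "Michael" and "Mike" is lifted to 0.9, a base of 0.95 is left alone, and an empty table changes nothing, which mirrors the opt-in behaviour described above.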
Address Parsing
Normalizer::parse_address_line decomposes a single-line postal address into its components:
use ;
let p: ParsedAddressLine = parse_address_line;
assert_eq!;
assert_eq!;
assert_eq!;
// `normalize_address_line` expands abbreviations and applies the name pipeline:
assert_eq!;
assert_eq!;
The matcher uses this internally so "123 High St" and "123 High Street" no longer suffer a Jaro-Winkler penalty for the abbreviation. Mismatching house numbers (e.g. "10 Downing St" vs "20 Downing St") penalise the address sub-score even when the street name is identical. See spec.md §12.4.1 / §14.4a for the algorithm.
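A rough sketch of the decomposition and abbreviation expansion described here is shown below. The ParsedLine struct, the parse_line function, and the short abbreviation and unit lists are all hypothetical; the real algorithm is the one defined in the spec sections cited above.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedLine {
    pub unit: Option<String>,
    pub house_number: Option<String>,
    pub street: String,
}

/// Expands a few common abbreviations. A real table is larger and has to
/// deal with ambiguity ("st" can also mean "saint").
fn expand_token(token: &str) -> &str {
    match token {
        "st" => "street",
        "rd" => "road",
        "ave" => "avenue",
        "ln" => "lane",
        "n" => "north",
        "s" => "south",
        "e" => "east",
        "w" => "west",
        other => other,
    }
}

pub fn parse_line(line: &str) -> ParsedLine {
    let lowered = line.to_lowercase();
    let mut tokens: Vec<&str> = lowered
        .split(|c: char| c.is_whitespace() || c == ',')
        .map(|t| t.trim_matches('.'))
        .filter(|t| !t.is_empty())
        .collect();
    // Optional leading unit such as "flat 2" or "apt 4b".
    let mut unit = None;
    if tokens.len() >= 2 && matches!(tokens[0], "flat" | "apt" | "suite" | "unit") {
        unit = Some(format!("{} {}", tokens[0], tokens[1]));
        tokens.drain(..2);
    }
    // Optional house number: the next token if it starts with a digit.
    let mut house_number = None;
    if tokens.first().map_or(false, |t| t.starts_with(|c: char| c.is_ascii_digit())) {
        house_number = Some(tokens.remove(0).to_string());
    }
    let street = tokens
        .iter()
        .map(|&t| expand_token(t))
        .collect::<Vec<_>>()
        .join(" ")
;
    ParsedLine { unit, house_number, street }
}
```

In this sketch "123 High St" and "123 High Street" yield the same street, "high street", while "10 Downing St" and "20 Downing St" share a street but differ in house number, which is the signal a matcher can penalise separately.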
Future Enhancements
- Support for other national identifiers (SSN, etc.)
- Batch matching API for large datasets
- Machine learning integration
- Performance benchmarks
- More sophisticated address parsing
- Broader phone-number country coverage and mobile-vs-landline validation
License
MIT or BSD or Apache-2.0 or GPL-2.0 or GPL-3.0 or contact us for more.
Contributing
Contributions welcome! Please ensure:
- All tests pass (cargo test)
- Code is formatted (cargo fmt)
- No clippy warnings (cargo clippy)
References
- Grannis SJ, et al. "Worker matcher within a Health Information Exchange." AMIA Annu Symp Proc. 2014.
- Reisman M. "Patient Identification Techniques – Approaches, Implications, and Findings." NCVHS. 2020.
Contact
For questions, issues, or contributions, contact Joel Henderson at joel@joelparkerhenderson.com, or open an issue on the project repository.