# Patient Matching Rust Crate
A Rust library for matching patient records across health information exchanges.
## Overview
This crate implements both **deterministic** and **probabilistic** patient matching algorithms based on research from:
- [Patient Matching within a Health Information Exchange](https://pmc.ncbi.nlm.nih.gov/articles/PMC4696093/)
- [Patient Identification Techniques – Approaches, Implications, and Findings](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442501/)
## Features
- ✅ **Deterministic Matching**: Exact matches on NHS numbers and key demographics
- ✅ **Probabilistic Matching**: Fuzzy matching with configurable scoring thresholds
- ✅ **String Similarity Algorithms**: Jaro-Winkler and Levenshtein distance
- ✅ **NHS-Format Identifier Support**: Validation and normalization via the `nhs-number` crate
- ✅ **Phonetic Matching**: Soundex-like algorithm for names (handles "Stephen" vs "Steven")
- ✅ **Diacritic Handling**: Unicode normalisation so accented names match their unaccented form (Siân → Sian, José → Jose)
- ✅ **Address Normalization**: Postcode and street address comparison
- ✅ **Phone Number Normalization**: International / trunk-prefix stripping (`+44`, `0044`, leading `0`)
- ✅ **Configurable Weights**: Customize importance of each field
- ✅ **Serialization Support**: JSON import/export via serde
## Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
patient-matching = "0.1.0"
```
## Usage
### Basic Example
```rust
use chrono::NaiveDate;
use patient_matching::{MatchingEngine, Patient};

fn main() {
    // Create two patient records.
    // Note: "1234567890" is a placeholder; it does not pass the Modulus-11 check.
    let patient1 = Patient::builder()
        .given_name("John")
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    let patient2 = Patient::builder()
        .given_name("Jon") // Typo
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    // Create matching engine with default config
    let engine = MatchingEngine::default_config();

    // Match patients
    let result = engine.match_patients(&patient1, &patient2);
    println!("Match score: {:.2}", result.score);
    println!("Is match: {}", result.is_match);
    println!("Confidence: {:?}", result.confidence);
}
```
### Configurable Matching
```rust
use patient_matching::{MatchConfig, MatchingEngine};

// Strict matching (exact matches required)
let strict_engine = MatchingEngine::new(MatchConfig::strict());

// Lenient matching (more forgiving for typos)
let lenient_engine = MatchingEngine::new(MatchConfig::lenient());

// Custom configuration
let custom_config = MatchConfig {
    match_threshold: 0.90,
    nhs_number_weight: 0.40, // Increase NHS number importance
    given_name_weight: 0.15,
    family_name_weight: 0.20,
    date_of_birth_weight: 0.15,
    use_phonetic_matching: true,
    ..Default::default()
};
let engine = MatchingEngine::new(custom_config);
```
### Deterministic Matching
```rust
// Check for exact matches only
let is_deterministic_match = engine.deterministic_match(&patient1, &patient2);
if is_deterministic_match {
    println!("Exact match on NHS number or all key demographics");
}
```
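The rule named in the snippet above (an exact NHS number match, or exact agreement on all key demographics) could be sketched without the crate as follows. The `Record` struct, the free function, and the choice of given name, family name, and date of birth as the key demographics are assumptions made for illustration; they are not the crate's actual types or rule set.

```rust
/// Hypothetical, pared-down record used only for this sketch.
struct Record<'a> {
    nhs_number: Option<&'a str>,
    given_name: &'a str,
    family_name: &'a str,
    date_of_birth: (i32, u32, u32), // (year, month, day)
}

/// Returns true when the two records agree exactly under either rule.
fn deterministic_match(a: &Record<'_>, b: &Record<'_>) -> bool {
    // Rule 1: both records carry an NHS number and the numbers are identical.
    if let (Some(x), Some(y)) = (a.nhs_number, b.nhs_number) {
        if x == y {
            return true;
        }
    }
    // Rule 2: every key demographic agrees exactly (names compared
    // case-insensitively, which is an assumption of this sketch).
    a.given_name.eq_ignore_ascii_case(b.given_name)
        && a.family_name.eq_ignore_ascii_case(b.family_name)
        && a.date_of_birth == b.date_of_birth
}
```

Under this sketch, "John Smith" and "Jon Smith" with the same NHS number match deterministically, while the same pair without NHS numbers does not and would be left to probabilistic scoring.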
### Detailed Match Breakdown
```rust
let result = engine.match_patients(&patient1, &patient2);
println!("Overall score: {:.2}", result.score);
println!("NHS number score: {:?}", result.breakdown.nhs_number_score);
println!("Given name score: {:?}", result.breakdown.given_name_score);
println!("Family name score: {:?}", result.breakdown.family_name_score);
println!("Date of birth score: {:?}", result.breakdown.date_of_birth_score);
println!("Address score: {:?}", result.breakdown.address_score);
println!("Phonetic name score: {:?}", result.breakdown.phonetic_name_score);
```
## Patient Data Model
The `Patient` struct supports:
- **NHS Number**: NHS-format 10-digit healthcare identifier with Modulus-11 check digit
- **Name Fields**: Given, middle, and family names
- **Date of Birth**: Birth date for age verification
- **Gender**: Male, Female, Other, Unknown
- **Address**: Multi-line address with postcode
- **Contact**: Phone, mobile, email
- **Local ID**: Hospital/practice-specific identifier
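The Modulus-11 check mentioned for the NHS number can be illustrated with a standalone function. The crate delegates validation to the `nhs-number` crate, so this is only a sketch of the published check-digit algorithm; the function name `is_valid_nhs_format` is hypothetical.

```rust
/// Returns true if `input` is a 10-digit identifier whose last digit is a
/// valid Modulus-11 check digit. Whitespace is ignored.
fn is_valid_nhs_format(input: &str) -> bool {
    // Any non-digit character makes the whole input invalid.
    let digits: Vec<u32> = input
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_digit(10))
        .collect::<Option<Vec<u32>>>()
        .unwrap_or_default();
    if digits.len() != 10 {
        return false;
    }
    // Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    let sum: u32 = digits[..9]
        .iter()
        .zip((2u32..=10).rev())
        .map(|(d, w)| *d * w)
        .sum();
    let check = match 11 - (sum % 11) {
        11 => 0,
        10 => return false, // no valid check digit exists for this prefix
        n => n,
    };
    check == digits[9]
}
```

For example, `9434765919` passes this check, whereas the placeholder `1234567890` used elsewhere in this README does not.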
## Matching Algorithm
The matching engine uses a weighted scoring system:
| Field | Weight | Rationale |
|---|---|---|
| NHS Number | 30% | Strongest identifier when available |
| Family Name | 20% | Critical demographic |
| Date of Birth | 20% | Age verification |
| Given Name | 15% | Important but subject to nicknames |
| Address | 5% | Supporting evidence |
| Gender | 5% | Supporting evidence |
| Phone | 5% | Supporting evidence |
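As a rough illustration of how the weights in the table could combine into one score, the sketch below takes a weighted average of per-field similarities. The function name `weighted_score` and the handling of missing fields (dropping their weight and renormalizing over the rest) are assumptions, not documented crate behavior.

```rust
/// Combine per-field similarity scores (each in 0.0..=1.0) into one score.
/// Each entry is `(weight, score)`; a field missing on either record is
/// passed as `None` and its weight is excluded from the average.
fn weighted_score(fields: &[(f64, Option<f64>)]) -> f64 {
    let (weighted, total_weight) = fields
        .iter()
        .filter_map(|&(weight, score)| score.map(|s| (weight * s, weight)))
        .fold((0.0, 0.0), |(ws, tw), (w, t)| (ws + w, tw + t));
    if total_weight == 0.0 {
        0.0 // nothing comparable, so no evidence of a match
    } else {
        weighted / total_weight
    }
}
```

With the table's weights, exact agreement on NHS number, family name, date of birth, and gender, a given-name similarity of 0.9, and no address or phone on file, the result is 0.885 / 0.90, or about 0.98.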
**Phonetic Matching** provides bonus points when names sound similar (e.g., "Stephen" vs "Steven").
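To show why "Stephen" and "Steven" count as sounding alike, here is a sketch of classic American Soundex. The crate describes its algorithm only as "Soundex-like", so this stands in for it and may differ in detail; the function name `soundex` is hypothetical and only ASCII letters are handled.

```rust
/// Classic American Soundex code (one letter plus three digits).
/// Returns `None` if the name contains no ASCII letters.
fn soundex(name: &str) -> Option<String> {
    fn digit(c: char) -> Option<char> {
        match c {
            'b' | 'f' | 'p' | 'v' => Some('1'),
            'c' | 'g' | 'j' | 'k' | 'q' | 's' | 'x' | 'z' => Some('2'),
            'd' | 't' => Some('3'),
            'l' => Some('4'),
            'm' | 'n' => Some('5'),
            'r' => Some('6'),
            _ => None, // vowels, h, w, and y carry no code
        }
    }

    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase());

    let first = letters.next()?;
    let mut code = String::with_capacity(4);
    code.push(first.to_ascii_uppercase());

    let mut previous = digit(first);
    for c in letters {
        let current = digit(c);
        if let Some(d) = current {
            // Adjacent letters with the same code are encoded once.
            if current != previous {
                code.push(d);
                if code.len() == 4 {
                    break;
                }
            }
        }
        // 'h' and 'w' do not separate same-coded letters; vowels do.
        if c != 'h' && c != 'w' {
            previous = current;
        }
    }
    // Pad short codes with zeros.
    while code.len() < 4 {
        code.push('0');
    }
    Some(code)
}
```

Both "Stephen" and "Steven" encode to `S315` here, so a phonetic comparison treats them as equal even though their spellings differ.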
## Research Basis
### Key Findings Applied
1. **No 100% Accuracy**: Research shows even the best algorithms achieve 90-98% accuracy. This crate aims for transparency with confidence scores.
2. **Standardization Critical**: All inputs are normalized:
- Names: lowercase, remove diacritics, trim spaces
- Postcodes: uppercase, remove spaces
- Phone numbers: remove formatting, handle country codes
- NHS numbers: digits only
3. **Multi-Factor Approach**: Following research recommendations, matching uses multiple demographic fields rather than relying on a single identifier.
4. **Weighted Probabilistic Matching**: Combines multiple weak identifiers into a strong match signal, following best practices from health information exchanges.
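The normalization rules listed under finding 2 can be sketched with the standard library alone. All function names here are hypothetical, the diacritic table covers only a handful of precomposed Latin letters, and the phone logic assumes UK prefixes; the crate's own implementation may differ.

```rust
/// Names: trim, lowercase, and fold a few common accented letters.
fn normalize_name(s: &str) -> String {
    s.trim()
        .chars()
        .flat_map(char::to_lowercase)
        .map(|c| match c {
            'á' | 'à' | 'â' | 'ä' | 'ã' | 'å' => 'a',
            'é' | 'è' | 'ê' | 'ë' => 'e',
            'í' | 'ì' | 'î' | 'ï' => 'i',
            'ó' | 'ò' | 'ô' | 'ö' | 'õ' => 'o',
            'ú' | 'ù' | 'û' | 'ü' => 'u',
            'ñ' => 'n',
            'ç' => 'c',
            other => other,
        })
        .collect()
}

/// Postcodes: uppercase, remove spaces.
fn normalize_postcode(s: &str) -> String {
    s.chars()
        .filter(|c| !c.is_whitespace())
        .flat_map(char::to_uppercase)
        .collect()
}

/// NHS numbers: digits only.
fn normalize_nhs_number(s: &str) -> String {
    s.chars().filter(char::is_ascii_digit).collect()
}

/// Phone numbers: drop formatting, then strip `+44`, `0044`, and a leading `0`.
fn normalize_phone(s: &str) -> String {
    let international = s.trim_start().starts_with('+');
    let digits: String = s.chars().filter(char::is_ascii_digit).collect();
    let rest = if international {
        digits.strip_prefix("44").unwrap_or(digits.as_str())
    } else {
        digits.strip_prefix("0044").unwrap_or(digits.as_str())
    };
    rest.strip_prefix('0').unwrap_or(rest).to_string()
}
```

After normalization, `+44 20 7946 0958`, `0044 20 7946 0958`, and `020 7946 0958` all reduce to the same digit string, so they compare equal.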
## Testing
Run the test suite:
```bash
# Unit tests
cargo test
# Integration tests
cargo test --test integration_tests
# Run with output
cargo test -- --nocapture
# Run specific test
cargo test test_fuzzy_name_match
```
### Test Coverage
- ✅ Perfect matches (100% score)
- ✅ Fuzzy name matching (typos, alternate spellings)
- ✅ Names with diacritics
- ✅ Phonetic name matching
- ✅ Phone number normalization
- ✅ Address comparison
- ✅ NHS number validation
- ✅ Deterministic matching
- ✅ Strict vs lenient modes
- ✅ Missing field handling
- ✅ Serialization/deserialization
## Example: Running the Demo
```bash
cargo run
```
This runs example scenarios including:
1. Perfect match
2. Fuzzy name match (Stephen vs Steven)
3. Names with diacritics (Siân vs Sian)
4. Address matching
5. Complete mismatch
6. Strict vs lenient comparison
## Performance Considerations
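- **Time Complexity**: Deterministic matching is a fixed number of field comparisons per pair. String similarity cost grows with field length; for example, the standard Levenshtein algorithm is O(m·n) in the lengths of the two strings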
- **Memory**: Minimal allocation, uses borrowed references where possible
- **Concurrency**: Thread-safe; matching operations do not mutate their inputs
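To make the cost of string similarity concrete, here is a sketch of the standard dynamic-programming Levenshtein distance with two rolling rows. It is a textbook version, not necessarily the implementation this crate uses; the names `levenshtein` and `similarity` are illustrative.

```rust
/// Levenshtein edit distance using two rolling rows.
/// Runs in O(m * n) time and O(n) extra space for inputs of m and n characters.
fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    let mut curr = vec![0; b_chars.len() + 1];
    for (i, ca) in a.chars().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b_chars.len()]
}

/// Distance scaled to a 0.0..=1.0 similarity, where 1.0 means identical.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

Names are short, so the quadratic cost per pair is small in practice; it matters mainly when many pairs are compared.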
## Limitations
1. **No Machine Learning**: This is a rule-based system, not ML/AI
2. **Single Identifier Scheme**: Optimised for NHS-format check-digit identifiers; other national identifier schemes are not currently validated
3. **No Persistent Storage**: In-memory matching only
4. **No Batch Processing**: Processes pairs of patients
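Because the crate compares one pair at a time, callers who need to search a list have to loop themselves. A minimal sketch of such a loop is shown below; `best_match` is a hypothetical helper, and the `score` closure stands in for whatever pairwise scorer is used (for instance, one that wraps `engine.match_patients` and returns its score).

```rust
/// Score `probe` against every candidate and return the index and score of
/// the best candidate at or above `threshold`, if any.
fn best_match<T>(
    probe: &T,
    candidates: &[T],
    threshold: f64,
    score: impl Fn(&T, &T) -> f64,
) -> Option<(usize, f64)> {
    candidates
        .iter()
        .enumerate()
        .map(|(i, c)| (i, score(probe, c)))
        .filter(|&(_, s)| s >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

This is a linear scan, so matching every record in a dataset against every other record is quadratic in the number of records.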
## Future Enhancements
- [ ] Support for other national identifiers (SSN, etc.)
- [ ] Batch matching API for large datasets
- [ ] Machine learning integration
- [ ] Performance benchmarks
- [ ] More sophisticated address parsing
- [ ] International phone number support
## License
MIT OR Apache-2.0
## Contributing
Contributions welcome! Please ensure:
- All tests pass (`cargo test`)
- Code is formatted (`cargo fmt`)
- No clippy warnings (`cargo clippy`)
## References
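1. Godlove T, Ball AW. "Patient Matching within a Health Information Exchange." Perspect Health Inf Manag. 2015. <https://pmc.ncbi.nlm.nih.gov/articles/PMC4696093/>
2. Riplinger L, Piera-Jiménez J, Dooling JP. "Patient Identification Techniques – Approaches, Implications, and Findings." Yearb Med Inform. 2020. <https://pmc.ncbi.nlm.nih.gov/articles/PMC7442501/>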
## Contact
For questions, issues, or contributions, contact Joel Henderson at <joel@joelparkerhenderson.com>, or open an issue on the project repository.