patient-matching 0.2.0

Patient matching algorithms for healthcare information exchange
# Patient Matching Rust Crate

A comprehensive Rust library for matching patient records in healthcare information exchanges.

## Overview

This crate implements both **deterministic** and **probabilistic** patient matching algorithms based on research from:
- [Patient Matching within a Health Information Exchange](https://pmc.ncbi.nlm.nih.gov/articles/PMC4696093/)
- [Patient Identification Techniques – Approaches, Implications, and Findings](https://pmc.ncbi.nlm.nih.gov/articles/PMC7442501/)

## Features

- **Deterministic Matching**: Exact matches on NHS numbers and key demographics
- **Probabilistic Matching**: Fuzzy matching with configurable scoring thresholds
- **String Similarity Algorithms**: Jaro-Winkler and Levenshtein distance
- **NHS-Format Identifier Support**: Validation and normalization via the `nhs-number` crate
- **Phonetic Matching**: Soundex-like algorithm for names (handles "Stephen" vs "Steven")
- **Diacritic Handling**: Unicode normalisation so accented names match their unaccented form (Siân → Sian, José → Jose)
- **Address Normalization**: Postcode and street address comparison
- **Phone Number Normalization**: International / trunk-prefix stripping (`+44`, `0044`, leading `0`)
- **Configurable Weights**: Customize importance of each field
- **Serialization Support**: JSON import/export via serde
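
To make the string-similarity feature concrete, here is a minimal, std-only sketch of Levenshtein distance and a length-normalised similarity derived from it. The function names `levenshtein` and `similarity` are illustrative assumptions, not this crate's API, and the crate's own implementation may differ.

```rust
/// Levenshtein edit distance between two strings, counted in Unicode scalar values.
/// Illustrative sketch only; not the crate's implementation.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the first i chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];

    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j] + cost) // substitution (or match)
                .min(prev[j + 1] + 1) // deletion
                .min(curr[j] + 1); // insertion
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Similarity in 0.0..=1.0: 1.0 means identical, 0.0 means nothing in common.
/// Dividing by the longer length is one common normalisation (an assumption here).
fn similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        1.0
    } else {
        1.0 - levenshtein(a, b) as f64 / max_len as f64
    }
}
```

With this sketch, `similarity("John", "Jon")` is 0.75: one deletion over a maximum length of four.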

## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
patient-matching = "0.2.0"
```

## Usage

### Basic Example

```rust
use patient_matching::{MatchingEngine, Patient};
use chrono::NaiveDate;

fn main() {
    // Create two patient records
    let patient1 = Patient::builder()
        .given_name("John")
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    let patient2 = Patient::builder()
        .given_name("Jon")  // Typo
        .family_name("Smith")
        .date_of_birth(NaiveDate::from_ymd_opt(1980, 5, 15).unwrap())
        .nhs_number("1234567890")
        .build();

    // Create matching engine with default config
    let engine = MatchingEngine::default_config();

    // Match patients
    let result = engine.match_patients(&patient1, &patient2);

    println!("Match score: {:.2}", result.score);
    println!("Is match: {}", result.is_match);
    println!("Confidence: {:?}", result.confidence);
}
```

### Configurable Matching

```rust
use patient_matching::{MatchConfig, MatchingEngine};

// Strict matching (exact matches required)
let strict_engine = MatchingEngine::new(MatchConfig::strict());

// Lenient matching (more forgiving for typos)
let lenient_engine = MatchingEngine::new(MatchConfig::lenient());

// Custom configuration
let custom_config = MatchConfig {
    match_threshold: 0.90,
    nhs_number_weight: 0.40,  // Increase NHS number importance
    given_name_weight: 0.15,
    family_name_weight: 0.20,
    date_of_birth_weight: 0.15,
    use_phonetic_matching: true,
    ..Default::default()
};

let engine = MatchingEngine::new(custom_config);
```

### Deterministic Matching

```rust
// Check for exact matches only
// (reuses `engine`, `patient1`, and `patient2` from the Basic Example)
let is_deterministic_match = engine.deterministic_match(&patient1, &patient2);

if is_deterministic_match {
    println!("Exact match on NHS number or all key demographics");
}
```
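
The rule in the comment above ("exact match on NHS number or all key demographics") can be sketched without the crate as follows. The `Record` struct, the choice of key demographics (given name, family name, date of birth), and the case-insensitive name comparison are all assumptions made for illustration; the crate's `deterministic_match` may apply different rules.

```rust
/// Hypothetical, simplified record used only for this sketch.
#[derive(Debug, Clone)]
struct Record {
    nhs_number: Option<String>,
    given_name: String,
    family_name: String,
    date_of_birth: (i32, u32, u32), // (year, month, day)
}

/// Returns true when the two records agree exactly on the NHS number,
/// or on every key demographic field.
fn deterministic_match(a: &Record, b: &Record) -> bool {
    // Rule 1: both records carry an NHS number and the numbers are identical.
    let same_nhs = matches!(
        (&a.nhs_number, &b.nhs_number),
        (Some(x), Some(y)) if x == y
    );

    // Rule 2: all key demographics agree (names compared case-insensitively).
    let same_demographics = a.given_name.eq_ignore_ascii_case(&b.given_name)
        && a.family_name.eq_ignore_ascii_case(&b.family_name)
        && a.date_of_birth == b.date_of_birth;

    same_nhs || same_demographics
}
```

Under these assumed rules, "John Smith" and "Jon Smith" match deterministically only when both records carry the same NHS number; without it, the one-letter difference in the given name is enough to fail the demographic rule.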

### Detailed Match Breakdown

```rust
let result = engine.match_patients(&patient1, &patient2);

println!("Overall score: {:.2}", result.score);
println!("NHS number score: {:?}", result.breakdown.nhs_number_score);
println!("Given name score: {:?}", result.breakdown.given_name_score);
println!("Family name score: {:?}", result.breakdown.family_name_score);
println!("Date of birth score: {:?}", result.breakdown.date_of_birth_score);
println!("Address score: {:?}", result.breakdown.address_score);
println!("Phonetic name score: {:?}", result.breakdown.phonetic_name_score);
```

## Patient Data Model

The `Patient` struct supports:

- **NHS Number**: NHS-format 10-digit healthcare identifier with Modulus-11 check digit
- **Name Fields**: Given, middle, and family names
- **Date of Birth**: Birth date for age verification
- **Gender**: Male, Female, Other, Unknown
- **Address**: Multi-line address with postcode
- **Contact**: Phone, mobile, email
- **Local ID**: Hospital/practice-specific identifier

## Matching Algorithm

The matching engine uses a weighted scoring system:

| Field | Default Weight | Purpose |
|-------|----------------|---------|
| NHS Number | 30% | Strongest identifier when available |
| Family Name | 20% | Critical demographic |
| Date of Birth | 20% | Age verification |
| Given Name | 15% | Important but subject to nicknames |
| Address | 5% | Supporting evidence |
| Gender | 5% | Supporting evidence |
| Phone | 5% | Supporting evidence |
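
As a rough illustration of how the default weights in the table could combine per-field similarities into one score, here is a std-only sketch. The `FieldScores` struct and `weighted_score` function are hypothetical, and renormalising over the fields that are present is an assumption; the crate may treat missing fields differently.

```rust
/// Per-field similarity in 0.0..=1.0, or `None` when either record lacks the field.
/// Hypothetical type for this sketch only.
struct FieldScores {
    nhs_number: Option<f64>,
    family_name: Option<f64>,
    date_of_birth: Option<f64>,
    given_name: Option<f64>,
    address: Option<f64>,
    gender: Option<f64>,
    phone: Option<f64>,
}

/// Weighted average of the available field scores, using the default weights
/// from the table. Missing fields are skipped and the remaining weights are
/// renormalised (an assumption, not confirmed crate behaviour).
fn weighted_score(s: &FieldScores) -> f64 {
    let fields = [
        (s.nhs_number, 0.30),
        (s.family_name, 0.20),
        (s.date_of_birth, 0.20),
        (s.given_name, 0.15),
        (s.address, 0.05),
        (s.gender, 0.05),
        (s.phone, 0.05),
    ];

    let mut total = 0.0;
    let mut weight_sum = 0.0;
    for &(score, weight) in fields.iter() {
        if let Some(value) = score {
            total += value * weight;
            weight_sum += weight;
        }
    }

    if weight_sum == 0.0 {
        0.0 // nothing to compare
    } else {
        total / weight_sum
    }
}
```

With exact agreement on NHS number, family name, and date of birth, a given-name similarity of 0.75, and no address, gender, or phone data, this sketch yields 0.8125 / 0.85, or roughly 0.96.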

**Phonetic Matching** provides bonus points when names sound similar (e.g., "Stephen" vs "Steven").
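
To show why "Stephen" and "Steven" count as sounding alike, here is a sketch of classic American Soundex using only the standard library. The README describes the crate's algorithm as "Soundex-like", so this is a stand-in for illustration rather than the crate's exact method, and the name `soundex` is an assumption.

```rust
/// Classic American Soundex: first letter plus three digits, e.g. "Robert" -> "R163".
/// Returns `None` when the input contains no ASCII letters.
/// Stand-in sketch; the crate's "Soundex-like" algorithm may differ.
fn soundex(name: &str) -> Option<String> {
    // Consonant groups share a digit; vowels, H, W, and Y have no code.
    fn code(c: char) -> Option<char> {
        match c {
            'B' | 'F' | 'P' | 'V' => Some('1'),
            'C' | 'G' | 'J' | 'K' | 'Q' | 'S' | 'X' | 'Z' => Some('2'),
            'D' | 'T' => Some('3'),
            'L' => Some('4'),
            'M' | 'N' => Some('5'),
            'R' => Some('6'),
            _ => None,
        }
    }

    let mut letters = name
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_uppercase());

    let first = letters.next()?;
    let mut out = String::from(first);
    let mut last = code(first);

    for c in letters {
        if out.len() == 4 {
            break;
        }
        let current = code(c);
        // Emit a digit only when it differs from the previous letter's code.
        if let Some(digit) = current {
            if current != last {
                out.push(digit);
            }
        }
        // H and W do not separate repeated codes; vowels do.
        if c != 'H' && c != 'W' {
            last = current;
        }
    }

    // Pad short codes with zeros to a fixed length of four.
    while out.len() < 4 {
        out.push('0');
    }
    Some(out)
}
```

Both "Stephen" and "Steven" encode to `S315` under this scheme, so a phonetic comparison treats them as equal even though their spellings differ.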

## Research Basis

### Key Findings Applied

1. **No 100% Accuracy**: The cited research reports that even the best algorithms achieve 90-98% accuracy. This crate therefore returns a score and a confidence level for each comparison instead of a bare yes/no answer.

2. **Standardization Critical**: All inputs are normalized:
   - Names: lowercase, remove diacritics, trim spaces
   - Postcodes: uppercase, remove spaces
   - Phone numbers: remove formatting, handle country codes
   - NHS numbers: digits only

3. **Multi-Factor Approach**: Following research recommendations, matching uses multiple demographic fields rather than relying on a single identifier.

4. **Weighted Probabilistic Matching**: Combines multiple weak identifiers into a strong match signal, following best practices from health information exchanges.
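
The normalization rules listed under point 2 can be sketched with the standard library alone. All function names below are assumptions for illustration, and the small accent table in `normalize_name` is a stand-in: production code would more likely decompose text with a Unicode normalization library. The phone rules assume UK numbers, matching the `+44` / `0044` / leading `0` prefixes named in the features list.

```rust
/// Lowercases, trims, and strips common Latin diacritics (illustrative table only).
fn normalize_name(raw: &str) -> String {
    raw.trim()
        .chars()
        .flat_map(char::to_lowercase)
        // Drop combining marks left over from already-decomposed input.
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .map(|c| match c {
            'à' | 'á' | 'â' | 'ä' | 'ã' | 'å' => 'a',
            'è' | 'é' | 'ê' | 'ë' => 'e',
            'ì' | 'í' | 'î' | 'ï' => 'i',
            'ò' | 'ó' | 'ô' | 'ö' | 'õ' => 'o',
            'ù' | 'ú' | 'û' | 'ü' => 'u',
            'ç' => 'c',
            'ñ' => 'n',
            other => other,
        })
        .collect()
}

/// Uppercases a postcode and removes all whitespace.
fn normalize_postcode(raw: &str) -> String {
    raw.chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| c.to_ascii_uppercase())
        .collect()
}

/// Keeps digits only, then strips a UK country code and the trunk prefix `0`.
fn normalize_phone(raw: &str) -> String {
    let has_plus = raw.trim_start().starts_with('+');
    let digits: String = raw.chars().filter(|c| c.is_ascii_digit()).collect();

    let national: &str = if has_plus {
        digits.strip_prefix("44").unwrap_or(digits.as_str())
    } else if let Some(rest) = digits.strip_prefix("0044") {
        rest
    } else {
        digits.as_str()
    };

    national.strip_prefix('0').unwrap_or(national).to_string()
}

/// Keeps only the digits of an NHS number.
fn normalize_nhs_number(raw: &str) -> String {
    raw.chars().filter(|c| c.is_ascii_digit()).collect()
}
```

After normalization, `+44 20 7946 0958`, `0044 20 7946 0958`, and `020 7946 0958` all reduce to the same digit string, so an exact comparison is enough to treat them as the same number.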

## Testing

Run the test suite:

```bash
# Unit tests
cargo test

# Integration tests
cargo test --test integration_tests

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_fuzzy_name_match
```

### Test Coverage

- ✅ Perfect matches (100% score)
- ✅ Fuzzy name matching (typos, alternate spellings)
- ✅ Names with diacritics
- ✅ Phonetic name matching
- ✅ Phone number normalization
- ✅ Address comparison
- ✅ NHS number validation
- ✅ Deterministic matching
- ✅ Strict vs lenient modes
- ✅ Missing field handling
- ✅ Serialization/deserialization

## Example: Running the Demo

```bash
cargo run
```

This runs example scenarios including:
1. Perfect match
2. Fuzzy name match (Stephen vs Steven)
3. Names with diacritics (Siân vs Sian)
4. Address matching
5. Complete mismatch
6. Strict vs lenient comparison

## Performance Considerations

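- **Time Complexity**: Deterministic matching performs a fixed number of exact field comparisons per pair; string similarity scales with field length, up to O(m × n) for Levenshtein distance on strings of lengths m and n
- **Memory**: Minimal allocation; borrowed references are used where possible
- **Concurrency**: Thread-safe; matching operations do not mutate the engine or the patient records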

## Limitations

1. **No Machine Learning**: This is a rule-based system, not ML/AI
2. **Single Identifier Scheme**: Optimised for NHS-format check-digit identifiers; other national identifier schemes are not currently validated
3. **No Persistent Storage**: In-memory matching only
4. **No Batch Processing**: Compares one pair of patients at a time

## Future Enhancements

- [ ] Support for other national identifiers (SSN, etc.)
- [ ] Batch matching API for large datasets
- [ ] Machine learning integration
- [ ] Performance benchmarks
- [ ] More sophisticated address parsing
- [ ] Broader international phone number support (beyond `+44` / `0044` prefix stripping)

## License

MIT OR Apache-2.0

## Contributing

Contributions welcome! Please ensure:
- All tests pass (`cargo test`)
- Code is formatted (`cargo fmt`)
- No clippy warnings (`cargo clippy`)

## References

1. Grannis SJ, et al. "Patient Matching within a Health Information Exchange." AMIA Annu Symp Proc. 2014.
2. Reisman M. "Patient Identification Techniques – Approaches, Implications, and Findings." NCVHS. 2020.

## Contact

For questions, issues, or contributions, contact Joel Henderson at <joel@joelparkerhenderson.com>, or open an issue on the project repository.