# weldrs
*Pronounced "welders"*
Fellegi-Sunter probabilistic record linkage in Rust, powered by [Polars](https://pola.rs/).
A Rust-native implementation inspired by the [Splink](https://github.com/moj-analytical-services/splink) Python package and the [fastLink](https://github.com/kosukeimai/fastLink) R package.
## Features
- **Blocking rules** — reduce the comparison space with equi-join blocking on one or more columns
- **Exact and fuzzy comparisons** — Jaro-Winkler, Levenshtein, and Jaro similarity predicates alongside exact matching
- **EM training** — unsupervised Expectation-Maximization to learn m/u probabilities
- **Fellegi-Sunter scoring** — Bayes-factor match weights and match probabilities for every candidate pair
- **Connected-components clustering** — union-find grouping of linked records
- **Model serialization** — save and load trained model parameters as JSON
- **Waterfall explanations** — step-by-step breakdowns showing why each pair received its score
## Quick Start
```rust
use polars::prelude::*;
use weldrs::comparison::ComparisonBuilder;
use weldrs::prelude::*;
fn main() -> Result<()> {
// 1. Prepare your data as a Polars DataFrame
let df = df!(
"unique_id" => [1i64, 2, 3, 4],
"first_name" => ["John", "Jane", "Jon", "Jane"],
"last_name" => ["Smith", "Doe", "Smith", "Doe"],
)?;
// 2. Define comparisons and build settings
let settings = Settings::builder(LinkType::DedupeOnly)
.comparison(
ComparisonBuilder::new("first_name")
.null_level()
.exact_match_level()
.jaro_winkler_level(0.88)
.else_level()
.build(),
)
.comparison(
ComparisonBuilder::new("last_name")
.null_level()
.exact_match_level()
.else_level()
.build(),
)
.blocking_rule(BlockingRule::on(&["last_name"]))
.build()?;
// 3. Train the model
let mut linker = Linker::new(settings)?;
let lf = df.lazy();
linker.estimate_probability_two_random_records_match(
&lf,
&[BlockingRule::on(&["first_name", "last_name"])],
1.0,
)?;
linker.estimate_u_using_random_sampling(&lf, 200)?;
linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["last_name"]))?;
// 4. Predict — score all candidate pairs
let predictions = linker.predict(&lf, None)?.collect()?;
// 5. Cluster — group linked records
let clusters = linker.cluster_pairwise_predictions(&predictions, 0.5)?;
println!("{clusters}");
Ok(())
}
```
## Concepts
**weldrs** implements the Fellegi-Sunter model of probabilistic record linkage:
- **Comparisons** define how record pairs are evaluated. Each comparison targets one input column (e.g. `first_name`) and contains multiple **levels** ordered from most to least specific (e.g. exact match, Jaro-Winkler >= 0.88, else).
- **m-probability** — the probability that a comparison level agrees, given the records *are* a true match.
- **u-probability** — the probability that a comparison level agrees, given the records *are not* a match.
- **Bayes factor** — the ratio m/u. A Bayes factor > 1 provides evidence towards a match; < 1 provides evidence against.
- **Match weight** — log2 of the Bayes factor. The final match weight for a pair is the sum of the prior (from lambda) plus each comparison's individual match weight.
- **Lambda** — the prior probability that two randomly chosen records are a match. Estimated from deterministic rules or set manually.
- **Blocking rules** — equi-join predicates that restrict which record pairs are compared, making linkage tractable on large datasets.
## Detailed Usage Guide
### Defining comparisons
Use `ComparisonBuilder` to define how columns are compared. Levels are evaluated top-to-bottom; the first matching level wins.
```rust
use weldrs::comparison::ComparisonBuilder;
let name_comparison = ComparisonBuilder::new("first_name")
.null_level() // both values null
.exact_match_level() // exact string equality
.jaro_winkler_level(0.88) // fuzzy: Jaro-Winkler >= 0.88
.else_level() // everything else
.build();
let city_comparison = ComparisonBuilder::new("city")
.null_level()
.exact_match_level()
.levenshtein_level(2) // fuzzy: edit distance <= 2
.else_level()
.build();
```
Available fuzzy predicates:
- `jaro_winkler_level(threshold)` — Jaro-Winkler similarity >= threshold
- `jaro_level(threshold)` — Jaro similarity >= threshold
- `levenshtein_level(threshold)` — Levenshtein edit distance <= threshold
### Configuring settings
```rust
use weldrs::prelude::*;
let settings = Settings::builder(LinkType::DedupeOnly)
.comparison(name_comparison)
.comparison(city_comparison)
.blocking_rule(BlockingRule::on(&["last_name"]))
.blocking_rule(BlockingRule::on(&["city"]))
.unique_id_column("record_id") // default: "unique_id"
.probability_two_random_records_match(0.001) // default: 0.0001
.build()?;
```
`LinkType` options:
- `DedupeOnly` — find duplicates within a single dataset
- `LinkOnly` — link records between two datasets
- `LinkAndDedupe` — link and deduplicate combined
### Blocking rules
Blocking rules define equi-join conditions to reduce the comparison space. Without blocking, every pair of records would be compared (quadratic).
```rust
use weldrs::prelude::*;
// Block on a single column
let rule = BlockingRule::on(&["last_name"]);
// Block on multiple columns (AND condition)
let strict_rule = BlockingRule::on(&["first_name", "last_name"]);
// Add an optional description
let rule = BlockingRule::on(&["city"]).with_description("Same city");
```
### Training
Training estimates the model parameters in three steps:
```rust
// 1. Estimate lambda (prior match probability) from deterministic rules
linker.estimate_probability_two_random_records_match(
&lf,
&[BlockingRule::on(&["first_name", "last_name"])],
1.0, // assumed recall
)?;
// 2. Estimate u-probabilities from random record pairs
linker.estimate_u_using_random_sampling(&lf, 1_000)?;
// 3. EM training passes — each pass fixes comparisons that overlap
// with the blocking rule and trains the rest
linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["last_name"]))?;
linker.estimate_parameters_using_em(&lf, &BlockingRule::on(&["city"]))?;
```
### Prediction
```rust
// Score all candidate pairs (no threshold)
let predictions = linker.predict(&lf, None)?.collect()?;
// Score with a minimum match-weight threshold
let predictions = linker.predict(&lf, Some(0.0))?.collect()?;
// Use the direct scorer (or PredictMode::Auto) for smaller candidate sets
let predictions = linker
.predict_with_mode(&lf, None, PredictMode::Direct)?
.collect()?;
```
The resulting DataFrame includes `match_weight`, `match_probability`, and individual `bf_*` columns for each comparison.
### Clustering
```rust
// Group linked records into clusters using a match-probability threshold
let clusters = linker.cluster_pairwise_predictions(&predictions, 0.5)?;
// Returns a DataFrame with columns: [unique_id, cluster_id]
```
### Explaining predictions
Waterfall charts show exactly how each comparison contributed to a pair's score:
```rust
// Explain a single pair (by row index in the predictions DataFrame)
let chart = linker.explain_pair(&predictions, 0)?;
for step in &chart.steps {
println!("{}: {} (weight: {:.2})", step.column_name, step.label, step.log2_bayes_factor);
}
// Explain multiple pairs at once
let charts = linker.explain_pairs(&predictions, &[0, 1, 2])?;
```
### Inspecting the model
```rust
let summary = linker.model_summary();
println!("Lambda: {:.6}", summary.probability_two_random_records_match);
for comp in &summary.comparisons {
println!("Comparison: {}", comp.output_column_name);
for level in &comp.levels {
println!(" {} — BF: {:?}, weight: {:?}",
level.label, level.bayes_factor, level.log2_bayes_factor);
}
}
```
### Saving and loading trained models
```rust
// Save trained model to JSON
let json = linker.save_settings_json()?;
std::fs::write("model.json", &json)?;
// Load a previously trained model
let json = std::fs::read_to_string("model.json")?;
let restored_linker = Linker::load_settings_json(&json)?;
// Use the restored linker for prediction — no retraining needed
let predictions = restored_linker.predict(&lf, None)?.collect()?;
```
## Examples
Run any example with `cargo run --example <name>`:
| `cargo run --example basic_dedup` | Full pipeline tutorial with a 10-row dataset |
| `cargo run --example fuzzy_matching` | Compare exact-only vs. fuzzy comparisons on 1K rows |
| `cargo run --example explain_predictions` | Waterfall explanation of top/bottom scoring pairs |
| `cargo run --example model_parameters` | Inspect trained m/u/BF parameters |
| `cargo run --example save_and_load` | Serialize and restore a trained model |
| `cargo run --example scaling --release` | Performance benchmark (default 100K rows) |
## License
MIT