# weldrs

Pronounced "welders".

Fellegi-Sunter probabilistic record linkage in Rust, powered by Polars. A Rust-native implementation inspired by the Splink Python package and the fastLink R package.
## Features
- Blocking rules — reduce the comparison space with equi-join blocking on one or more columns
- Exact and fuzzy comparisons — Jaro-Winkler, Levenshtein, and Jaro similarity predicates alongside exact matching
- EM training — unsupervised Expectation-Maximization to learn m/u probabilities
- Fellegi-Sunter scoring — Bayes-factor match weights and match probabilities for every candidate pair
- Connected-components clustering — union-find grouping of linked records
- Model serialization — save and load trained model parameters as JSON
- Waterfall explanations — step-by-step breakdowns showing why each pair received its score
## Quick Start

The imports below get you started; the full pipeline (comparisons, settings, training, prediction) is walked through in the Detailed Usage Guide.

```rust
// Module paths are illustrative; check the crate docs for the exact layout.
use weldrs::*;
use weldrs::ComparisonBuilder;
```
## Concepts

weldrs implements the Fellegi-Sunter model of probabilistic record linkage:

- Comparisons — define how record pairs are evaluated. Each comparison targets one input column (e.g. `first_name`) and contains multiple levels ordered from most to least specific (e.g. exact match, Jaro-Winkler >= 0.88, else).
- m-probability — the probability that a comparison level agrees, given the records are a true match.
- u-probability — the probability that a comparison level agrees, given the records are not a match.
- Bayes factor — the ratio m/u. A Bayes factor > 1 provides evidence towards a match; < 1 provides evidence against.
- Match weight — log2 of the Bayes factor. The final match weight for a pair is the sum of the prior weight (from lambda) and each comparison's individual match weight.
- Lambda — the prior probability that two randomly chosen records are a match. Estimated from deterministic rules or set manually.
- Blocking rules — equi-join predicates that restrict which record pairs are compared, making linkage tractable on large datasets.
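The scoring arithmetic above can be sketched in a few lines. This is a standalone illustration, not the weldrs API; the m/u values and lambda are made up:

```rust
// Fellegi-Sunter scoring sketch: each fired comparison level contributes
// a Bayes factor m/u, and the final match weight is the prior weight
// (from lambda) plus the sum of the log2 Bayes factors.

/// Match weight = log2(lambda / (1 - lambda)) + sum of log2(m / u).
fn match_weight(lambda: f64, fired_levels: &[(f64, f64)]) -> f64 {
    (lambda / (1.0 - lambda)).log2()
        + fired_levels.iter().map(|(m, u)| (m / u).log2()).sum::<f64>()
}

/// Convert a match weight back to a match probability.
fn match_probability(weight: f64) -> f64 {
    let bf = 2f64.powf(weight);
    bf / (1.0 + bf)
}

fn main() {
    let lambda = 0.0001; // prior: 1 in 10,000 random pairs match
    let fired = [
        (0.95, 0.01), // e.g. first_name exact match: strong evidence (BF = 95)
        (0.80, 0.10), // e.g. city exact match: weaker evidence (BF = 8)
    ];
    let w = match_weight(lambda, &fired);
    println!("match weight: {w:.3}");
    println!("match probability: {:.4}", match_probability(w));
}
```

Note that even two agreeing comparisons may not overcome a very small prior: the strongly negative prior weight dominates unless the accumulated Bayes factors are large.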
## Detailed Usage Guide

### Defining comparisons

Use `ComparisonBuilder` to define how columns are compared. Levels are evaluated top to bottom; the first matching level wins.
```rust
use weldrs::ComparisonBuilder;

// Column-name arguments are illustrative.
let name_comparison = ComparisonBuilder::new("first_name")
    .null_level()              // both values null
    .exact_match_level()       // exact string equality
    .jaro_winkler_level(0.88)  // fuzzy: Jaro-Winkler >= 0.88
    .else_level()              // everything else
    .build();

let city_comparison = ComparisonBuilder::new("city")
    .null_level()
    .exact_match_level()
    .levenshtein_level(2)      // fuzzy: edit distance <= 2
    .else_level()
    .build();
```
Available fuzzy predicates:

- `jaro_winkler_level(threshold)` — Jaro-Winkler similarity >= threshold
- `jaro_level(threshold)` — Jaro similarity >= threshold
- `levenshtein_level(threshold)` — Levenshtein edit distance <= threshold
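To make the edit-distance predicate concrete, here is a hand-rolled Levenshtein distance. It is purely illustrative; weldrs ships its own predicate implementations:

```rust
// Classic dynamic-programming Levenshtein: the minimum number of single
// character insertions, deletions, and substitutions to turn `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between a[..i] and b[..j] from the previous row.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

fn main() {
    // "Jon" vs "John" is one insertion away, so a levenshtein_level(2)
    // would fire for this pair while an exact-match level would not.
    println!("distance: {}", levenshtein("Jon", "John")); // prints "distance: 1"
}
```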
### Configuring settings

```rust
use weldrs::*;

// Builder entry point and argument values are illustrative.
let settings = SettingsBuilder::new(LinkType::DedupeOnly)
    .comparison(name_comparison)
    .comparison(city_comparison)
    .blocking_rule(BlockingRule::on("city"))
    .blocking_rule(BlockingRule::on("surname"))
    .unique_id_column("unique_id")                // default: "unique_id"
    .probability_two_random_records_match(0.0001) // default: 0.0001
    .build()?;
```
`LinkType` options:

- `DedupeOnly` — find duplicates within a single dataset
- `LinkOnly` — link records between two datasets
- `LinkAndDedupe` — link and deduplicate combined
### Blocking rules

Blocking rules define equi-join conditions that reduce the comparison space. Without blocking, every pair of records would be compared, which is quadratic in the number of rows.

```rust
use weldrs::*;

// Column names and the description string are illustrative.

// Block on a single column
let rule = BlockingRule::on("city");

// Block on multiple columns (AND condition)
let strict_rule = BlockingRule::on(["city", "surname"]);

// Add an optional description
let rule = BlockingRule::on("city").with_description("same city");
```
### Training

Training estimates the model parameters in three steps:

```rust
// Argument values below are illustrative.

// 1. Estimate lambda (prior match probability) from deterministic rules
linker.estimate_probability_two_random_records_match(&deterministic_rules)?;

// 2. Estimate u-probabilities from random record pairs
linker.estimate_u_using_random_sampling(1_000_000)?;

// 3. EM training passes — each pass fixes comparisons that overlap
//    with the blocking rule and trains the rest
linker.estimate_parameters_using_em(BlockingRule::on("city"))?;
linker.estimate_parameters_using_em(BlockingRule::on("first_name"))?;
```
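To show what an EM pass estimates, here is a toy EM loop for a single binary comparison. It is a self-contained sketch of the general algorithm, not the weldrs internals; the agreement data and starting values are made up, and a real model trains many comparisons jointly:

```rust
// Toy Expectation-Maximization for one binary comparison: alternate
// between computing posterior match probabilities (E-step) and
// re-estimating m, u, and lambda from expected counts (M-step).
fn em(agreements: &[bool], iters: usize) -> (f64, f64, f64) {
    // Starting guesses for m, u, and lambda.
    let (mut m, mut u, mut lambda) = (0.9_f64, 0.2_f64, 0.5_f64);
    let n = agreements.len() as f64;
    for _ in 0..iters {
        // E-step: posterior probability that each pair is a match.
        let posteriors: Vec<f64> = agreements
            .iter()
            .map(|&agree| {
                let pm = lambda * if agree { m } else { 1.0 - m };
                let pu = (1.0 - lambda) * if agree { u } else { 1.0 - u };
                pm / (pm + pu)
            })
            .collect();
        // M-step: update parameters from the expected counts.
        let total: f64 = posteriors.iter().sum();
        let agree_match: f64 = posteriors
            .iter()
            .zip(agreements.iter())
            .filter(|(_, a)| **a)
            .map(|(p, _)| *p)
            .sum();
        let agree_nonmatch: f64 = posteriors
            .iter()
            .zip(agreements.iter())
            .filter(|(_, a)| **a)
            .map(|(p, _)| 1.0 - *p)
            .sum();
        m = agree_match / total;           // P(agree | match)
        u = agree_nonmatch / (n - total);  // P(agree | non-match)
        lambda = total / n;                // prior match probability
    }
    (m, u, lambda)
}

fn main() {
    // Whether each candidate pair agreed on the comparison.
    let agreements = [true, true, true, false, false, false];
    let (m, u, lambda) = em(&agreements, 20);
    println!("m = {m:.3}, u = {u:.3}, lambda = {lambda:.3}");
}
```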
### Prediction

```rust
// Threshold and mode arguments are illustrative.

// Score all candidate pairs (no threshold)
let predictions = linker.predict(None)?.collect()?;

// Score with a minimum match-weight threshold
let predictions = linker.predict(Some(-5.0))?.collect()?;

// Use the direct scorer (or PredictMode::Auto) for smaller candidate sets
let predictions = linker
    .predict_with_mode(PredictMode::Direct, None)?
    .collect()?;
```

The resulting DataFrame includes `match_weight`, `match_probability`, and individual `bf_*` columns for each comparison.
### Clustering

```rust
// The 0.95 match-probability threshold is illustrative.

// Group linked records into clusters using a match-probability threshold
let clusters = linker.cluster_pairwise_predictions(&predictions, 0.95)?;
// Returns a DataFrame with columns: [unique_id, cluster_id]
```
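The connected-components step can be pictured with a minimal union-find, the structure named in the feature list. This is a standalone sketch, not the weldrs implementation:

```rust
// Union-find (disjoint-set) with path compression: records start in
// their own sets, and each above-threshold pair merges two sets.
struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p == x {
            return x;
        }
        let root = self.find(p);
        self.parent[x] = root; // path compression
        root
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[rb] = ra;
        }
    }
}

fn main() {
    // Pairs that scored above the threshold, as (record_id, record_id).
    let edges = [(0, 1), (1, 2), (3, 4)];
    let mut uf = UnionFind::new(5);
    for &(a, b) in &edges {
        uf.union(a, b);
    }
    // Records 0, 1, 2 share one cluster; 3 and 4 share another.
    let cluster_ids: Vec<usize> = (0..5).map(|i| uf.find(i)).collect();
    println!("{cluster_ids:?}"); // prints [0, 0, 0, 3, 3]
}
```

Note that clustering is transitive: 0 and 2 land in the same cluster even if that pair was never scored directly.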
### Explaining predictions

Waterfall charts show exactly how each comparison contributed to a pair's score:

```rust
// Row indices and step fields are illustrative.

// Explain a single pair (by row index in the predictions DataFrame)
let chart = linker.explain_pair(&predictions, 0)?;
for step in &chart.steps {
    println!("{step:?}");
}

// Explain multiple pairs at once
let charts = linker.explain_pairs(&predictions, &[0, 1, 2])?;
```
### Inspecting the model

```rust
// Summary fields are illustrative.
let summary = linker.model_summary();
println!("{summary:?}");
for comp in &summary.comparisons {
    println!("{comp:?}");
}
```
### Saving and loading trained models

```rust
use std::fs;

// The file path is illustrative.

// Save trained model to JSON
let json = linker.save_settings_json()?;
fs::write("model.json", &json)?;

// Load a previously trained model
let json = fs::read_to_string("model.json")?;
let restored_linker = load_settings_json(&json)?;

// Use the restored linker for prediction — no retraining needed
let predictions = restored_linker.predict(None)?.collect()?;
```
## Examples

Run any example with `cargo run --example <name>`:

| Example | Description |
|---|---|
| `cargo run --example basic_dedup` | Full pipeline tutorial with a 10-row dataset |
| `cargo run --example fuzzy_matching` | Compare exact-only vs. fuzzy comparisons on 1K rows |
| `cargo run --example explain_predictions` | Waterfall explanation of top/bottom scoring pairs |
| `cargo run --example model_parameters` | Inspect trained m/u/BF parameters |
| `cargo run --example save_and_load` | Serialize and restore a trained model |
| `cargo run --example scaling --release` | Performance benchmark (default 100K rows) |
## License

MIT