zer-pipeline
End-to-end entity resolution pipeline orchestration for the zer library.
Ties together ingestion, blocking, comparison, scoring, and clustering into a single Pipeline type. Supports deduplication (single source), linkage (two or more sources), and combined link-and-dedupe modes, with optional Tokio-based async progress events.
- Documentation: docs.zal-analytics.ch
- Website: www.zal-analytics.ch
- Support & feedback: info@zal-analytics.ch
What it provides
| Item | Description |
|---|---|
Pipeline / PipelineBuilder |
Fluent builder and executor for the full ER pipeline |
Ingester / IngestResult |
Loads records from CSV or any IntoRecord source |
PipelineConfig / LinkMode |
Controls link vs. dedupe mode, batch size, score threshold |
PipelineEvent |
Async progress events (blocking done, comparison done, etc.) |
ClusterView / ClusterIter |
Iterates over resolved entity clusters |
BatchReport |
Per-batch statistics (pairs generated, matched, time) |
Feature flags
| Flag | Effect |
|---|---|
collect-pairs |
Keeps all scored pairs in memory after judging for PR-AUC analysis; incurs allocation cost proportional to candidate count |
Usage
use ;
let pipeline = new
.mode
.config
.build;
let report = pipeline.run.await?;
Breaking changes
v1.1
LinkedPair, record_id_a/b replaced by record_key_a/b
LinkedPair no longer exposes raw numeric record IDs. The fields record_id_a: RecordId and record_id_b: RecordId are replaced by record_key_a: String and record_key_b: String, which hold the natural key values (e.g. BSN, KvK number) as supplied via DatasetConfig at ingestion time.
// v1.0
println!;
// v1.1
println!;
Evaluation code that built HashSet<(u64, u64)> from ground-truth integer IDs must be updated to HashSet<(String, String)> using natural key pairs.
License
Apache-2.0 · GitHub