zer-pipeline

End-to-end entity resolution pipeline orchestration for the zer library.

Ties together ingestion, blocking, comparison, scoring, and clustering into a single Pipeline type. Supports deduplication (single source), linkage (two or more sources), and combined link-and-dedupe modes, with optional Tokio-based async progress events.

Documentation: docs.zal-analytics.ch
Website: www.zal-analytics.ch
Support & feedback: info@zal-analytics.ch

What it provides

Item	Description
`Pipeline` / `PipelineBuilder`	Fluent builder and executor for the full ER pipeline
`Ingester` / `IngestResult`	Loads records from CSV or any `IntoRecord` source
`PipelineConfig` / `LinkMode`	Controls link vs. dedupe mode, batch size, score threshold
`PipelineEvent`	Async progress events (blocking done, comparison done, etc.)
`ClusterView` / `ClusterIter`	Iterates over resolved entity clusters
`BatchReport`	Per-batch statistics (pairs generated, matched, time)

Feature flags

Flag	Effect
`collect-pairs`	Keeps all scored pairs in memory after judging for PR-AUC analysis; incurs allocation cost proportional to candidate count

Usage

use zer::pipeline::{Pipeline, PipelineBuilder, PipelineConfig, LinkMode};

let pipeline = PipelineBuilder::new(schema)
    .mode(LinkMode::Dedupe)
    .config(PipelineConfig::default())
    .build(blocker, comparator, scorer, clusterer);

let report = pipeline.run(ingester).await?;

Breaking changes

v1.1

LinkedPair, record_id_a/b replaced by record_key_a/b

LinkedPair no longer exposes raw numeric record IDs. The fields record_id_a: RecordId and record_id_b: RecordId are replaced by record_key_a: String and record_key_b: String, which hold the natural key values (e.g. BSN, KvK number) as supplied via DatasetConfig at ingestion time.

// v1.0
println!("{} ↔ {}", pair.record_id_a, pair.record_id_b);

// v1.1
println!("{} ↔ {}", pair.record_key_a, pair.record_key_b);

Evaluation code that built HashSet<(u64, u64)> from ground-truth integer IDs must be updated to HashSet<(String, String)> using natural key pairs.

License

Apache-2.0 · GitHub