zer-bench
Benchmark harness for zer. It measures throughput and accuracy, and runs head-to-head comparisons against competitor libraries on Dutch administrative datasets.
- Documentation: docs.zal-analytics.ch
- Website: www.zal-analytics.ch
- Support & feedback: info@zal-analytics.ch
Install
For GPU-accelerated benchmarks, pass the matching feature flag(s) at install time:
To install with every compute backend and judge provider enabled:
Once installed, --target and --judge-target switch backends at runtime; no rebuild is needed.
Datasets and models
zer-bench resolves dataset and model paths from environment variables. Download the benchmark datasets from HuggingFace and point ZER_DATASET_DIR at the local copy:
Neural judge benchmarks (--judge-target) also need model files:
# ZER_MODEL_DIR defaults to ~/.cache/zer/models; override if needed
External library benchmarks
Head-to-head comparisons with competitor libraries (--compare-libs) require the external benchmark scripts. Clone the repository and point ZER_EXTERNAL_BENCHMARKS_DIR at it:
Or pass the path directly at runtime:
Environment variables
| Variable | Default | Description |
|---|---|---|
| `ZER_DATASET_DIR` | `<workspace>/data` | Root directory for benchmark datasets. Paths resolve as `$ZER_DATASET_DIR/benchmarks/<scenario>/...`. Falls back to `<workspace>/data` when unset. |
| `ZER_MODEL_DIR` | `~/.cache/zer/models` | Directory containing neural judge ONNX model files. Mirrors the layout from arsalan-anwari/zjudge on HuggingFace. |
| `ZER_EXTERNAL_BENCHMARKS_DIR` | `<workspace>/benchmarks` | Root directory for external library benchmark scripts. Scripts resolve as `$ZER_EXTERNAL_BENCHMARKS_DIR/<library>/<mode>/run.py`. Can also be passed as `--external-benchmarks-dir`. |
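The resolution rules in the table can be sketched in a few lines of Python. This is only an illustration of the documented fallbacks, not zer-bench's actual code: the function names, the `workspace` argument, and the choice to treat an empty variable as unset are all assumptions, as are the `splink` and `dedupe` directory names in the usage comment.

```python
import os
from pathlib import Path


def resolve_dir(var, fallback, env=None):
    """Return the directory named by `var`, or `fallback` when it is unset.

    Assumption: an empty value is treated the same as an unset one.
    """
    env = os.environ if env is None else env
    value = env.get(var)
    return Path(value).expanduser() if value else Path(fallback)


def dataset_path(workspace, scenario, env=None):
    # Documented layout: $ZER_DATASET_DIR/benchmarks/<scenario>/...
    root = resolve_dir("ZER_DATASET_DIR", Path(workspace) / "data", env)
    return root / "benchmarks" / scenario


def external_script(workspace, library, mode, env=None):
    # Documented layout: $ZER_EXTERNAL_BENCHMARKS_DIR/<library>/<mode>/run.py
    root = resolve_dir(
        "ZER_EXTERNAL_BENCHMARKS_DIR", Path(workspace) / "benchmarks", env
    )
    return root / library / mode / "run.py"


def model_dir(env=None):
    # Documented default: ~/.cache/zer/models
    return resolve_dir("ZER_MODEL_DIR", Path("~/.cache/zer/models").expanduser(), env)


if __name__ == "__main__":
    # Hypothetical workspace root, library directory, and mode directory.
    print(dataset_path("/ws", "brp/dedupe", env={}))
    print(external_script("/ws", "splink", "dedupe", env={}))
```

Passing `env` explicitly keeps the helpers easy to check without touching the real process environment.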
Subcommands
| Subcommand | Description |
|---|---|
| `accuracy` | Precision, recall, F1, and PR-AUC against labelled ground truth |
| `throughput` | Raw compare/EM/score throughput on a single dataset |
| `compare` | Read multiple CSV summaries and print a side-by-side comparison table |
| `plot` | Generate plots from benchmark JSON files via `plot_results.py` |
Competitor libraries are run inline via --compare-libs on both accuracy and throughput; no separate subcommand is needed.
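For readers less familiar with the accuracy metrics, here is a minimal sketch of precision, recall, and F1 over predicted versus ground-truth record pairs, using the textbook definitions. It is not taken from zer-bench: the function names are hypothetical, pairs are treated as unordered (which suits deduplication and may not match how link modes identify pairs), and returning 0.0 for empty denominators is an assumption.

```python
def normalise(pairs):
    """Treat (a, b) and (b, a) as the same pair (assumption: unordered pairs)."""
    return {tuple(sorted(p)) for p in pairs}


def pair_metrics(predicted, truth):
    """Return (precision, recall, f1) for predicted vs. ground-truth pairs."""
    predicted, truth = normalise(predicted), normalise(truth)
    tp = len(predicted & truth)   # predicted and correct
    fp = len(predicted - truth)   # predicted but wrong
    fn = len(truth - predicted)   # true matches that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up record IDs, for illustration only.
    predicted = [("a", "b"), ("c", "d"), ("f", "e")]
    truth = [("b", "a"), ("e", "f"), ("g", "h")]
    print(pair_metrics(predicted, truth))
```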
Quick start
List available scenarios
Accuracy
# Single scenario: datasets, mode, and ground truth wired up automatically
# All 8 full-size scenarios back-to-back
# zer vs splink (runs both, prints inline comparison table)
# zer vs splink across all scenarios
Judge dual-pass
When --judge-target is supplied, zer-bench automatically runs zer without the judge, then zer with the judge, and prints a side-by-side comparison table. No extra flags are needed.
# CPU judge dual-pass (no extra feature flag needed)
# TensorRT judge (requires --features judge_tensorrt at build time)
# TensorRT judge vs splink baseline: 3 results per scenario (zer, zer+judge, splink)
# All 8 scenarios × (zer + zer+judge + splink): 24 runs total, one table per scenario
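The run count quoted above follows from crossing the 8 full-size accuracy scenarios with the three configurations in a judge dual-pass plus splink comparison. The sketch below only enumerates that matrix to make the arithmetic explicit. The configuration labels are illustrative, and the order zer-bench actually uses is not specified here.

```python
from itertools import product

# The 8 full-size accuracy scenarios listed under "Available scenarios".
SCENARIOS = [
    "brp/dedupe",
    "brp/link",
    "brp/link_and_dedupe",
    "brp_kvk/link",
    "brp_sis/link",
    "brp_hks/link",
    "brp_kvk_hks/link_and_dedupe",
    "kvk/dedupe",
]

# Illustrative labels for the three results reported per scenario.
CONFIGS = ["zer", "zer+judge", "splink"]

runs = list(product(SCENARIOS, CONFIGS))

if __name__ == "__main__":
    print(len(runs))  # 8 scenarios x 3 configurations = 24
```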
Throughput
# CPU throughput (dedupe scenarios only)
# CUDA throughput (requires --features cuda)
# All dedupe scenarios back-to-back (brp/dedupe and kvk/dedupe)
# zer vs splink throughput
# CUDA throughput + TensorRT judge dual-pass
Comparing existing results
# Print a table from all summary CSVs in a directory
# Filter by mode and dataset
Plotting
Helper script
scripts/run_benchmark.sh is a thin driver that selects the correct Cargo features based on --target and --judge-target (backends must be compiled in), generates a timestamped output directory, then forwards everything to zer-bench:
# Equivalent to the direct zer-bench calls above
Use the script during development. For a pre-built all-features binary, call zer-bench directly.
Feature flags
Compute backends
| Flag | `--target` value | Description |
|---|---|---|
| (none) | `cpu` or `auto` | Scalar CPU fallback, always available |
| `avx2` | `avx2` | x86-64 AVX2 SIMD (~4× vs scalar CPU) |
| `cuda` | `cuda` | NVIDIA CUDA, requires CUDA Toolkit 13.1+ |
| `vulkan` | `vulkan` | Vulkan 1.3 compute |
Neural judge execution providers
| Feature flag | `--judge-target` value | Description |
|---|---|---|
| (none) | `cpu` | CPU ONNX Runtime, always available |
| `judge_cuda` | `cuda` | NVIDIA CUDA ORT execution provider |
| `judge_tensorrt` | `tensorrt` | TensorRT FP16 with engine caching, requires TensorRT 8.0+ |
| `judge_rocm` | `rocm` | AMD ROCm ORT execution provider |
| `judge_directml` | `directml` | Windows DirectML ORT execution provider |
| `judge_openvino` | `openvino` | Intel OpenVINO ORT execution provider |
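The compute-backend and judge-provider tables above can be read as two lookup tables from runtime values to build-time features. The Python sketch below encodes exactly those rows to show which Cargo features a given --target / --judge-target combination needs. It is a hypothetical illustration of the mapping, not the logic of scripts/run_benchmark.sh. In particular, mapping `auto` to "no extra feature" simply mirrors the first row of the compute table.

```python
# Runtime value -> required Cargo feature (None means always available).
TARGET_FEATURES = {
    "cpu": None,
    "auto": None,
    "avx2": "avx2",
    "cuda": "cuda",
    "vulkan": "vulkan",
}

JUDGE_FEATURES = {
    "cpu": None,
    "cuda": "judge_cuda",
    "tensorrt": "judge_tensorrt",
    "rocm": "judge_rocm",
    "directml": "judge_directml",
    "openvino": "judge_openvino",
}


def cargo_features(target="cpu", judge_target=None):
    """Return the feature names needed for the given runtime targets."""
    features = []
    for table, key in ((TARGET_FEATURES, target), (JUDGE_FEATURES, judge_target)):
        if key is None:
            continue  # no judge requested
        if key not in table:
            raise ValueError(f"unknown target value: {key!r}")
        if table[key] is not None:
            features.append(table[key])
    return features


if __name__ == "__main__":
    print(",".join(cargo_features("cuda", "tensorrt")))
```

The joined string is the kind of value one would pass to a `--features` option at build time.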
Diagnostics
| Flag | Description |
|---|---|
| `progress` | Print pipeline stage progress during accuracy runs |
| `perf-metrics` | Print per-phase timing metrics (`blocking_ms`, `compare_ms`, etc.) |
| `collect-pairs` | Collect all scored pairs after judging for unbiased PR-AUC (on by default) |
| `nvtx` | Map tracing spans to Nsight Systems ranges (profiling only) |
Available scenarios
Accuracy (zer-bench accuracy --list-scenarios)
| Scenario | Mode | Description |
|---|---|---|
| `brp/dedupe` | deduplicate | BRP person deduplication |
| `brp/link` | link-only | BRP → external source linkage |
| `brp/link_and_dedupe` | link-and-dedupe | BRP simultaneous dedup + link |
| `brp_kvk/link` | link-only | BRP × KVK cross-schema linkage |
| `brp_sis/link` | link-only | BRP × SIS cross-schema linkage |
| `brp_hks/link` | link-only | BRP × HKS cross-schema linkage |
| `brp_kvk_hks/link_and_dedupe` | link-and-dedupe | BRP × KVK × HKS multi-source |
| `kvk/dedupe` | deduplicate | KVK business-register deduplication |
--scenario all runs all 8. Micro/smoke-test variants are also listed by --list-scenarios.
Throughput (zer-bench throughput --list-scenarios)
Throughput only supports dedupe scenarios (brp/dedupe, kvk/dedupe).
--scenario all runs both back-to-back.
Output format
Every accuracy and throughput run writes to --out:
| File | Description |
|---|---|
| `<run_id>_summary.csv` | Single-row CSV consumed by `zer-bench compare` |
| `<run_id>_benchmark.json` | Full metadata: metrics, timings, memory snapshots |
| `<run_id>_scored_pairs.csv` | (score, is_match) pairs for PR curve plotting (accuracy only) |
Use zer-bench compare --results <dir> to aggregate rows from multiple runs into a formatted table.
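The scored-pairs file lends itself to quick offline checks. As a minimal sketch, the code below reads (score, is_match) rows and computes average precision, a common step-wise estimate of PR-AUC. Several things are assumptions: the header row `score,is_match`, the encoding of `is_match` as `1`/`0` or `true`/`false`, and the estimator itself, since zer-bench may integrate the curve differently. The sample data is made up.

```python
import csv
import io


def read_scored_pairs(handle):
    """Parse rows into (score, is_match) tuples.

    Assumptions: a `score,is_match` header, and is_match written as
    1/0 or true/false.
    """
    reader = csv.DictReader(handle)
    return [
        (float(row["score"]), row["is_match"].strip().lower() in {"1", "true"})
        for row in reader
    ]


def average_precision(pairs):
    """Step-wise PR-AUC estimate: mean precision at each true match's rank."""
    ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
    total_matches = sum(1 for _, is_match in ranked if is_match)
    if total_matches == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / total_matches


if __name__ == "__main__":
    # Made-up sample standing in for a <run_id>_scored_pairs.csv file.
    sample = io.StringIO("score,is_match\n0.9,1\n0.8,0\n0.7,1\n0.1,0\n")
    print(average_precision(read_scored_pairs(sample)))
```

Tied scores are ranked in file order here; a stricter implementation would group ties into a single threshold.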
License
Apache-2.0 · GitHub