zer-bench
Benchmark harness for zer. It measures throughput and accuracy, and runs head-to-head comparisons against competitor libraries on Dutch administrative datasets.
- Documentation: docs.zal-analytics.ch
- Website: www.zal-analytics.ch
- Support & feedback: info@zal-analytics.ch
Install
For GPU-accelerated benchmarks pass a feature flag at install time:
Datasets and models
zer-bench resolves dataset and model paths from environment variables. Download the benchmark datasets from HuggingFace and point ZER_DATASET_DIR at the local copy:
Neural judge benchmarks (--judge) also need model files:
# ZER_MODEL_DIR defaults to ~/.cache/zer/models; override if needed
Environment variables
| Variable | Default | Description |
|---|---|---|
| `ZER_DATASET_DIR` | `<workspace>/data` | Root directory for benchmark datasets downloaded from HuggingFace. Dataset paths are resolved as `$ZER_DATASET_DIR/benchmarks/<scenario>/...`. When unset, falls back to `<workspace>/data` (repo clone layout). |
| `ZER_MODEL_DIR` | `~/.cache/zer/models` | Directory containing neural judge ONNX model files. Mirrors the layout from arsalan-anwari/zjudge on HuggingFace. |
| `ZER_EXTERNAL_BENCHMARKS_DIR` | `<workspace>/benchmarks` | Root directory containing external library benchmark scripts. Scripts are resolved as `$ZER_EXTERNAL_BENCHMARKS_DIR/<library>/<mode>/run.py` (or `run.R`). Set this when running `zer-bench library` outside of a zer repository clone. Can also be passed as `--external-benchmarks-dir`. |
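The resolution rules in the table can be sketched in a few lines of Python. This is only an illustration of the documented fallbacks, not zer-bench's own code: the function names are hypothetical, the library directory name `splink` is a guess, and letting `--external-benchmarks-dir` take precedence over the environment variable is an assumption, since the table does not state an order.

```python
import os
from pathlib import Path


def _env(env):
    # Allow tests to pass a plain dict instead of the real environment.
    return os.environ if env is None else env


def dataset_dir(workspace, env=None):
    # ZER_DATASET_DIR, falling back to <workspace>/data when unset.
    value = _env(env).get("ZER_DATASET_DIR")
    return Path(value) if value else Path(workspace) / "data"


def scenario_dir(workspace, scenario, env=None):
    # Datasets live under <dataset dir>/benchmarks/<scenario>/...
    return dataset_dir(workspace, env) / "benchmarks" / scenario


def model_dir(env=None):
    # ZER_MODEL_DIR, falling back to ~/.cache/zer/models when unset.
    value = _env(env).get("ZER_MODEL_DIR")
    return Path(value) if value else Path.home() / ".cache" / "zer" / "models"


def external_script_candidates(workspace, library, mode, env=None, cli_dir=None):
    # Assumption: a value passed on the command line wins over the env var.
    root = cli_dir or _env(env).get("ZER_EXTERNAL_BENCHMARKS_DIR")
    base = (Path(root) if root else Path(workspace) / "benchmarks") / library / mode
    # The table names run.py and run.R; which one is preferred is not stated,
    # so both candidates are returned.
    return [base / "run.py", base / "run.R"]


if __name__ == "__main__":
    print(scenario_dir("/repo", "brp_persons", env={}))
    print(external_script_candidates("/repo", "splink", "deduplicate", env={}))
```

Passing `env={}` simulates a shell with none of the variables set, which makes the fallback paths easy to inspect.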
Subcommands
| Subcommand | Description |
|---|---|
| `throughput` | Raw compare/EM/score throughput on a single dataset |
| `accuracy` | Precision, recall, F1, and PR-AUC against labelled ground truth |
| `library` | Run a competitor library script and capture its summary CSV |
| `library-all` | Run all configured competitor libraries for a given mode and dataset |
| `compare` | Read multiple CSV summaries and print a side-by-side comparison table |
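For readers unfamiliar with the metrics the `accuracy` subcommand reports, the sketch below computes precision, recall, and F1 from a set of predicted match pairs and a set of ground-truth pairs using the standard definitions. It is not zer-bench's implementation: the function name and the pair representation (unordered record-ID pairs) are assumptions, and PR-AUC is left out because it needs scored pairs rather than a hard match decision.

```python
def pair_metrics(predicted, truth):
    """Return (precision, recall, f1) for predicted vs. ground-truth pairs."""
    # Treat (a, b) and (b, a) as the same pair.
    predicted = {frozenset(p) for p in predicted}
    truth = {frozenset(p) for p in truth}

    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    predicted = [(1, 2), (3, 4), (5, 6)]
    truth = [(2, 1), (3, 4), (7, 8), (9, 10)]
    print(pair_metrics(predicted, truth))
```

In the usage example, two of the three predicted pairs are correct and two of the four true pairs are found, so precision is 2/3 and recall is 1/2.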
Quick start
# List available scenarios
# Accuracy; results written to bench_results/ by default
# Accuracy with neural judge (replace cuda with tensorrt / rocm / directml / openvino)
# Throughput, CPU; switch --target to avx2 for SIMD, or to cuda / vulkan for GPU backends
# Run an external library (Splink) on the same scenario
# Same, but scripts live outside a zer repo clone
# Side-by-side comparison table
Feature flags
Compute backends
| Flag | Description |
|---|---|
| `cpu` | Scalar CPU fallback (always available without this flag too) |
| `avx2` | x86_64 AVX2 SIMD (~4× throughput vs scalar CPU) |
| `cuda` | NVIDIA CUDA (SM 8.6+), requires CUDA Toolkit 13.1+ |
| `vulkan` | Vulkan 1.3 compute, requires Vulkan 1.3 driver |
Neural judge execution providers
Install with the feature flag that matches the hardware you want to use, then pass
--judge-target <value> at runtime to select the execution provider:
| Feature flag | `--judge-target` value | Description |
|---|---|---|
| (none) | `cpu` | CPU ONNX Runtime (default when `--judge-target` is omitted) |
| `judge_cuda` | `cuda` | NVIDIA CUDA ORT execution provider |
| `judge_tensorrt` | `tensorrt` | TensorRT FP16 with engine caching, requires TensorRT 8.0+ |
| `judge_rocm` | `rocm` | AMD ROCm ORT execution provider |
| `judge_directml` | `directml` | Windows DirectML ORT execution provider |
| `judge_openvino` | `openvino` | Intel OpenVINO ORT execution provider |
Example:
# Install with TensorRT support
# Run accuracy with TensorRT judge (--judge-target enables the judge automatically)
Diagnostics
| Flag | Description |
|---|---|
| `progress` | Print pipeline stage progress during accuracy runs |
| `perf-metrics` | Print per-phase timing metrics (`blocking_ms`, `compare_ms`, etc.) |
| `collect-pairs` | Collect all scored pairs after judging for unbiased PR-AUC |
| `nvtx` | Map tracing spans to Nsight Systems ranges (profiling only) |
Output format
Every accuracy, throughput, and library run appends a row to a CSV summary in the --out directory:
library,dataset,mode,precision,recall,f1,pr_auc,elapsed_ms,peak_memory_mb
zer,brp_persons,deduplicate,0.984,0.982,0.983,0.991,3653,163
Use zer-bench compare to aggregate rows from multiple runs into a formatted side-by-side table.
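If you want to post-process the summaries yourself, the CSV format is easy to read with Python's standard library. The sketch below parses the sample row shown above and prints a small table; it is not a reimplementation of `zer-bench compare`. The function names and the chosen columns are hypothetical, and treating empty metric cells as missing values is an assumption about rows that carry no accuracy figures.

```python
import csv
import io

# The header and row from the Output format section above.
SAMPLE = """library,dataset,mode,precision,recall,f1,pr_auc,elapsed_ms,peak_memory_mb
zer,brp_persons,deduplicate,0.984,0.982,0.983,0.991,3653,163
"""

FLOAT_COLS = ("precision", "recall", "f1", "pr_auc")
INT_COLS = ("elapsed_ms", "peak_memory_mb")


def load_rows(text):
    """Parse summary CSV text into dicts with numeric columns converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = dict(raw)
        for col in FLOAT_COLS:
            # Assumption: an empty cell means the metric was not measured.
            row[col] = float(row[col]) if row[col] else None
        for col in INT_COLS:
            row[col] = int(row[col]) if row[col] else None
        rows.append(row)
    return rows


def format_table(rows, columns=("library", "dataset", "f1", "pr_auc", "elapsed_ms")):
    """Render the selected columns as pipe-separated lines, one per run."""
    lines = [" | ".join(columns)]
    for row in rows:
        lines.append(" | ".join(str(row[col]) for col in columns))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(load_rows(SAMPLE)))
```

To compare several runs, concatenate the data rows from each summary under a single header before calling `load_rows`.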
License
Apache-2.0 · GitHub