Benchmark harness for evaluating Zeph agent performance on standardized datasets.
zeph-bench implements the CLI subcommand zeph bench and provides the building blocks
for running reproducible evaluations against LOCOMO, FRAMES, GAIA, and other datasets.
Architecture
The harness is built around three composable traits:
- [DatasetLoader]: reads a dataset file and returns a [Vec<Scenario>].
- [Evaluator]: scores one agent response against a [Scenario].
- [zeph_core::channel::Channel]: implemented by [BenchmarkChannel] to drive the agent loop headlessly (no terminal, no network).
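To make the loader/evaluator split concrete, here is a minimal, self-contained sketch of what such a pair of traits could look like. Everything in it is an assumption for illustration: the `Scenario` and `EvalResult` fields, the trait signatures, the toy `TsvLoader`, and the `ExactMatch` evaluator are hypothetical stand-ins and are not taken from the zeph-bench source.

```rust
use std::path::Path;

/// Hypothetical stand-in for the crate's `Scenario`; the real fields may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Scenario {
    pub id: String,
    pub prompt: String,
    pub expected: String,
}

/// Hypothetical result shape with a numeric score and a pass/fail flag.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalResult {
    pub score: f64,
    pub passed: bool,
}

/// Assumed shape of a loader: turn a file into a list of scenarios.
pub trait DatasetLoader {
    fn load(&self, path: &Path) -> std::io::Result<Vec<Scenario>>;
}

/// Assumed shape of an evaluator: score one response against one scenario.
pub trait Evaluator {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> EvalResult;
}

/// Parses one scenario per line in the form `prompt<TAB>expected`.
/// Lines without a tab are skipped.
pub fn parse_tsv(text: &str) -> Vec<Scenario> {
    text.lines()
        .filter_map(|line| line.split_once('\t'))
        .enumerate()
        .map(|(i, (prompt, expected))| Scenario {
            id: format!("s{i}"),
            prompt: prompt.to_string(),
            expected: expected.to_string(),
        })
        .collect()
}

/// Toy loader (illustrative only) that reads the TSV format above from disk.
pub struct TsvLoader;

impl DatasetLoader for TsvLoader {
    fn load(&self, path: &Path) -> std::io::Result<Vec<Scenario>> {
        let text = std::fs::read_to_string(path)?;
        Ok(parse_tsv(&text))
    }
}

/// Toy evaluator: case-insensitive exact match after trimming whitespace.
pub struct ExactMatch;

impl Evaluator for ExactMatch {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> EvalResult {
        let passed = response
            .trim()
            .eq_ignore_ascii_case(scenario.expected.trim());
        EvalResult {
            score: if passed { 1.0 } else { 0.0 },
            passed,
        }
    }
}
```

Keeping loading and scoring behind separate traits means a new dataset format or a new scoring rule can be added without touching the other side.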
Results are accumulated into a [BenchRun] and persisted by [ResultWriter], which writes
both results.json (machine-readable) and summary.md (human-readable) to the output
directory. Runs can be interrupted and resumed via the --resume flag.
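The text above does not say how resuming decides what is left to run. One plausible approach, sketched here purely as an assumption, is to key recorded results by scenario id and skip ids that are already present. `RunState` and its methods are hypothetical names, not the crate's [BenchRun] API.

```rust
use std::collections::BTreeMap;

/// Hypothetical accumulator: scenario id -> score recorded so far.
#[derive(Debug, Default)]
pub struct RunState {
    pub scores: BTreeMap<String, f64>,
}

impl RunState {
    /// Records (or overwrites) the score for one scenario.
    pub fn record(&mut self, scenario_id: &str, score: f64) {
        self.scores.insert(scenario_id.to_string(), score);
    }

    /// Returns the ids that still need to run, preserving input order.
    pub fn pending<'a>(&self, all_ids: &[&'a str]) -> Vec<&'a str> {
        all_ids
            .iter()
            .copied()
            .filter(|id| !self.scores.contains_key(*id))
            .collect()
    }

    /// Mean score over completed scenarios, or `None` if nothing has run yet.
    pub fn mean_score(&self) -> Option<f64> {
        if self.scores.is_empty() {
            return None;
        }
        Some(self.scores.values().sum::<f64>() / self.scores.len() as f64)
    }
}
```

With a structure like this, an interrupted run would reload the saved scores, call `pending` with the full scenario list, and continue with only the remaining ids.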
Quick Start
use std::path::Path;
use zeph_bench::{DatasetRegistry, loaders::{LocomoLoader, LocomoEvaluator}};
use zeph_bench::scenario::{DatasetLoader, Evaluator};
// 1. Discover available datasets.
let registry = DatasetRegistry::new();
let meta = registry.get("locomo").expect("locomo is built-in");
println!("dataset url: {}", meta.url);
// 2. Load scenarios from a locally cached file.
let scenarios = LocomoLoader.load(Path::new("/data/locomo.json")).unwrap();
// 3. Evaluate a response.
let result = LocomoEvaluator.evaluate(&scenarios[0], "some agent response");
println!("score={:.4} passed={}", result.score, result.passed);
Deterministic Runs
By default the harness forces temperature=0.0 on the configured provider so that runs are
reproducible. Pass --no-deterministic on the CLI or call [apply_deterministic_overrides]
with no_deterministic = true to disable this behaviour.
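The override logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `ProviderSettings` and `force_deterministic` are hypothetical, and the real [apply_deterministic_overrides] may take different arguments and touch other settings.

```rust
/// Hypothetical provider settings; the real configuration type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderSettings {
    pub model: String,
    pub temperature: Option<f32>,
}

/// Forces temperature to 0.0 unless the caller opted out.
/// Returns `true` when the override was applied.
pub fn force_deterministic(settings: &mut ProviderSettings, no_deterministic: bool) -> bool {
    if no_deterministic {
        // Opt-out: leave the configured temperature untouched.
        return false;
    }
    settings.temperature = Some(0.0);
    true
}
```

Returning whether the override was applied lets a caller log the effective setting alongside the results, which helps when comparing runs later.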
Modules
| Module | Purpose |
|---|---|
| [baseline] | Baseline comparison types and delta computation |
| [channel] | Headless [BenchmarkChannel] that drives the agent without I/O |
| [cli] | Clap subcommand definition ([BenchCommand]) |
| [dataset] | Dataset registry and metadata types |
| [deterministic] | Temperature-zero override helpers |
| [error] | [BenchError] error type |
| [isolation] | Per-scenario storage isolation ([BenchIsolation]) |
| [loaders] | Concrete loaders for LOCOMO, FRAMES, GAIA, LongMemEval, and tau-bench |
| [results] | Result types and [ResultWriter] |
| [scenario] | Core traits ([DatasetLoader], [Evaluator]) and scoring helpers |