zeph-bench 0.19.1

Benchmark harness for evaluating Zeph agent performance on standardized datasets.

zeph-bench implements the CLI subcommand zeph bench and provides the building blocks for running reproducible evaluations against LOCOMO, FRAMES, GAIA, and other datasets.

Architecture

The harness is built around three composable traits:

  • [DatasetLoader] — reads a dataset file and returns a [Vec<Scenario>].
  • [Evaluator] — scores one agent response against a [Scenario].
  • [zeph_core::channel::Channel] — implemented by [BenchmarkChannel] to drive the agent loop headlessly (no terminal, no network).
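The loader/evaluator split above can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the `Scenario` fields, the `EvalResult` type (modelled only on the `score` and `passed` fields used in the Quick Start), the `String` error type, and the toy `InMemoryLoader` and `ExactMatch` implementations are hypothetical stand-ins, not the crate's actual definitions.

```rust
use std::path::Path;

/// Hypothetical stand-in for a benchmark scenario (fields are assumed).
#[derive(Debug, Clone)]
pub struct Scenario {
    pub id: String,
    pub prompt: String,
    pub expected: String,
}

/// Hypothetical result shape, mirroring `score` and `passed` from the Quick Start.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalResult {
    pub score: f64,
    pub passed: bool,
}

/// Assumed trait shape: turn a dataset file into a list of scenarios.
pub trait DatasetLoader {
    fn load(&self, path: &Path) -> Result<Vec<Scenario>, String>;
}

/// Assumed trait shape: score one agent response against one scenario.
pub trait Evaluator {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> EvalResult;
}

/// Toy loader that ignores the path and returns a single in-memory scenario.
pub struct InMemoryLoader;

impl DatasetLoader for InMemoryLoader {
    fn load(&self, _path: &Path) -> Result<Vec<Scenario>, String> {
        Ok(vec![Scenario {
            id: "s1".to_string(),
            prompt: "What is the capital of France?".to_string(),
            expected: "Paris".to_string(),
        }])
    }
}

/// Toy evaluator: case-insensitive exact match after trimming whitespace.
pub struct ExactMatch;

impl Evaluator for ExactMatch {
    fn evaluate(&self, scenario: &Scenario, response: &str) -> EvalResult {
        let passed = response.trim().eq_ignore_ascii_case(scenario.expected.trim());
        EvalResult {
            score: if passed { 1.0 } else { 0.0 },
            passed,
        }
    }
}
```

Keeping loading and scoring behind separate traits means a new dataset only needs a loader and an evaluator; the code that drives the agent does not have to change.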

Results are accumulated into a [BenchRun] and persisted by [ResultWriter], which writes both results.json (machine-readable) and summary.md (human-readable) to the output directory. Runs can be interrupted and resumed via the --resume flag.
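One way to picture the accumulate-and-resume behaviour is the sketch below. It is not the crate's implementation: `RunLog`, `ScenarioResult`, `pending`, and `summary_md` are hypothetical names, and the Markdown layout is an assumed format rather than the real contents of summary.md. The idea shown is only that a resumed run skips scenario IDs that already have a recorded result.

```rust
use std::collections::HashSet;

/// Hypothetical per-scenario record (the real result types may differ).
#[derive(Debug, Clone, PartialEq)]
pub struct ScenarioResult {
    pub scenario_id: String,
    pub score: f64,
    pub passed: bool,
}

/// Hypothetical accumulator standing in for a benchmark run.
#[derive(Debug, Default)]
pub struct RunLog {
    pub results: Vec<ScenarioResult>,
}

impl RunLog {
    /// Returns the scenario IDs that have no recorded result yet,
    /// preserving the order of `all_ids`. A resumed run would execute only these.
    pub fn pending<'a>(&self, all_ids: &[&'a str]) -> Vec<&'a str> {
        let done: HashSet<&str> = self
            .results
            .iter()
            .map(|r| r.scenario_id.as_str())
            .collect();
        all_ids
            .iter()
            .copied()
            .filter(|id| !done.contains(*id))
            .collect()
    }

    /// Renders a small human-readable Markdown summary (assumed layout).
    pub fn summary_md(&self) -> String {
        let passed = self.results.iter().filter(|r| r.passed).count();
        let mut out = String::from("| scenario | score | passed |\n|---|---|---|\n");
        for r in &self.results {
            out.push_str(&format!(
                "| {} | {:.4} | {} |\n",
                r.scenario_id, r.score, r.passed
            ));
        }
        out.push_str(&format!(
            "\n{} of {} scenarios passed\n",
            passed,
            self.results.len()
        ));
        out
    }
}
```

After an interruption with only scenario "a" recorded, `pending(&["a", "b", "c"])` yields `["b", "c"]`, so work already persisted is not repeated.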

Quick Start

use std::path::Path;
use zeph_bench::{DatasetRegistry, loaders::{LocomoLoader, LocomoEvaluator}};
use zeph_bench::scenario::{DatasetLoader, Evaluator};

// 1. Discover available datasets.
let registry = DatasetRegistry::new();
let meta = registry.get("locomo").expect("locomo is built-in");
println!("dataset url: {}", meta.url);

// 2. Load scenarios from a locally cached file.
let scenarios = LocomoLoader.load(Path::new("/data/locomo.json")).unwrap();

// 3. Evaluate a response.
let result = LocomoEvaluator.evaluate(&scenarios[0], "some agent response");
println!("score={:.4} passed={}", result.score, result.passed);

Deterministic Runs

By default the harness forces temperature=0.0 on the configured provider so that runs are reproducible. Pass --no-deterministic on the CLI or call [apply_deterministic_overrides] with no_deterministic = true to disable this behaviour.
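The override logic described above might look roughly like the following sketch. The `ProviderSettings` struct and the `force_zero_temperature` function are hypothetical; the real [apply_deterministic_overrides] signature and the provider configuration type are not shown in this documentation, so only the documented behaviour (temperature forced to 0.0 unless opted out) is modelled.

```rust
/// Hypothetical provider settings; the real configuration type is not shown here.
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderSettings {
    pub model: String,
    pub temperature: Option<f32>,
}

/// Forces `temperature` to 0.0 unless the caller opted out.
/// Returns `true` when the override was applied.
pub fn force_zero_temperature(settings: &mut ProviderSettings, no_deterministic: bool) -> bool {
    if no_deterministic {
        // Opt-out: leave whatever the user configured untouched.
        return false;
    }
    settings.temperature = Some(0.0);
    true
}
```

With `no_deterministic = false`, a configured temperature of 0.7 is replaced by 0.0; with `no_deterministic = true`, the configured value is left as it was.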

Modules

| Module | Purpose |
|---|---|
| [baseline] | Baseline comparison types and delta computation |
| [channel] | Headless [BenchmarkChannel] that drives the agent without I/O |
| [cli] | Clap subcommand definition ([BenchCommand]) |
| [dataset] | Dataset registry and metadata types |
| [deterministic] | Temperature-zero override helpers |
| [error] | [BenchError] error type |
| [isolation] | Per-scenario storage isolation ([BenchIsolation]) |
| [loaders] | Concrete loaders for LOCOMO, FRAMES, GAIA, LongMemEval, and tau-bench |
| [results] | Result types and [ResultWriter] |
| [scenario] | Core traits ([DatasetLoader], [Evaluator]) and scoring helpers |