Crate zeph_bench

Expand description

Benchmark harness for evaluating Zeph agent performance on standardized datasets.

zeph-bench implements the CLI subcommand zeph bench and provides the building blocks for running reproducible evaluations against LOCOMO, FRAMES, GAIA, and other datasets.

§Architecture

The harness is built around three composable traits:

DatasetLoader — reads a dataset file and returns a Vec<Scenario>.
Evaluator — scores one agent response against a Scenario.
zeph_core::channel::Channel — implemented by BenchmarkChannel to drive the agent loop headlessly (no terminal, no network).

Results are accumulated into a BenchRun and persisted by ResultWriter, which writes both results.json (machine-readable) and summary.md (human-readable) to the output directory. Runs can be interrupted and resumed via the --resume flag.

§Quick Start

use std::path::Path;
use zeph_bench::{DatasetRegistry, loaders::{LocomoLoader, LocomoEvaluator}};
use zeph_bench::scenario::{DatasetLoader, Evaluator};

// 1. Discover available datasets.
let registry = DatasetRegistry::new();
let meta = registry.get("locomo").expect("locomo is built-in");
println!("dataset url: {}", meta.url);

// 2. Load scenarios from a locally cached file.
let scenarios = LocomoLoader.load(Path::new("/data/locomo.json")).unwrap();

// 3. Evaluate a response.
let result = LocomoEvaluator.evaluate(&scenarios[0], "some agent response");
println!("score={:.4} passed={}", result.score, result.passed);

§Deterministic Runs

By default the harness forces temperature=0.0 on the configured provider so that runs are reproducible. Pass --no-deterministic on the CLI or call apply_deterministic_overrides with no_deterministic = true to disable this behaviour.

§Modules

Module	Purpose
`baseline`	Baseline comparison types and delta computation
`channel`	Headless `BenchmarkChannel` that drives the agent without I/O
`cli`	Clap subcommand definition (`BenchCommand`)
`dataset`	Dataset registry and metadata types
`deterministic`	Temperature-zero override helpers
`error`	`BenchError` error type
`isolation`	Per-scenario storage isolation (`BenchIsolation`)
`loaders`	Concrete loaders for LOCOMO, FRAMES, GAIA, LongMemEval, and tau-bench
`results`	Result types and `ResultWriter`
`runner`	`BenchRunner` that drives the agent loop over a dataset
`scenario`	Core traits (`DatasetLoader`, `Evaluator`) and scoring helpers

Re-exports§

pub use baseline::BaselineComparison;
pub use baseline::ScenarioDelta;
pub use channel::BenchmarkChannel;
pub use cli::BenchCommand;
pub use dataset::DatasetFormat;
pub use dataset::DatasetMeta;
pub use dataset::DatasetRegistry;
pub use deterministic::apply_deterministic_overrides;
pub use error::BenchError;
pub use isolation::BenchIsolation;
pub use results::Aggregate;
pub use results::BenchRun;
pub use results::ResultWriter;
pub use results::RunStatus;
pub use results::ScenarioResult;
pub use runner::BenchMemoryParams;
pub use runner::BenchRunner;
pub use runner::MemoryMode;
pub use runner::ResponseMode;
pub use runner::RunOptions;
pub use scenario::DatasetLoader;
pub use scenario::EvalResult;
pub use scenario::Evaluator;
pub use scenario::Role;
pub use scenario::Scenario;
pub use scenario::Turn;
pub use scenario::exact_match;
pub use scenario::gaia_normalized_exact_match;
pub use scenario::token_f1;

Modules§

baseline
channel: Headless zeph_core::channel::Channel implementation for benchmark runs.
cli: Clap subcommand definitions for zeph bench.
dataset
deterministic: Helpers for pinning generation parameters to reproducible values.
error
isolation
loaders: Concrete DatasetLoader and Evaluator implementations for each built-in dataset.
results: Benchmark result types and writer.
runner: Benchmark runner: drives Agent<BenchmarkChannel> over a dataset and collects results.
scenario

Crate zeph_bench

Crate zeph_bench Copy item path

§Architecture

§Quick Start

§Deterministic Runs

§Modules

Re-exports§

Modules§

Crate zeph_bench