Crate zeph_bench

Benchmark harness for evaluating Zeph agent performance on standardized datasets.

zeph-bench implements the CLI subcommand zeph bench and provides the building blocks for running reproducible evaluations against LOCOMO, FRAMES, GAIA, and other datasets.

§Architecture

The harness is built around three composable traits, among them DatasetLoader and Evaluator from the scenario module.

Results are accumulated into a BenchRun and persisted by ResultWriter, which writes both results.json (machine-readable) and summary.md (human-readable) to the output directory. Runs can be interrupted and resumed via the --resume flag.
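As a rough illustration of the accumulate-and-resume idea, the sketch below uses only the standard library. RunLog, ScenarioRecord, and pending are hypothetical stand-ins invented for this example; they are not the crate's BenchRun, ScenarioResult, or its actual --resume logic, whose fields and behaviour may differ.

```rust
use std::collections::HashSet;

/// Hypothetical per-scenario record; the real `ScenarioResult` may carry other fields.
#[derive(Debug, Clone, PartialEq)]
pub struct ScenarioRecord {
    pub scenario_id: String,
    pub score: f64,
}

/// Hypothetical accumulator standing in for `BenchRun`.
#[derive(Debug, Default)]
pub struct RunLog {
    pub records: Vec<ScenarioRecord>,
}

impl RunLog {
    /// IDs that already have a recorded result, used to skip work on resume.
    pub fn completed_ids(&self) -> HashSet<&str> {
        self.records.iter().map(|r| r.scenario_id.as_str()).collect()
    }

    /// Mean score over all records, or 0.0 for an empty run.
    pub fn mean_score(&self) -> f64 {
        if self.records.is_empty() {
            return 0.0;
        }
        self.records.iter().map(|r| r.score).sum::<f64>() / self.records.len() as f64
    }
}

/// Returns the scenario IDs that still need to run, preserving dataset order.
pub fn pending<'a>(all_ids: &[&'a str], log: &RunLog) -> Vec<&'a str> {
    let done = log.completed_ids();
    all_ids
        .iter()
        .copied()
        .filter(|id| !done.contains(*id))
        .collect()
}
```

Under these assumptions, a resumed run would load the previously persisted records into the log, call pending with the full list of scenario IDs, and evaluate only what comes back.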

§Quick Start

use std::path::Path;
use zeph_bench::{DatasetRegistry, loaders::{LocomoLoader, LocomoEvaluator}};
use zeph_bench::scenario::{DatasetLoader, Evaluator};

// 1. Discover available datasets.
let registry = DatasetRegistry::new();
let meta = registry.get("locomo").expect("locomo is built-in");
println!("dataset url: {}", meta.url);

// 2. Load scenarios from a locally cached file.
let scenarios = LocomoLoader.load(Path::new("/data/locomo.json")).unwrap();

// 3. Evaluate a response.
let result = LocomoEvaluator.evaluate(&scenarios[0], "some agent response");
println!("score={:.4} passed={}", result.score, result.passed);

§Deterministic Runs

By default the harness forces temperature=0.0 on the configured provider so that runs are reproducible. Pass --no-deterministic on the CLI or call apply_deterministic_overrides with no_deterministic = true to disable this behaviour.
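The override can be pictured with a small standalone sketch. GenerationSettings and pin_temperature are hypothetical names made up for illustration; the real signature of apply_deterministic_overrides and the shape of the provider configuration are not shown on this page and may differ.

```rust
/// Hypothetical generation settings; the real provider config may look different.
#[derive(Debug, Clone, PartialEq)]
pub struct GenerationSettings {
    pub temperature: Option<f32>,
}

/// Pins the temperature to 0.0 unless the caller opts out.
/// Returns true when the settings were actually changed.
pub fn pin_temperature(settings: &mut GenerationSettings, no_deterministic: bool) -> bool {
    if no_deterministic {
        // Opt-out: leave whatever the user configured untouched.
        return false;
    }
    let changed = settings.temperature != Some(0.0);
    settings.temperature = Some(0.0);
    changed
}
```

Returning a flag is one possible design choice: it lets a caller log that a user-configured temperature was overridden for the benchmark run.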

§Modules

Module          Purpose
baseline        Baseline comparison types and delta computation
channel         Headless BenchmarkChannel that drives the agent without I/O
cli             Clap subcommand definition (BenchCommand)
dataset         Dataset registry and metadata types
deterministic   Temperature-zero override helpers
error           BenchError error type
isolation       Per-scenario storage isolation (BenchIsolation)
loaders         Concrete loaders for LOCOMO, FRAMES, GAIA, LongMemEval, and tau-bench
results         Result types and ResultWriter
runner          BenchRunner that drives the agent loop over a dataset
scenario        Core traits (DatasetLoader, Evaluator) and scoring helpers
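The scoring helpers in the scenario module include a token-level F1 measure. The sketch below shows the common bag-of-tokens formulation of that metric. It is an assumption about the general technique, not the crate's token_f1: the function name token_f1_sketch, its signature, and the whitespace-and-lowercase normalization are chosen for illustration only.

```rust
use std::collections::HashMap;

/// Lowercases and splits on whitespace. Real normalizers often also strip
/// punctuation and articles; that is deliberately left out here.
fn tokens(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Token-level F1 between a prediction and a reference answer (illustrative only).
pub fn token_f1_sketch(prediction: &str, reference: &str) -> f64 {
    let pred = tokens(prediction);
    let gold = tokens(reference);
    if pred.is_empty() || gold.is_empty() {
        // Both empty counts as a match; exactly one empty counts as a miss.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    // Count reference tokens so a repeated word matches at most as often as it occurs.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for t in &gold {
        *remaining.entry(t.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for t in &pred {
        if let Some(n) = remaining.get_mut(t.as_str()) {
            if *n > 0 {
                *n -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

With this formulation, a prediction that contains every reference token plus some extra words keeps full recall but loses precision, so verbose answers score below exact ones.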

Re-exports§

pub use baseline::BaselineComparison;
pub use baseline::ScenarioDelta;
pub use channel::BenchmarkChannel;
pub use cli::BenchCommand;
pub use dataset::DatasetFormat;
pub use dataset::DatasetMeta;
pub use dataset::DatasetRegistry;
pub use deterministic::apply_deterministic_overrides;
pub use error::BenchError;
pub use isolation::BenchIsolation;
pub use results::Aggregate;
pub use results::BenchRun;
pub use results::ResultWriter;
pub use results::RunStatus;
pub use results::ScenarioResult;
pub use runner::BenchMemoryParams;
pub use runner::BenchRunner;
pub use runner::MemoryMode;
pub use runner::ResponseMode;
pub use runner::RunOptions;
pub use scenario::DatasetLoader;
pub use scenario::EvalResult;
pub use scenario::Evaluator;
pub use scenario::Role;
pub use scenario::Scenario;
pub use scenario::Turn;
pub use scenario::exact_match;
pub use scenario::gaia_normalized_exact_match;
pub use scenario::token_f1;

Modules§

baseline
channel
    Headless zeph_core::channel::Channel implementation for benchmark runs.
cli
    Clap subcommand definitions for zeph bench.
dataset
deterministic
    Helpers for pinning generation parameters to reproducible values.
error
isolation
loaders
    Concrete DatasetLoader and Evaluator implementations for each built-in dataset.
results
    Benchmark result types and writer.
runner
    Benchmark runner: drives Agent<BenchmarkChannel> over a dataset and collects results.
scenario