zeph-bench
Benchmark harness for evaluating Zeph agent performance on standardized datasets.
Feeds LOCOMO, GAIA, FRAMES, LongMemEval, and tau-bench tasks through the full Zeph agent loop and records correctness, latency, and token usage. Designed for reproducible baseline evaluation: no tools, no memory, no MCP — raw model capability only.
Baseline Results
gpt-5.4-mini, baseline mode, 2026-04-25:
| Dataset | Scorer | Scenarios | Mean score | Exact match |
|---|---|---|---|---|
| LOCOMO | Token F1 ≥ 0.5 | 11 | 1.0000 | 11/11 |
| GAIA | GAIA normalized exact | 8 | 1.0000 | 8/8 |
| FRAMES | Normalized exact match | 7 | 1.0000 | 7/7 |
| LongMemEval | Exact match + Token F1 | 6 | 1.0000 | 6/6 |
| tau-bench | Task completion (exact) | 5 | 1.0000 | 5/5 |
[!NOTE] Baseline mode injects a concise-answer system prompt and post-processes responses (first-line extraction, markdown strip) before scoring. This is the primary driver of score quality — without it, verbose answers fail both Token F1 and exact-match evaluators.
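The post-processing step described in the note can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names `first_line`, `strip_markdown`, and `postprocess` are hypothetical, and the set of markdown markers removed is an assumed subset.

```rust
/// Keep only the first non-empty line of a response (hypothetical helper).
fn first_line(response: &str) -> &str {
    response
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty())
        .unwrap_or("")
}

/// Strip an assumed subset of markdown from a single line: heading and
/// block-quote markers, one bullet marker, emphasis asterisks, and backticks.
fn strip_markdown(line: &str) -> String {
    let mut rest = line.trim();
    // Drop leading heading (#) and block-quote (>) markers.
    rest = rest
        .trim_start_matches(|c: char| c == '#' || c == '>')
        .trim_start();
    // Drop a bullet marker only when a space follows it, so "-5" survives.
    for bullet in ["- ", "* ", "+ "] {
        if let Some(stripped) = rest.strip_prefix(bullet) {
            rest = stripped;
            break;
        }
    }
    // Remove emphasis asterisks and inline-code backticks.
    rest.chars()
        .filter(|c| !matches!(c, '*' | '`'))
        .collect::<String>()
        .trim()
        .to_string()
}

/// First-line extraction followed by markdown stripping.
fn postprocess(response: &str) -> String {
    strip_markdown(first_line(response))
}
```

Under these assumptions, `postprocess("**Paris**\n\nParis is the capital of France.")` yields `Paris`, which an exact-match scorer can compare directly against the reference answer.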
CLI Usage
zeph-bench is invoked through the main zeph binary (requires the bench feature):
# List available datasets
# Run GAIA sample
# Run a single scenario for debugging
# Resume an interrupted run
[!TIP]
--provider references a named entry from [[llm.providers]] in your config. If omitted, the default provider is used. Use a fast, cheap model for large evaluation runs.
The output directory receives two files: results.json (machine-readable) and summary.md
(a human-readable markdown table).
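As a rough illustration of what a human-readable summary table could look like, the sketch below renders per-dataset rows as markdown. The actual layout of summary.md is not specified here; the `DatasetSummary` struct, its fields, and `render_summary` are hypothetical, and the columns simply mirror the Baseline Results table.

```rust
use std::fmt::Write;

/// Hypothetical per-dataset aggregate; not the crate's real result type.
struct DatasetSummary {
    dataset: String,
    scorer: String,
    scenarios: usize,
    mean_score: f64,
    exact_matches: usize,
}

/// Render the rows as a markdown table with the Baseline Results columns.
fn render_summary(rows: &[DatasetSummary]) -> String {
    let mut out = String::from(
        "| Dataset | Scorer | Scenarios | Mean score | Exact match |\n|---|---|---|---|---|\n",
    );
    for row in rows {
        // Formatting into a String does not fail, so the Result is ignored.
        let _ = writeln!(
            out,
            "| {} | {} | {} | {:.4} | {}/{} |",
            row.dataset,
            row.scorer,
            row.scenarios,
            row.mean_score,
            row.exact_matches,
            row.scenarios
        );
    }
    out
}
```

The returned string could then be written to disk with `std::fs::write`.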
Library Usage
use Path;
use ;
use ;
use ;
# async
Implementing a custom dataset
use ;
use Path;
;
;
Supported Datasets
| Dataset | Format | Scorer | Status |
|---|---|---|---|
| LOCOMO | JSON | Token F1 ≥ 0.5 | Ready |
| GAIA | JSONL | Normalized exact match | Ready |
| FRAMES | JSONL | Normalized exact match | Ready |
| LongMemEval | JSONL | Exact match + Token F1 | Ready |
| tau-bench | JSON | Task completion (exact) | Ready |
[!IMPORTANT] Requires Rust 1.95 or later.
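To make the "Token F1 ≥ 0.5" scorer concrete, here is a minimal sketch of a token-level F1 computation with a pass threshold. The tokenization (lowercasing, splitting on non-alphanumeric characters) and the names `tokenize`, `token_f1`, and `passes` are assumptions for illustration; the harness's own normalization may differ.

```rust
use std::collections::HashMap;

/// Lowercase the text and split it on non-alphanumeric characters (assumed normalization).
fn tokenize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(str::to_string)
        .collect()
}

/// Token-level F1: harmonic mean of precision and recall over the
/// multiset overlap between predicted and reference tokens.
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred = tokenize(prediction);
    let gold = tokenize(reference);
    if pred.is_empty() || gold.is_empty() {
        // Two empty answers count as a match; a one-sided empty answer as a miss.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    // Count reference tokens, then consume them as predicted tokens match.
    let mut remaining: HashMap<&str, usize> = HashMap::new();
    for token in &gold {
        *remaining.entry(token.as_str()).or_insert(0) += 1;
    }
    let mut overlap = 0usize;
    for token in &pred {
        if let Some(count) = remaining.get_mut(token.as_str()) {
            if *count > 0 {
                *count -= 1;
                overlap += 1;
            }
        }
    }
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / pred.len() as f64;
    let recall = overlap as f64 / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}

/// A response passes when its F1 reaches the 0.5 threshold from the table.
fn passes(prediction: &str, reference: &str) -> bool {
    token_f1(prediction, reference) >= 0.5
}
```

With this tokenization, the prediction "The capital is Paris" against the reference "Paris" has precision 0.25 and recall 1.0, giving an F1 of 0.4, which falls below the threshold. This is the kind of verbose answer that fails without a concise-answer prompt.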
Architecture
The harness is built on three composable traits:
- DatasetLoader: reads a dataset file and returns Vec<Scenario>
- Evaluator: scores one agent response against a Scenario
- BenchmarkChannel: headless Channel impl that drives the agent loop without a terminal
BenchRunner wires them together: one fresh Agent<BenchmarkChannel> per scenario, no shared
state between runs. Results accumulate into a BenchRun and are persisted by ResultWriter.
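The composition described above can be sketched in simplified, synchronous form. Everything below is hypothetical: the trait method signatures, the `Scenario` fields, `ScenarioResult`, `ExactMatch`, and `run_all` are assumptions chosen for illustration and are not the crate's API. In particular, the agent and its BenchmarkChannel are replaced by a plain closure that maps a prompt to a response.

```rust
use std::path::Path;

/// One benchmark item (assumed shape).
#[derive(Debug, Clone)]
struct Scenario {
    id: String,
    prompt: String,
    expected: String,
}

/// Reads a dataset file and returns its scenarios (assumed signature).
trait DatasetLoader {
    fn load(&self, path: &Path) -> std::io::Result<Vec<Scenario>>;
}

/// Scores one agent response against a scenario (assumed signature).
trait Evaluator {
    fn score(&self, scenario: &Scenario, response: &str) -> f64;
}

/// Example evaluator: case-insensitive exact match on trimmed text.
struct ExactMatch;

impl Evaluator for ExactMatch {
    fn score(&self, scenario: &Scenario, response: &str) -> f64 {
        if response.trim().eq_ignore_ascii_case(scenario.expected.trim()) {
            1.0
        } else {
            0.0
        }
    }
}

/// Per-scenario outcome (stand-in for what a run would accumulate).
struct ScenarioResult {
    id: String,
    score: f64,
}

/// Stand-in for the runner: builds a fresh agent per scenario so that
/// no state is shared between runs.
fn run_all<A, F>(scenarios: &[Scenario], evaluator: &dyn Evaluator, make_agent: F) -> Vec<ScenarioResult>
where
    F: Fn() -> A,
    A: FnMut(&str) -> String,
{
    scenarios
        .iter()
        .map(|scenario| {
            let mut agent = make_agent(); // fresh agent, no carried-over state
            let response = agent(&scenario.prompt);
            ScenarioResult {
                id: scenario.id.clone(),
                score: evaluator.score(scenario, &response),
            }
        })
        .collect()
}
```

Passing a factory rather than a single agent is what enforces the isolation property: each scenario sees a newly constructed agent, so earlier answers cannot leak into later ones.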
Installation
[dependencies]
zeph-bench = "0.20"
This crate is part of the Zeph workspace. See the API documentation for the complete reference.
License
Licensed under MIT OR Apache-2.0 — see LICENSE for details.