brainwires-eval 0.11.0

Evaluation harness for the Brainwires Agent Framework — fixtures, regression suites, stability tests, adversarial cases, ranking metrics (NDCG / MRR / precision@k), recorder.

Overview

A self-contained framework for writing and running evaluation cases against agents, or against any other system under test. Cases are deterministic where possible; where they are not, quality is measured statistically over repeated trials and with ranking metrics.
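When a case is not deterministic, a single pass/fail result says little. The usual remedy, and the idea behind the Monte Carlo runner listed under `suite`, is to repeat the case and put a confidence interval on the pass rate. The sketch below illustrates that idea only. `Outcome`, `wilson_interval` and `run_monte_carlo` are hypothetical local names, not this crate's `EvaluationCase`, `TrialResult` or `EvaluationSuite` API, and a closure stands in for a real case.

```rust
/// Result of a single trial. Hypothetical stand-in, not the crate's `TrialResult`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Fail,
}

/// Wilson score interval for `successes` out of `trials`.
/// `z` is the standard-normal quantile, e.g. 1.96 for roughly 95% confidence.
pub fn wilson_interval(successes: u32, trials: u32, z: f64) -> (f64, f64) {
    if trials == 0 {
        // No evidence at all: the pass rate could be anything in [0, 1].
        return (0.0, 1.0);
    }
    let n = f64::from(trials);
    let p = f64::from(successes) / n;
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half_width = (z / denom) * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt();
    // Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    ((center - half_width).max(0.0), (center + half_width).min(1.0))
}

/// Runs a case `trials` times and returns the pass count plus a ~95% Wilson interval.
/// The closure receives the trial index and stands in for a real evaluation case.
pub fn run_monte_carlo<F>(mut case: F, trials: u32) -> (u32, (f64, f64))
where
    F: FnMut(u32) -> Outcome,
{
    let passes = (0..trials).filter(|&i| case(i) == Outcome::Pass).count() as u32;
    (passes, wilson_interval(passes, trials, 1.96))
}
```

A flaky case is then just a closure that sometimes returns `Outcome::Fail`. The Wilson interval is used rather than the plain normal approximation because it stays within [0, 1] and remains informative when the pass rate is at or near 0 or 1; a wide interval signals that more trials are needed before a change in pass rate can be trusted.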

It began as an internal-only brainwires-eval module, was later folded into brainwires-agent, and was re-extracted in 0.11 (Phase 11e) so that the framework's evaluation surface lives in its own standalone crate. It has no dependencies on any other brainwires-* crate.

Modules

case — EvaluationCase trait
trial — TrialResult + EvaluationStats
suite — EvaluationSuite + SuiteResult (Monte Carlo runner with Wilson confidence intervals)
fixtures — YAML-based fixture cases (tests/fixtures/*.yaml)
regression — RegressionSuite for change-detection runs
stability_tests — flakiness detection across repeated runs
adversarial — adversarial case generation
recorder — recording trial results to disk
fault_report — structured fault reports
ranking_metrics — ndcg_at_k, mrr, precision_at_k (with graded relevance support)
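As a rough reference for what the three ranking metrics compute, here is a dependency-free sketch over a list of relevance grades in ranked order (index 0 is the top result, grade 0.0 means "not relevant"). It is not the crate's implementation: these local functions share some names with the crate's exports, but the crate's real signatures are not shown here and may differ. The exponential gain `2^g - 1` for graded NDCG, and deriving the ideal ordering from the returned list alone (rather than from every relevant item in the corpus), are assumptions of this sketch.

```rust
/// Fraction of the top `k` results whose grade is positive.
/// Divides by `k` even if fewer than `k` results were returned.
pub fn precision_at_k(grades: &[f64], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = grades.iter().take(k).filter(|&&g| g > 0.0).count();
    hits as f64 / k as f64
}

/// Reciprocal rank of the first relevant result (0.0 if none is relevant).
pub fn reciprocal_rank(grades: &[f64]) -> f64 {
    grades
        .iter()
        .position(|&g| g > 0.0)
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean reciprocal rank over several queries' ranked grade lists.
pub fn mean_reciprocal_rank(queries: &[Vec<f64>]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    queries.iter().map(|q| reciprocal_rank(q)).sum::<f64>() / queries.len() as f64
}

/// Discounted cumulative gain of the first `k` grades, using gain 2^g - 1
/// and a log2(rank + 1) discount (rank is 1-based).
fn dcg_at_k(grades: &[f64], k: usize) -> f64 {
    grades
        .iter()
        .take(k)
        .enumerate()
        .map(|(i, &g)| (2f64.powf(g) - 1.0) / (i as f64 + 2.0).log2())
        .sum()
}

/// NDCG@k: DCG of the given order divided by DCG of the same grades sorted best-first.
pub fn ndcg_at_k(grades: &[f64], k: usize) -> f64 {
    let mut ideal = grades.to_vec();
    // Sort descending; NaN grades (if any) compare as equal.
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap_or(std::cmp::Ordering::Equal));
    let idcg = dcg_at_k(&ideal, k);
    if idcg == 0.0 {
        0.0
    } else {
        dcg_at_k(grades, k) / idcg
    }
}
```

With graded relevance, NDCG rewards placing the highest grades first, whereas precision@k and MRR in this sketch only check whether a grade is positive.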

Migration from `brainwires-agent::eval`

```toml
# Before
brainwires-agent = { features = ["eval"] }

# After
brainwires-eval = "0.11"
```

```rust
// Before
use brainwires_agent::eval::{EvaluationCase, TrialResult, ndcg_at_k};

// After
use brainwires_eval::{EvaluationCase, TrialResult, ndcg_at_k};
```

License