§brainwires-eval
Evaluation framework for Brainwires agents.
§What’s included
| Module | Key types | Purpose |
|---|---|---|
| `trial` | `TrialResult`, `EvaluationStats` | Per-trial results and Wilson-score 95% CI |
| `case` | `EvaluationCase` | Trait for a single evaluatable scenario |
| `suite` | `EvaluationSuite`, `SuiteResult` | N-trial Monte Carlo runner |
| `recorder` | `ToolSequenceRecorder`, `SequenceDiff` | Record and diff tool-call sequences |
| `adversarial` | `AdversarialTestCase` | Prompt injection, ambiguity, budget stress |
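The Wilson-score interval mentioned for `trial` can be illustrated with a small standalone sketch. It does not use the crate's `EvaluationStats` or `ConfidenceInterval95` types, whose fields and methods are not shown on this page; the function name `wilson_interval` and its signature are assumptions made for illustration only. It shows how a pass count over N trials becomes an interval that stays inside [0, 1] even for small N.

```rust
/// Wilson score interval for a binomial proportion.
/// NOTE: hypothetical helper for illustration, not the brainwires-eval API.
/// Returns `None` when there are no trials or the counts are inconsistent.
fn wilson_interval(successes: u32, trials: u32, z: f64) -> Option<(f64, f64)> {
    if trials == 0 || successes > trials {
        return None;
    }
    let n = trials as f64;
    let p = successes as f64 / n; // observed pass rate
    let z2 = z * z;
    let denom = 1.0 + z2 / n;
    // Centre of the interval is pulled towards 0.5 for small n.
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;
    // Clamp to guard against tiny floating-point overshoot.
    Some(((center - half).max(0.0), (center + half).min(1.0)))
}

fn main() {
    // z = 1.96 is the usual normal quantile for a two-sided 95% interval.
    let z = 1.96;
    for (passed, total) in [(8u32, 10u32), (80, 100), (0, 10)] {
        if let Some((lo, hi)) = wilson_interval(passed, total, z) {
            println!("{passed}/{total} passed: 95% CI = [{lo:.3}, {hi:.3}]");
        }
    }
}
```

With the same observed pass rate, the interval narrows as the trial count grows, which is the usual reason to run a case many times before comparing agents.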
Re-exports§
pub use trial::ConfidenceInterval95;
pub use trial::EvaluationStats;
pub use trial::TrialResult;
pub use case::AlwaysFailCase;
pub use case::AlwaysPassCase;
pub use case::EvaluationCase;
pub use case::StochasticCase;
pub use suite::EvaluationSuite;
pub use suite::SuiteConfig;
pub use suite::SuiteResult;
pub use recorder::SequenceDiff;
pub use recorder::ToolCallRecord;
pub use recorder::ToolSequenceRecorder;
pub use adversarial::AdversarialTestCase;
pub use adversarial::AdversarialTestType;
pub use regression::CategoryBaseline;
pub use regression::CategoryRegressionResult;
pub use regression::RegressionConfig;
pub use regression::RegressionResult;
pub use regression::RegressionSuite;
pub use stability_tests::GoalPreservationCase;
pub use stability_tests::LoopDetectionSimCase;
pub use stability_tests::long_horizon_stability_suite;
pub use fault_report::FaultKind;
pub use fault_report::FaultReport;
pub use fault_report::analyze_suite_for_faults;
pub use fixtures::Assertion;
pub use fixtures::ExpectedBehavior;
pub use fixtures::Fixture;
pub use fixtures::FixtureCase;
pub use fixtures::FixtureMessage;
pub use fixtures::FixtureRunner;
pub use fixtures::RunOutcome;
pub use fixtures::load_fixture_file;
pub use fixtures::load_fixtures_from_dir;
pub use ranking_metrics::mrr;
pub use ranking_metrics::ndcg_at_k;
pub use ranking_metrics::precision_at_k;
Modules§
- `adversarial`: Adversarial test cases for robustness evaluation.
- `case`: The `EvaluationCase` trait, the unit of evaluation.
- `fault_report`: Fault classification for eval-driven autonomous self-improvement.
- `fixtures`: YAML-backed golden-prompt fixtures for the evaluation framework.
- `ranking_metrics`: Ranking quality metrics for information retrieval evaluation.
- `recorder`: Tool call sequence recording and diff.
- `regression`: Regression testing infrastructure for CI integration.
- `stability_tests`: Long-horizon stability test cases for the Brainwires evaluation framework.
- `suite`: Evaluation suite with an N-trial Monte Carlo runner.
- `trial`: Evaluation trial results and statistical analysis.
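For readers unfamiliar with the three metrics the ranking module is named after (precision at k, mean reciprocal rank, and nDCG at k), here is a minimal standalone sketch of their textbook definitions. The `sketch_*` functions, their signatures, and their input representations (boolean relevance flags, `f64` gains) are assumptions chosen for illustration; the crate's own `precision_at_k`, `mrr`, and `ndcg_at_k` may take different arguments and handle edge cases differently.

```rust
// NOTE: hypothetical standalone helpers, not the brainwires-eval API.

/// Fraction of the top-k results that are relevant.
/// A list shorter than k is treated as padded with non-relevant results.
fn sketch_precision_at_k(relevant: &[bool], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = relevant.iter().take(k).filter(|&&r| r).count();
    hits as f64 / k as f64
}

/// Mean over queries of 1 / (rank of the first relevant result).
/// A query with no relevant result contributes 0.
fn sketch_mrr(queries: &[Vec<bool>]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    let total: f64 = queries
        .iter()
        .map(|q| q.iter().position(|&r| r).map_or(0.0, |i| 1.0 / (i as f64 + 1.0)))
        .sum();
    total / queries.len() as f64
}

/// Discounted cumulative gain with linear gains and a log2 position discount.
fn dcg(gains: &[f64], k: usize) -> f64 {
    gains
        .iter()
        .take(k)
        .enumerate()
        .map(|(i, g)| g / (i as f64 + 2.0).log2())
        .sum()
}

/// DCG divided by the DCG of the same gains sorted in descending order.
/// Simplification: the ideal ordering is built from the retrieved list only.
fn sketch_ndcg_at_k(gains: &[f64], k: usize) -> f64 {
    let mut ideal = gains.to_vec();
    ideal.sort_by(|a, b| b.total_cmp(a));
    let idcg = dcg(&ideal, k);
    if idcg == 0.0 { 0.0 } else { dcg(gains, k) / idcg }
}

fn main() {
    let relevant = [true, false, true, false];
    println!("precision@2 = {:.3}", sketch_precision_at_k(&relevant, 2));

    let queries = vec![vec![false, true], vec![true, false]];
    println!("MRR = {:.3}", sketch_mrr(&queries));

    let gains = [1.0, 3.0, 2.0];
    println!("nDCG@3 = {:.3}", sketch_ndcg_at_k(&gains, 3));
}
```

All three scores fall in [0, 1], with 1 meaning the ranking is as good as it can be under that metric.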