Crate brainwires_eval


§brainwires-eval

Evaluation framework for Brainwires agents.

§What’s included

| Module | Key type | Purpose |
|---|---|---|
| trial | TrialResult, EvaluationStats | Per-trial results + Wilson-score 95% CI |
| case | EvaluationCase | Trait for a single evaluatable scenario |
| suite | EvaluationSuite, SuiteResult | N-trial Monte Carlo runner |
| recorder | ToolSequenceRecorder, SequenceDiff | Record + diff tool call sequences |
| adversarial | AdversarialTestCase | Prompt injection, ambiguity, budget stress |
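
The Wilson-score 95% CI mentioned for the trial module can be illustrated with a minimal, standalone sketch. The function name `wilson_ci_95` and its signature are assumptions made for illustration; they are not taken from this crate, whose `ConfidenceInterval95` type may be computed and shaped differently.

```rust
/// Wilson score interval for a binomial proportion at roughly 95% confidence.
/// Hypothetical helper for illustration; not this crate's API.
/// Returns `(lower, upper)`, both clamped to [0, 1].
fn wilson_ci_95(successes: u64, trials: u64) -> (f64, f64) {
    if trials == 0 {
        // With no trials the interval carries no information.
        return (0.0, 1.0);
    }
    let z = 1.96_f64; // two-sided 95% quantile of the standard normal
    let n = trials as f64;
    let p = successes as f64 / n; // observed success rate
    let z2 = z * z;

    // Wilson interval: a shifted center plus/minus a half-width,
    // both scaled by the same denominator.
    let denom = 1.0 + z2 / n;
    let center = (p + z2 / (2.0 * n)) / denom;
    let half = z * (p * (1.0 - p) / n + z2 / (4.0 * n * n)).sqrt() / denom;

    ((center - half).max(0.0), (center + half).min(1.0))
}
```

Unlike the naive normal approximation, this interval stays inside [0, 1] and remains usable when the observed rate is 0 or 1, which is common when a case passes or fails every one of a small number of trials.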

Re-exports§

pub use trial::ConfidenceInterval95;
pub use trial::EvaluationStats;
pub use trial::TrialResult;
pub use case::AlwaysFailCase;
pub use case::AlwaysPassCase;
pub use case::EvaluationCase;
pub use case::StochasticCase;
pub use suite::EvaluationSuite;
pub use suite::SuiteConfig;
pub use suite::SuiteResult;
pub use recorder::SequenceDiff;
pub use recorder::ToolCallRecord;
pub use recorder::ToolSequenceRecorder;
pub use adversarial::AdversarialTestCase;
pub use adversarial::AdversarialTestType;
pub use regression::CategoryBaseline;
pub use regression::CategoryRegressionResult;
pub use regression::RegressionConfig;
pub use regression::RegressionResult;
pub use regression::RegressionSuite;
pub use stability_tests::GoalPreservationCase;
pub use stability_tests::LoopDetectionSimCase;
pub use stability_tests::long_horizon_stability_suite;
pub use fault_report::FaultKind;
pub use fault_report::FaultReport;
pub use fault_report::analyze_suite_for_faults;
pub use fixtures::Assertion;
pub use fixtures::ExpectedBehavior;
pub use fixtures::Fixture;
pub use fixtures::FixtureCase;
pub use fixtures::FixtureMessage;
pub use fixtures::FixtureRunner;
pub use fixtures::RunOutcome;
pub use fixtures::load_fixture_file;
pub use fixtures::load_fixtures_from_dir;
pub use ranking_metrics::mrr;
pub use ranking_metrics::ndcg_at_k;
pub use ranking_metrics::precision_at_k;
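
The three ranking metrics re-exported last (`mrr`, `ndcg_at_k`, `precision_at_k`) follow standard information-retrieval definitions. The sketch below shows those definitions with deliberately different, hypothetical names and input types; the crate's actual signatures are not shown on this page and may differ.

```rust
/// Reciprocal rank of the first relevant item (ranks are 1-based); 0.0 if none.
fn reciprocal_rank(relevant: &[bool]) -> f64 {
    relevant
        .iter()
        .position(|&r| r)
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean reciprocal rank over several queries; 0.0 for an empty query set.
fn mean_reciprocal_rank(queries: &[Vec<bool>]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    queries.iter().map(|q| reciprocal_rank(q)).sum::<f64>() / queries.len() as f64
}

/// Fraction of the top-k results that are relevant. The divisor is k even
/// when fewer than k results were returned.
fn precision_at(relevant: &[bool], k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = relevant.iter().take(k).filter(|&&r| r).count();
    hits as f64 / k as f64
}

/// Discounted cumulative gain over the top-k graded gains.
fn dcg_at(gains: &[f64], k: usize) -> f64 {
    gains
        .iter()
        .take(k)
        .enumerate()
        .map(|(i, g)| g / (i as f64 + 2.0).log2())
        .sum()
}

/// Normalized DCG: DCG of the given order divided by DCG of the ideal
/// (descending-gain) order; 0.0 when the ideal DCG is zero.
fn ndcg_at(gains: &[f64], k: usize) -> f64 {
    let mut ideal = gains.to_vec();
    ideal.sort_by(|a, b| b.total_cmp(a));
    let best = dcg_at(&ideal, k);
    if best == 0.0 { 0.0 } else { dcg_at(gains, k) / best }
}
```

In this sketch, relevance is a `bool` per ranked result for MRR and precision, and a graded `f64` gain for nDCG; a real API might instead take document IDs plus a set of relevant IDs.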

Modules§

adversarial
Adversarial test cases for robustness evaluation.
case
The EvaluationCase trait — the unit of evaluation.
fault_report
Fault classification for eval-driven autonomous self-improvement.
fixtures
YAML-backed golden-prompt fixtures for the evaluation framework.
ranking_metrics
Ranking quality metrics for information retrieval evaluation.
recorder
Tool call sequence recording and diff.
regression
Regression testing infrastructure for CI integration.
stability_tests
Long-horizon stability test cases for the Brainwires evaluation framework.
suite
Evaluation suite — N-trial Monte Carlo runner.
trial
Evaluation trial results and statistical analysis.
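
The idea behind an N-trial Monte Carlo runner (the suite and trial modules above) can be sketched without the crate itself. Everything in the block below is hypothetical: `SketchCase`, `CoinFlipCase`, `SketchStats`, and `run_trials` are stand-ins invented for illustration, and the real `EvaluationCase`, `StochasticCase`, `EvaluationSuite`, and `EvaluationStats` types may be asynchronous and structured quite differently.

```rust
/// Hypothetical stand-in for an evaluation case: one trial either passes or fails.
trait SketchCase {
    fn run(&self, trial: u64) -> bool;
}

/// Passes with a fixed probability. Randomness comes from a seeded
/// SplitMix64 hash so repeated runs are reproducible.
struct CoinFlipCase {
    pass_rate: f64,
    seed: u64,
}

/// SplitMix64 mixing function (a small, well-known 64-bit hash/PRNG step).
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

impl SketchCase for CoinFlipCase {
    fn run(&self, trial: u64) -> bool {
        // Keep the top 53 bits and scale to a uniform f64 in [0, 1).
        let bits = splitmix64(self.seed ^ trial) >> 11;
        let u = bits as f64 / (1u64 << 53) as f64;
        u < self.pass_rate
    }
}

/// Aggregate outcome of running one case for several trials.
struct SketchStats {
    trials: u64,
    successes: u64,
}

impl SketchStats {
    /// Observed success rate; 0.0 when no trials were run.
    fn success_rate(&self) -> f64 {
        if self.trials == 0 {
            0.0
        } else {
            self.successes as f64 / self.trials as f64
        }
    }
}

/// Run the case `n` times and count how many trials pass.
fn run_trials(case: &dyn SketchCase, n: u64) -> SketchStats {
    let successes = (0..n).filter(|&t| case.run(t)).count() as u64;
    SketchStats { trials: n, successes }
}
```

Repeating a stochastic case many times turns a single pass/fail signal into an estimated success rate, which is what makes a confidence interval over the trials meaningful.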