swarm-engine-eval - Evaluation Framework for SwarmEngine
Evaluation framework for multi-agent systems. Builds and executes a SwarmApp from TOML scenario definitions for reproducible evaluation.
§Design Philosophy
Inspired by Python evaluation frameworks such as lm-evaluation-harness and RAGAS, this crate provides a comprehensive evaluation foundation for Rust/SwarmEngine:
- Declarative Scenario Definition: Describe evaluation conditions and success criteria in TOML (see the sketch after this list)
- Reproducibility: Deterministic evaluation execution through seed management
- Extensibility: Plugin architecture via Registry pattern
- Statistical Analysis: Aggregation of N runs, pass@k, confidence intervals
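As a minimal sketch of the declarative format, the snippet below parses an inline TOML scenario. The key names (meta, agents.workers, conditions, and the per-condition name/metric/op/value keys) are assumptions inferred from the programmatic example further down this page, so the actual schema may differ.

use swarm_engine_eval::scenario::EvalScenario;

// Hypothetical scenario in TOML form; key names mirror the struct fields shown
// in the programmatic example below and are not confirmed against the schema.
let toml_src = r#"
[meta]
name = "My Evaluation"
id = "my:eval:v1"

[conditions]
on_timeout = "fail"

[[agents.workers]]
id_pattern = "worker_{i}"
count = 4
role = "counter"
config = { total_tasks = 10 }

[[conditions.success]]
name = "done"
metric = "task.completed_count"
op = "gte"
value = 40
"#;

let scenario: EvalScenario = toml::from_str(toml_src)?;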
§Architecture
┌────────────────────────────────────────────────────────────────────────┐
│                            scenarios/*.toml                            │
│  ┌────────┐  ┌────────────┐  ┌────────┐  ┌────────────┐                │
│  │  meta  │  │ app_config │  │ agents │  │ conditions │                │
│  └────────┘  └────────────┘  └────────┘  └────────────┘                │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│                             ScenarioRunner                             │
│  ┌─────────────────┐  ┌─────────────────────┐  ┌───────────────────┐   │
│  │  SwarmRegistry  │  │ EnvironmentRegistry │  │   Seed Manager    │   │
│  │ (Agent Factory) │  │  (Fixture Factory)  │  │ (Reproducibility) │   │
│  └─────────────────┘  └─────────────────────┘  └───────────────────┘   │
│                                   │                                    │
│                                   ▼                                    │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                             SwarmApp                             │  │
│  │        Workers + Manager + Hooks → Orchestrator → Outcome        │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                   │                                    │
│                                   ▼                                    │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                   Condition Evaluation Engine                    │  │
│  │        success / failure conditions / timeout / milestone        │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌────────────────────────────────────────────────────────────────────────┐
│                               EvalReport                               │
│        ConfigSummary + runs[] + AggregatedResults + Assertions         │
│                             ↓ JSON output                              │
│                              report.json                               │
└────────────────────────────────────────────────────────────────────────┘
§Features
- N-Run Statistical Processing: Mean, standard deviation, 95% confidence interval
- pass@k Calculation: Success probability accounting for non-determinism (see the sketch after this list)
- Tick-Level Metrics: latency p95/p99, jitter, miss rate
- Coordination Metrics: Manager intervention rate, delegation efficiency
- Condition Evaluation: success/failure conditions, timeout handling, milestones
- Fault Injection: Effect-based declarative fault injection (TODO)
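The pass@k figures can be read as the standard unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k) over n runs with c successes. The sketch below computes it as a free-standing function; that the aggregator module uses exactly this formula is an assumption.

// Standalone sketch of the usual pass@k estimator; not necessarily the exact
// computation performed by the aggregator module.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    // n = total runs, c = successful runs, k = attempts granted
    debug_assert!(c <= n);
    if n - c < k {
        return 1.0; // fewer than k failures exist, so k draws must hit a success
    }
    // pass@k = 1 - C(n - c, k) / C(n, k), evaluated as a running product
    let mut all_fail = 1.0;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - k as f64 / i as f64;
    }
    1.0 - all_fail
}

let p = pass_at_k(5, 3, 1); // 3 successes in 5 runs → pass@1 = 0.6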
§Key Components
| Module | Description |
|---|---|
| runner | ScenarioRunner - Integrated evaluation framework |
| scenario | EvalScenario, Conditions, Milestone definitions |
| environment | EnvironmentRegistry - Evaluation environment factory |
| swarms | SwarmRegistry - Agent generation factory |
| metrics | Task, Coordination, and Performance metrics |
| aggregator | Statistical aggregation (pass@k, confidence intervals) |
| reporter | EvalReport, JSON output |
§Usage Examples
§Running Evaluation from TOML Scenario
use swarm_engine_eval::prelude::*;
use swarm_engine_eval::runner::ScenarioRunner;

// A Tokio runtime (or an existing handle) is required by ScenarioRunner::new
let runtime = tokio::runtime::Runtime::new()?;

// Load scenario from TOML file
let content = std::fs::read_to_string("scenarios/simple_task.toml")?;
let scenario: EvalScenario = toml::from_str(&content)?;

// Execute evaluation with ScenarioRunner
let runner = ScenarioRunner::new(scenario, runtime.handle().clone())
    .with_runs(5)   // Run 5 times
    .with_seed(42); // Fix seed for reproducibility

let report = runner.run()?;

// Display results
println!("Success rate: {:.1}%", report.aggregated.success_rate * 100.0);
println!("Pass@1: {:.1}%", report.aggregated.pass_at_1 * 100.0);

// JSON output
report.to_json_file("report.json")?;
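The serialized report can then be consumed by downstream tooling, for example as a CI gate. A small sketch, assuming the JSON keys mirror the field accessors above (aggregated, success_rate); adjust to the actual serialized layout.

// Sketch: read report.json back and enforce a threshold. The key names
// "aggregated" / "success_rate" are assumed from the accessors above.
let json = std::fs::read_to_string("report.json")?;
let value: serde_json::Value = serde_json::from_str(&json)?;
if let Some(rate) = value["aggregated"]["success_rate"].as_f64() {
    assert!(rate >= 0.8, "success rate below threshold: {rate}");
}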
§Programmatic Scenario Definition
use swarm_engine_eval::scenario::*;

let scenario = EvalScenario {
    meta: ScenarioMeta {
        name: "My Evaluation".to_string(),
        id: ScenarioId::new("my:eval:v1"),
        // ...
    },
    app_config: AppConfigTemplate::default(),
    agents: AgentsConfig {
        workers: vec![WorkerTemplate {
            id_pattern: "worker_{i}".to_string(),
            count: 4,
            role: "counter".to_string(),
            config: serde_json::json!({"total_tasks": 10}),
        }],
        managers: vec![],
    },
    conditions: EvalConditions {
        success: vec![Condition::new("done", "task.completed_count", CompareOp::Gte, 40)],
        failure: vec![],
        on_timeout: TimeoutBehavior::Fail,
    },
    // ...
};
§Future Extensions
- SwarmRegistry: Manager support
- EnvironmentRegistry: task_queue, shared_workspace, etc.
- Comparator: Multi-scenario/configuration comparison
- FaultInjector: Fault injection framework
§Modules
- aggregator: Aggregator - Statistical calculations
- config: Evaluation configuration
- environment: Environment Registry - evaluation environment factory
- environments: Eval environment modules
- error: Error types for swarm-engine-eval
- metrics: Metrics structures
- prelude: Prelude - commonly used types for convenient import
- reporter: Report generation
- run: EvalRun - Single evaluation run result
- runner: EvalRunner - evaluation execution using LearnableSwarm
- runtime: Runtime Task Specification
- scenario: Eval scenario management
- validation: Scenario Validation