
Crate swarm_engine_eval


swarm-engine-eval - Evaluation Framework for SwarmEngine

An evaluation framework for multi-agent systems. It builds and executes a SwarmApp from TOML scenario definitions for reproducible evaluation.

§Design Philosophy

Inspired by Python evaluation frameworks such as lm-evaluation-harness and RAGAS, this crate provides a comprehensive evaluation foundation for Rust/SwarmEngine:

  1. Declarative Scenario Definition: Describe evaluation conditions and success criteria in TOML
  2. Reproducibility: Deterministic evaluation execution through seed management
  3. Extensibility: Plugin architecture via Registry pattern
  4. Statistical Analysis: Aggregation of N runs, pass@k, confidence intervals
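As a sketch of the declarative style in item 1, a minimal scenario file might look like the following. The top-level `meta`, `app_config`, `agents`, and `conditions` tables mirror the programmatic example later on this page; the individual condition keys (`metric`, `op`, `value`) and the exact serde layout are illustrative assumptions, not the crate's confirmed schema:

```toml
# Hypothetical scenario file; key names inside [conditions] are assumptions.
[meta]
name = "My Evaluation"
id = "my:eval:v1"

[[agents.workers]]
id_pattern = "worker_{i}"
count = 4
role = "counter"
config = { total_tasks = 10 }

[conditions]
on_timeout = "fail"

[[conditions.success]]
name = "done"
metric = "task.completed_count"
op = "gte"
value = 40
```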

§Architecture

┌────────────────────────────────────────────────────────────────┐
│                    scenarios/*.toml                            │
│  ┌────────────┐  ┌─────────────┐  ┌────────────┐  ┌──────────┐ │
│  │    meta    │  │  app_config │  │   agents   │  │conditions│ │
│  └────────────┘  └─────────────┘  └────────────┘  └──────────┘ │
└────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────┐
│                      ScenarioRunner                            │
│  ┌─────────────────┐ ┌───────────────────┐ ┌─────────────────┐ │
│  │  SwarmRegistry  │ │EnvironmentRegistry│ │  Seed Manager   │ │
│  │ (Agent Factory) │ │ (Fixture Factory) │ │(Reproducibility)│ │
│  └─────────────────┘ └───────────────────┘ └─────────────────┘ │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐ │
│  │                      SwarmApp                            │ │
│  │   Workers + Manager + Hooks → Orchestrator → Outcome     │ │
│  └──────────────────────────────────────────────────────────┘ │
│                              │                                 │
│                              ▼                                 │
│  ┌──────────────────────────────────────────────────────────┐ │
│  │                Condition Evaluation Engine               │ │
│  │   success / failure conditions / timeout / milestone     │ │
│  └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
                             │
                             ▼
┌────────────────────────────────────────────────────────────────┐
│                        EvalReport                              │
│   ConfigSummary + runs[] + AggregatedResults + Assertions      │
│                      ↓ JSON output                             │
│                 report.json                                    │
└────────────────────────────────────────────────────────────────┘

§Features

  • N-Run Statistical Processing: Mean, standard deviation, 95% confidence interval
  • pass@k Calculation: Success probability accounting for non-determinism
  • Tick-Level Metrics: latency p95/p99, jitter, miss rate
  • Coordination Metrics: Manager intervention rate, delegation efficiency
  • Condition Evaluation: success/failure conditions, timeout handling, milestones
  • Fault Injection: Effect-based declarative fault injection (TODO)
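The first two statistics can be illustrated with a standalone sketch, independent of the crate's actual aggregator API (function names here are illustrative): the unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) over n runs with c successes, and a normal-approximation 95% confidence interval over per-run scores.

```rust
/// Unbiased pass@k estimator: the probability that at least one of k
/// samples, drawn without replacement from n runs of which c succeeded,
/// is a success. Equals 1 - C(n-c, k) / C(n, k), computed in product
/// form for numerical stability.
fn pass_at_k(n: u64, c: u64, k: u64) -> f64 {
    assert!(c <= n && k >= 1 && k <= n);
    if n - c < k {
        return 1.0; // every size-k draw must contain at least one success
    }
    // 1 - prod_{i = n-c+1}^{n} (1 - k/i)
    let mut all_fail = 1.0;
    for i in (n - c + 1)..=n {
        all_fail *= 1.0 - k as f64 / i as f64;
    }
    1.0 - all_fail
}

/// Mean, sample standard deviation, and normal-approximation 95% CI
/// (mean ± 1.96 * sd / sqrt(n)) over per-run scores.
fn aggregate(xs: &[f64]) -> (f64, f64, (f64, f64)) {
    assert!(xs.len() >= 2);
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let sd = var.sqrt();
    let half = 1.96 * sd / n.sqrt();
    (mean, sd, (mean - half, mean + half))
}
```

The product form avoids the overflow that direct binomial-coefficient evaluation hits for large n; the normal-approximation CI is a reasonable default for the crate's stated "N-run statistical processing", though small N would call for a t-distribution instead.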

§Key Components

Module       Description
-----------  ------------------------------------------------------
runner       ScenarioRunner - Integrated evaluation framework
scenario     EvalScenario, Conditions, Milestone definitions
environment  EnvironmentRegistry - Evaluation environment factory
swarms       SwarmRegistry - Agent generation factory
metrics      Task, Coordination, Performance metrics
aggregator   Statistical aggregation (pass@k, confidence intervals)
reporter     EvalReport, JSON output

§Usage Examples

§Running Evaluation from TOML Scenario

use swarm_engine_eval::prelude::*;
use swarm_engine_eval::runner::ScenarioRunner;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumes a Tokio runtime; ScenarioRunner takes a clone of its handle
    let runtime = tokio::runtime::Runtime::new()?;

    // Load the scenario from a TOML file
    let content = std::fs::read_to_string("scenarios/simple_task.toml")?;
    let scenario: EvalScenario = toml::from_str(&content)?;

    // Execute the evaluation with ScenarioRunner
    let runner = ScenarioRunner::new(scenario, runtime.handle().clone())
        .with_runs(5)      // Run 5 times
        .with_seed(42);    // Fix the seed for reproducibility

    let report = runner.run()?;

    // Display results
    println!("Success rate: {:.1}%", report.aggregated.success_rate * 100.0);
    println!("Pass@1: {:.1}%", report.aggregated.pass_at_1 * 100.0);

    // Write the report as JSON
    report.to_json_file("report.json")?;
    Ok(())
}

§Programmatic Scenario Definition

use swarm_engine_eval::scenario::*;

let scenario = EvalScenario {
    meta: ScenarioMeta {
        name: "My Evaluation".to_string(),
        id: ScenarioId::new("my:eval:v1"),
        // ...
    },
    app_config: AppConfigTemplate::default(),
    agents: AgentsConfig {
        workers: vec![WorkerTemplate {
            id_pattern: "worker_{i}".to_string(),
            count: 4,
            role: "counter".to_string(),
            config: serde_json::json!({"total_tasks": 10}),
        }],
        managers: vec![],
    },
    conditions: EvalConditions {
        success: vec![Condition::new("done", "task.completed_count", CompareOp::Gte, 40)],
        failure: vec![],
        on_timeout: TimeoutBehavior::Fail,
    },
    // ...
};

§Future Extensions

  • SwarmRegistry: Manager support
  • EnvironmentRegistry: task_queue, shared_workspace, etc.
  • Comparator: Multi-scenario/configuration comparison
  • FaultInjector: Fault injection framework

Re-exports§

pub use error::EvalError;
pub use error::Result;

Modules§

aggregator
Aggregator - Statistical calculations
config
Evaluation configuration
environment
Environment Registry - Evaluation environment factory
environments
Eval environment module
error
Error types for swarm-engine-eval
metrics
Metrics structures
prelude
Prelude - commonly used types for convenient import
reporter
Report generation
run
EvalRun - Single evaluation run result
runner
EvalRunner - Evaluation execution using LearnableSwarm
runtime
Runtime Task Specification
scenario
Eval scenario management
validation
Scenario Validation