# swarm-engine-eval

Scenario-based evaluation framework for SwarmEngine agent swarms.
## Usage

### Running from CLI (Recommended)
```sh
# From project root

# Example: Troubleshooting scenario

# With options
```
### CLI Options
| Option | Description |
|---|---|
| `-n, --runs <N>` | Number of evaluation runs (default: 1) |
| `-s, --seed <SEED>` | Random seed (default: 42) |
| `-o, --output <FILE>` | JSON report output file |
| `-v, --verbose` | Verbose output with tick snapshots |
| `--learning` | Enable learning data collection |
| `--variant <NAME>` | Select scenario variant |
| `--list-variants` | List available variants |
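The option surface maps onto a conventional argument parser. As a rough illustration only, here is a Python `argparse` sketch of the same interface; the program name, the positional scenario argument, and the exact defaults are assumptions, not the project's actual implementation:

```python
import argparse

# Illustrative mirror of the eval CLI's options; the real implementation
# lives in the swarm-engine-eval binary and may differ in detail.
parser = argparse.ArgumentParser(prog="swarm-engine-eval")
parser.add_argument("scenario", help="Path to a scenario .toml file (assumed positional)")
parser.add_argument("-n", "--runs", type=int, default=1, help="Number of evaluation runs")
parser.add_argument("-s", "--seed", type=int, default=42, help="Random seed")
parser.add_argument("-o", "--output", metavar="FILE", help="JSON report output file")
parser.add_argument("-v", "--verbose", action="store_true", help="Verbose output with tick snapshots")
parser.add_argument("--learning", action="store_true", help="Enable learning data collection")
parser.add_argument("--variant", metavar="NAME", help="Select scenario variant")
parser.add_argument("--list-variants", action="store_true", help="List available variants")

args = parser.parse_args(["scenarios/troubleshooting.toml", "-n", "3", "--variant", "fast"])
print(args.runs, args.seed, args.variant)  # 3 42 fast
```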
## Scenarios

### Built-in Scenarios

Located in the `scenarios/` directory:
| Scenario | Description |
|---|---|
| `troubleshooting.toml` | Service diagnosis and recovery |
| `code_exploration.toml` | Codebase exploration |
| `search.toml` | Search tasks |
| `internal_diagnosis.toml` | Internal system diagnosis |
### Scenario Format
Scenarios are plain TOML files. The values in the example below come from the built-in troubleshooting scenario, but the section and key names shown are illustrative; consult the scenario schema for the authoritative names:

```toml
[scenario]
name = "Service Troubleshooting"
id = "user:troubleshooting:v2"
version = "2.0.0"
description = "Diagnose and fix a service outage"
tags = ["troubleshooting", "diagnosis", "ops"]

[task]
instruction = "Diagnose the failing service and restart it"
success_criteria = "Worker successfully restarts the problematic service"

[environment]
failing_service = "user-service"
failure_count = 1

[llm]
provider = "llama-server"
model = "LFM2.5-1.2B"
endpoint = "http://localhost:8080"
temperature = 0.1
timeout_ms = 30000
max_tokens = 512

[swarm]
learning_enabled = false
num_workers = 5
allow_retries = true
exploration_rate = 0.3

[[tools]]
name = "CheckStatus"
description = "Check the status of services"

[[tools.parameters]]
name = "service"
description = "Optional: specific service name to check"
required = false

[[tools]]
name = "ReadLogs"
description = "Read logs for a specific service"

[[tools]]
name = "Diagnose"
description = "Diagnose the root cause of issues"

[[tools]]
name = "Restart"
description = "Restart a service"
effect = "node_state_change"

[limits]
max_agents = 10
max_ticks = 150
```
### Scenario Variants

Scenarios can define variants for different configurations:

```sh
# List variants

# Run with variant
```
## Environment Types
| Type | Description |
|---|---|
| `troubleshooting` | Service troubleshooting simulation |
| `codebase` | File operation environment (Read/Write/Grep/Glob) |
| `none` | Empty environment (for testing) |
## Learning Integration

The eval system integrates with the offline learning system:
```sh
# 1. Collect learning data

# 2. Run offline learning

# 3. Next eval will use learned parameters
```
## Assertions

Scenarios can define assertions for pass/fail criteria:
```toml
[[assertions]]
name = "minimum_success_rate"
metric = "success_rate"
op = "gte"
value = 0.5

[[assertions]]
name = "max_ticks_limit"
metric = "total_ticks"
op = "lte"
value = 100
```
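The comparison operators reduce to simple numeric checks against run metrics. A sketch of how such assertions might be evaluated; the checker itself is hypothetical, and the field names (`metric`, `op`, `value`) plus the extra operators beyond `gte`/`lte` are assumptions:

```python
import operator

# Map assertion operator names to comparison functions. "gte" and "lte"
# appear in the scenario examples; "eq", "gt", "lt" are plausible
# additions, not confirmed by the project.
OPS = {
    "gte": operator.ge,
    "lte": operator.le,
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}

def check_assertion(assertion: dict, metrics: dict) -> bool:
    """Return True when the named metric satisfies the assertion."""
    actual = metrics[assertion["metric"]]
    return OPS[assertion["op"]](actual, assertion["value"])

metrics = {"success_rate": 0.75, "total_ticks": 84}
a1 = {"name": "minimum_success_rate", "metric": "success_rate", "op": "gte", "value": 0.5}
a2 = {"name": "max_ticks_limit", "metric": "total_ticks", "op": "lte", "value": 100}
print(check_assertion(a1, metrics), check_assertion(a2, metrics))  # True True
```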
## Output

Eval produces:

- Console output with progress and results
- JSON report (with the `-o` option)
- Learning data (with the `--learning` option)
- Tick snapshots in verbose mode (with the `-v` option)
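A JSON report written with `-o` can be post-processed with standard tooling. A sketch of computing an aggregate from per-run results; the report field names here (`scenario`, `runs`, `success`, `total_ticks`) are assumptions, not the documented report schema:

```python
import json

# Hypothetical report content; field names are assumptions for
# illustration only, not the actual report schema.
report_text = json.dumps({
    "scenario": "Service Troubleshooting",
    "runs": [
        {"success": True, "total_ticks": 62},
        {"success": False, "total_ticks": 150},
    ],
})

report = json.loads(report_text)
success_rate = sum(r["success"] for r in report["runs"]) / len(report["runs"])
print(f"{report['scenario']}: success_rate={success_rate}")  # success_rate=0.5
```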