miyabi-benchmark
World-standard benchmark evaluation framework for the Miyabi AI development platform.
Overview
miyabi-benchmark provides a comprehensive evaluation framework for benchmarking Miyabi's autonomous development capabilities against world-standard datasets. It supports parallel evaluation, detailed reporting, and integration with the Miyabi worktree system for isolated execution.
Supported Benchmarks:
- SWE-bench Pro (ScaleAI) - 731 software engineering task instances
- AgentBench (THUDM) - 8 agent capability environments
- HAL (Princeton) - Cost-efficient holistic evaluation across 9 benchmarks
- Galileo Agent Leaderboard v2 - Enterprise-grade evaluation for 5 industries
Key Capabilities:
- Dataset Management: Load, filter, and manage benchmark datasets
- Parallel Evaluation: Concurrent instance processing with configurable concurrency
- Isolated Execution: Git worktree-based sandboxing for each evaluation
- Timeout Management: Configurable timeout per instance (default: 30 min)
- Statistical Reporting: Success rate, duration, and performance metrics
- Patch Generation: Unified diff format for submission to official leaderboards
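As a rough sketch of the concurrency and timeout settings listed above (the struct and field names are assumptions, not the crate's actual API; the defaults match the architecture diagram further down):

```rust
use std::time::Duration;

// Hypothetical configuration mirroring the documented defaults
// (5 concurrent instances, 30-minute timeout); real field names may differ.
struct EvaluatorConfig {
    concurrency: usize,  // instances evaluated in parallel
    timeout: Duration,   // per-instance wall-clock limit
}

impl Default for EvaluatorConfig {
    fn default() -> Self {
        Self { concurrency: 5, timeout: Duration::from_secs(30 * 60) }
    }
}
```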
Features
SWE-bench Pro Support
- Dataset Loading: Load from JSON (HuggingFace format)
- Language Filtering: Filter by Python, JavaScript, TypeScript, Go, Rust, etc.
- Repository Filtering: Focus on specific repos (e.g., django/django)
- Patch Generation: Generate unified diffs for official evaluation
- Test Validation: Run test suites to verify fixes
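A minimal load-and-filter sketch, written against plain serde/serde_json rather than the crate's own loader; the `Instance` fields here are assumptions based on the filters listed above:

```rust
use serde::Deserialize;

// Hypothetical, trimmed-down record; the real HuggingFace-format schema has more fields.
#[derive(Deserialize)]
struct Instance {
    instance_id: String,
    repo: String,
    language: String,
}

fn main() -> anyhow::Result<()> {
    // Load a JSON array of instances and keep only Python tasks from django/django.
    // (The file name is illustrative.)
    let raw = std::fs::read_to_string("swe_bench_pro.json")?;
    let instances: Vec<Instance> = serde_json::from_str(&raw)?;
    let selected = instances
        .iter()
        .filter(|i| i.language == "python" && i.repo == "django/django")
        .count();
    println!("{selected} instances selected");
    Ok(())
}
```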
Evaluation Pipeline
- Setup: Create isolated worktree for each instance
- Execution: Run CoordinatorAgent to generate fix
- Patch: Generate unified diff patch
- Validation: Run tests to verify correctness
- Reporting: Collect metrics and generate report
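Sketched as a single per-instance function, where every helper is a placeholder standing in for the real worktree, agent, patch, and test APIs (which this README does not document):

```rust
use anyhow::Result;

// Placeholder stages: each function stands in for the real WorktreeManager,
// CoordinatorAgent, patch, and test-runner APIs, so only the pipeline shape is shown.
async fn setup_worktree(id: &str) -> Result<String> { Ok(format!(".worktrees/{id}")) }
async fn run_coordinator_agent(worktree: &str) -> Result<String> { Ok(format!("fix in {worktree}")) }
fn generate_unified_diff(_fix: &str) -> Result<String> { Ok("--- a/file\n+++ b/file\n".to_string()) }
async fn run_test_suite(_worktree: &str) -> Result<bool> { Ok(true) }

// One instance flows through the five stages in order.
async fn evaluate_instance(instance_id: &str) -> Result<bool> {
    let worktree = setup_worktree(instance_id).await?;   // 1. Setup
    let fix = run_coordinator_agent(&worktree).await?;   // 2. Execution
    let _patch = generate_unified_diff(&fix)?;           // 3. Patch
    let passed = run_test_suite(&worktree).await?;       // 4. Validation
    println!("{instance_id}: passed = {passed}");        // 5. Reporting
    Ok(passed)
}
```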
Performance Tracking
- Success Rate: Percentage of correctly fixed instances
- Timing: Min, max, average, and total duration
- Failure Analysis: Error categorization and debugging info
- Comparison: Benchmark against state-of-the-art agents
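A small sketch of how these statistics can be derived from per-instance results (the types shown are illustrative, not the crate's reporter API):

```rust
use std::time::Duration;

// Illustrative result record and summary; the real reporter types may differ.
struct InstanceResult { passed: bool, duration: Duration }

struct Summary { success_rate: f64, min: Duration, max: Duration, avg: Duration, total: Duration }

fn summarize(results: &[InstanceResult]) -> Summary {
    let passed = results.iter().filter(|r| r.passed).count();
    let total: Duration = results.iter().map(|r| r.duration).sum();
    Summary {
        success_rate: passed as f64 / results.len().max(1) as f64,
        min: results.iter().map(|r| r.duration).min().unwrap_or_default(),
        max: results.iter().map(|r| r.duration).max().unwrap_or_default(),
        avg: total / results.len().max(1) as u32,
        total,
    }
}
```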
Installation
Add to your `Cargo.toml`:
```toml
[dependencies]
miyabi-benchmark = "0.1.0"
```
Or install the CLI tool:
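The CLI is behind an optional feature (the `clap`/`indicatif` dependencies are feature-gated), so installation is presumably `cargo install miyabi-benchmark` plus the appropriate `--features` flag; check the crate's feature list for the exact name.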
Usage
As a Library
```rust
use miyabi_benchmark::{SWEBenchDataset, SWEBenchProEvaluator}; // import paths are illustrative
use anyhow::Result;

async fn run_benchmark() -> Result<()> {
    // Load the dataset, run the evaluator, and collect a report.
    Ok(())
}
```
As a CLI Tool
```bash
# Download SWE-bench Pro dataset (subcommand and flag names below are illustrative)
miyabi-benchmark download --benchmark swe-bench-pro
# Run evaluation on all instances
miyabi-benchmark evaluate --dataset swe_bench_pro.json
# Run with custom config
miyabi-benchmark evaluate --config benchmark.toml
# Filter by language
miyabi-benchmark evaluate --language python
# Filter by repository
miyabi-benchmark evaluate --repo django/django
# Generate report from existing results
miyabi-benchmark report --results results/
```
Benchmark Details
SWE-bench Pro (ScaleAI)
Dataset: 731 software engineering task instances from popular open-source projects
Format:
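The full record layout is not reproduced here; as an approximation, each JSON record follows the original SWE-bench shape and would deserialize into something like the struct below (field names are assumptions, not the authoritative Pro schema):

```rust
use serde::Deserialize;

// Approximate record layout; field names mirror the original SWE-bench
// schema and are assumptions rather than the official Pro schema.
#[derive(Debug, Deserialize)]
struct SWEBenchInstance {
    instance_id: String,        // unique task identifier
    repo: String,               // e.g. "django/django"
    base_commit: String,        // commit the fix is applied against
    problem_statement: String,  // issue text the agent must resolve
    patch: String,              // reference fix (unified diff)
    test_patch: String,         // tests that must pass after the fix
}
```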
Evaluation Metrics:
- Accuracy: Percentage of correctly fixed instances
- Pass@1: Success rate on first attempt
- Avg. Duration: Average time per instance
- Token Efficiency: Tokens used per successful fix
AgentBench (THUDM)
Dataset: 8 environments covering diverse agent capabilities
Environments:
- OS Interaction: Shell commands, file operations
- Database Queries: SQL generation and execution
- Knowledge Graph: Entity/relation reasoning
- Digital Card Game: Multi-step planning
- Lateral Thinking: Creative problem-solving
- House-Holding: Common-sense reasoning
- Web Shopping: Web interaction and decision-making
- Web Browsing: Information retrieval
HAL (Princeton)
Dataset: Cost-efficient holistic evaluation across 9 benchmarks
Benchmarks:
- MMLU, GSM8K, HumanEval, MATH, DROP, HellaSwag, ARC, TruthfulQA, BigBench-Hard
Focus: Optimize for cost-per-token while maintaining accuracy
Galileo Agent Leaderboard v2
Dataset: Enterprise-grade evaluation for 5 industries
Industries:
- Finance, Healthcare, Legal, E-commerce, Manufacturing
Metrics: Accuracy, Latency, Cost, Safety, Compliance
Architecture
```
┌────────────────────────────┐
│ SWEBenchDataset            │  ← Load & Filter
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ SWEBenchProEvaluator       │  ← Parallel Eval
│  - Concurrency: 5          │
│  - Timeout: 30 min         │
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ WorktreeManager            │  ← Isolated Execution
│  - Per-instance sandbox    │
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ CoordinatorAgent           │  ← Generate Fix
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ Patch Generation           │  ← Unified Diff
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ Test Validation            │  ← Run Tests
└────────────────────────────┘
              ↓
┌────────────────────────────┐
│ EvaluationReporter         │  ← Generate Report
└────────────────────────────┘
```
Example Results
Testing
```bash
# Run all tests
cargo test
# Run evaluator tests (test name filters are illustrative)
cargo test evaluator
# Run dataset tests
cargo test dataset
# Integration tests (requires dataset; test target name is illustrative)
cargo test --test integration
```
Dependencies
- Core: `miyabi-types`, `miyabi-core`, `miyabi-agents`, `miyabi-worktree`
- Runtime: `tokio`, `async-trait`
- Serialization: `serde`, `serde_json`
- HTTP: `reqwest` (for the HuggingFace API)
- CLI: `clap`, `indicatif` (optional, feature-gated)
- Utilities: `anyhow`, `thiserror`, `chrono`, `tracing`
Related Crates
- `miyabi-agents` - Agent implementations for evaluation
- `miyabi-worktree` - Isolated execution environment
- `miyabi-types` - Shared type definitions
- `miyabi-core` - Core utilities
Official Leaderboards
Submit your results to official leaderboards:
- SWE-bench Pro: https://www.swebench.com/leaderboard
- AgentBench: https://llmbench.ai/agentbench
- HAL: https://hal-leaderboard.com/
- Galileo: https://www.galileo-leaderboard.com/
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
Licensed under the MIT License. See LICENSE for details.
Version History
- v0.1.0 (2025-10-25): Initial release
- SWE-bench Pro dataset loading and evaluation
- Parallel evaluation with configurable concurrency
- Worktree-based isolated execution
- Detailed reporting and statistics
- AgentBench, HAL, Galileo support (planned)
Part of the Miyabi Framework - Autonomous AI Development Platform