adk-bench
A comprehensive benchmarking framework for ADK-Rust that measures framework-level runtime performance using real LLM APIs and supports cross-framework comparison.
Results
Framework Comparison: simple_tool_call
All frameworks execute the same workload (weather tool call) against gemini-2.5-flash with identical prompts and deterministic config (temperature=0).
| Framework | Cold Start | Agent Loop Overhead (mean) | Agent Loop Overhead (P95) | Peak RSS |
|---|---|---|---|---|
| ADK-Rust | 109 ms | 568 μs | 615 μs | ~15 MB |
| Gemini Python SDK | 501 ms | 253 μs | 334 μs | 69.7 MB |
| LangGraph | 502 ms | 1,228 ms | 1,228 ms | 92.7 MB |
Full ADK-Rust Suite (3 runs, 1 warmup)
| Workload | Cold Start (mean) | Loop Overhead (mean) | Loop Overhead (P95) | CV |
|---|---|---|---|---|
| simple_tool_call | 117 ms | 368 μs | 475 μs | 20.6% |
| multi_step_reasoning | 129 ms | 38.6 ms | 56.4 ms | 41.3% |
| parallel_tool_invocation | 117 ms | 159.6 ms | 215.5 ms | 29.2% |
Note: Multi-step and parallel workloads include simulated tool latency (10-25ms per tool call) in the overhead measurement. The simple_tool_call workload best isolates pure framework overhead.
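The mean, P95, and CV columns above can be derived from raw per-iteration overhead samples. The following is a minimal sketch, assuming CV means sample standard deviation divided by the mean and that P95 uses the nearest-rank method; the estimators adk-bench actually uses are not stated here, and `summarize` is a hypothetical name.

```python
import math
import statistics


def summarize(samples_us):
    """Return mean, nearest-rank P95, and CV (percent) for overhead samples."""
    if len(samples_us) < 2:
        raise ValueError("need at least two samples to compute a CV")
    ordered = sorted(samples_us)
    mean = statistics.fmean(ordered)
    # Nearest-rank percentile: smallest sample with >= 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    p95 = ordered[rank - 1]
    # Coefficient of variation: sample standard deviation relative to the mean.
    cv_pct = statistics.stdev(ordered) / mean * 100.0
    return {"mean": mean, "p95": p95, "cv_pct": cv_pct}


# Illustrative inputs only, not measured data.
stats = summarize([300.0, 400.0, 500.0])
print(stats)
```

With only three measured runs, the nearest-rank P95 is simply the largest sample, which is one reason a small `--runs` value produces noisy tail figures.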
Key Takeaways
- 4.6× faster cold start — Rust binary startup vs Python interpreter (109ms vs 501ms)
- Sub-millisecond framework overhead — ADK-Rust adds ~568μs per agent turn beyond LLM latency
- 4–6× lower memory — ~15MB RSS vs 70-93MB for Python frameworks
- Deterministic measurement — temperature=0, fixed seed, structured output for reproducibility
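The ratios quoted above follow directly from the Framework Comparison table. As a quick arithmetic check (treating the approximate ~15 MB figure as exactly 15.0; `ratio_vs_adk` is a hypothetical helper):

```python
# Values copied from the Framework Comparison table.
cold_start_ms = {"ADK-Rust": 109, "Gemini Python SDK": 501, "LangGraph": 502}
peak_rss_mb = {"ADK-Rust": 15.0, "Gemini Python SDK": 69.7, "LangGraph": 92.7}


def ratio_vs_adk(table):
    """Return each framework's value divided by the ADK-Rust value."""
    base = table["ADK-Rust"]
    return {name: value / base for name, value in table.items() if name != "ADK-Rust"}


cold_ratios = ratio_vs_adk(cold_start_ms)
rss_ratios = ratio_vs_adk(peak_rss_mb)
for name in cold_ratios:
    print(f"{name}: cold start {cold_ratios[name]:.1f}x, peak RSS {rss_ratios[name]:.1f}x")
```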
How It Works
┌─────────────────────────────────────────────────────────────────┐
│                         cargo adk bench                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  BenchRunner                                                    │
│  ├── Warm-up phase (iterations discarded)                       │
│  ├── Measurement phase                                          │
│  │   ├── InstrumentedLlm → Real API (temp=0, seed=42)           │
│  │   ├── Record: request_sent → response_complete               │
│  │   └── Overhead = total_turn - llm_round_trip                 │
│  ├── Concurrency sweep (1, 2, 4, 8, 16, 32, 64)                 │
│  ├── Memory sampling (RSS via platform APIs)                    │
│  └── Regression detection (baseline compare)                    │
│                                                                 │
│  ExternalRunner (EBP Protocol)                                  │
│  ├── Spawn subprocess with BENCH_START_EPOCH_NS                 │
│  ├── Pass workload JSON as last arg                             │
│  ├── Parse EBP JSON from stdout                                 │
│  └── Compute cold_start from external clock                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
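The ExternalRunner steps in the diagram can be sketched from the runner's side as follows. This is an illustration under stated assumptions, not the crate's implementation: the real runner is written in Rust, `run_external` is a hypothetical name, the output field `first_llm_call_epoch_ns` is a placeholder for whatever the EBP schema actually calls it, and `time.time_ns()` (wall-clock epoch time) stands in for the clock the runner really uses.

```python
import json
import os
import subprocess
import sys
import time


def run_external(command, workload_path, timeout_s=300):
    """Spawn a harness, hand it the start timestamp and workload, parse its JSON."""
    start_ns = time.time_ns()  # assumption: wall-clock epoch nanoseconds
    env = dict(os.environ, BENCH_START_EPOCH_NS=str(start_ns))
    proc = subprocess.run(
        [*command, workload_path],  # workload path is the last CLI argument
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    result = json.loads(proc.stdout)  # exactly one JSON object expected on stdout
    # Cold start is computed by the runner, from the clock it handed to the child.
    # "first_llm_call_epoch_ns" is a hypothetical field name.
    result["cold_start_ms"] = (result["first_llm_call_epoch_ns"] - start_ns) / 1e6
    return result
```

Computing the cold start on the runner's side, instead of trusting a duration the child reports, keeps interpreter startup and import time inside the measurement.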
Metrics Collected
| Metric | Description |
|---|---|
| Cold Start | Process launch → first LLM API call |
| Agent Loop Overhead | Per-turn framework processing time (total turn minus LLM round-trip) |
| Concurrent Throughput | Agents completed per second at N concurrency |
| Memory Footprint | Peak RSS via /proc/self/statm (Linux) or mach_task_basic_info (macOS) |
| Tool Invocation Latency | Deserialization + schema validation + execution dispatch |
| Token Overhead | Framework-injected tokens beyond user content |
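The Agent Loop Overhead definition (total turn minus LLM round-trip) can be illustrated with a small timing wrapper. This is a sketch of the idea only: `InstrumentedCall`, `measure_turn`, `fake_llm`, and `fake_agent` are hypothetical names, and the real InstrumentedLlm is a Rust type whose interface is not shown in this README.

```python
import time


class InstrumentedCall:
    """Wraps an LLM call and accumulates the wall-clock time spent inside it."""

    def __init__(self, llm_call):
        self._llm_call = llm_call
        self.llm_ns = 0

    def __call__(self, prompt):
        sent = time.perf_counter_ns()  # request_sent
        try:
            return self._llm_call(prompt)
        finally:
            self.llm_ns += time.perf_counter_ns() - sent  # response_complete


def measure_turn(agent_turn, llm_call, prompt):
    """Run one agent turn; return (result, overhead_ns) with LLM time subtracted."""
    instrumented = InstrumentedCall(llm_call)
    turn_start = time.perf_counter_ns()
    result = agent_turn(instrumented, prompt)
    total_ns = time.perf_counter_ns() - turn_start
    return result, total_ns - instrumented.llm_ns


def fake_llm(prompt):
    time.sleep(0.02)  # stand-in for network latency
    return prompt.upper()


def fake_agent(llm, prompt):
    return llm(prompt) + "!"  # stand-in for framework work around the call


result, overhead_ns = measure_turn(fake_agent, fake_llm, "hi")
print(result, f"{overhead_ns / 1000:.1f} us of overhead")
```

Because the LLM interval is nested inside the turn interval and both use the same monotonic clock, the subtraction cannot go negative.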
Usage
Basic Run
- Run all built-in workloads with default settings: cargo adk bench
- Run a single workload at minimal cost: select it with --workload and lower --runs and --warmup
Cost Control
- See the estimated cost without making API calls: --dry-run
- Set a hard cost limit: --max-cost-usd
- Minimal run for quick validation: combine --workload with small --runs and --warmup values
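How the three cost flags interact can be sketched as a small decision function. This is a guess at the control flow based on the flag descriptions in the CLI Reference, not the actual implementation; the function names, the order of the checks, and the per-million-token price parameters are assumptions, and only the $1.00 confirmation threshold comes from this README.

```python
from enum import Enum


class Decision(Enum):
    DRY_RUN = "print estimate and exit"
    ABORT = "estimate exceeds --max-cost-usd"
    NEEDS_CONFIRMATION = "estimate above confirmation threshold"
    PROCEED = "run benchmark"


def estimate_cost_usd(calls, tokens_in_per_call, tokens_out_per_call,
                      usd_per_mtok_in, usd_per_mtok_out):
    """Estimated spend: token counts times per-million-token prices (caller-supplied)."""
    per_call = (tokens_in_per_call * usd_per_mtok_in
                + tokens_out_per_call * usd_per_mtok_out) / 1_000_000
    return calls * per_call


def cost_gate(estimate_usd, dry_run=False, max_cost_usd=None, confirm_cost=False,
              confirm_threshold_usd=1.00):
    """Decide what to do before any API call is made."""
    if dry_run:
        return Decision.DRY_RUN
    if max_cost_usd is not None and estimate_usd > max_cost_usd:
        return Decision.ABORT
    if estimate_usd > confirm_threshold_usd and not confirm_cost:
        return Decision.NEEDS_CONFIRMATION
    return Decision.PROCEED
```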
Framework Comparison
- Compare ADK-Rust against Python frameworks: pass the harness config with --external-config (for example harnesses/external-frameworks.json)
Concurrency Sweep
- Test throughput at multiple concurrency levels: --sweep
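Conceptually, a sweep runs N agents concurrently at each level and reports agents completed per second. The sketch below assumes one batch of N simulated agents per level; `fake_agent_run`, `throughput_at`, and `sweep` are hypothetical names, and the real sweep may keep N agents in flight over many runs.

```python
import asyncio
import time

SWEEP_LEVELS = (1, 2, 4, 8, 16, 32, 64)


async def fake_agent_run():
    """Stand-in for one agent execution; a real run awaits LLM and tool calls."""
    await asyncio.sleep(0.005)


async def throughput_at(level, agent_run):
    """Run `level` agents concurrently and return agents completed per second."""
    start = time.perf_counter()
    await asyncio.gather(*(agent_run() for _ in range(level)))
    elapsed = time.perf_counter() - start
    return level / elapsed


async def sweep(agent_run, levels=SWEEP_LEVELS):
    """Measure throughput at each concurrency level, one level at a time."""
    return {level: await throughput_at(level, agent_run) for level in levels}


results = asyncio.run(sweep(fake_agent_run))
for level, rate in results.items():
    print(f"concurrency={level:>2}  throughput={rate:.1f} agents/s")
```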
Regression Detection (CI)
- Save a baseline: --save-baseline
- Check for regressions (exit code 2 if regressed): --check-regression, with --tolerance setting the allowed degradation
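The regression check can be pictured as a relative comparison against the saved baseline. This is a sketch under the assumption that every compared metric is lower-is-better and that the tolerance is a relative increase; `find_regressions`, `exit_code`, and the metric names are hypothetical. The default tolerance of 0.10 and exit code 2 are the values documented in this README.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics whose current value exceeds the baseline by more than `tolerance`."""
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue  # nothing to compare against
        change = (current[name] - base) / base
        if change > tolerance:
            regressions[name] = change
    return regressions


def exit_code(regressions):
    """Exit code 2 signals a regression to CI; 0 means within tolerance."""
    return 2 if regressions else 0


# Illustrative numbers only.
baseline = {"cold_start_ms": 100.0, "loop_overhead_us": 400.0}
current = {"cold_start_ms": 105.0, "loop_overhead_us": 480.0}
regressed = find_regressions(baseline, current)
print(regressed, exit_code(regressed))
```

Given the CV values reported above, a 10% tolerance on loop overhead is likely to trip on noise alone unless `--runs` is increased.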
Output Formats
- JSON (machine-readable, all raw metrics): --format json
- Markdown (README-ready comparison table): --format markdown
- Table (terminal display, default): --format table
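The three formats differ only in presentation. Rendering the same rows three ways might look like the sketch below; the function names and row structure are hypothetical, and the crate's own formatter lives in formatter.rs.

```python
import json


def to_markdown(rows, columns):
    """Render rows as a pipe table in the style used by the tables in this README."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "---|" * len(columns)
    body = ["| " + " | ".join(str(row[col]) for col in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def to_table(rows, columns):
    """Render rows as fixed-width columns for terminal display."""
    widths = [max([len(col)] + [len(str(row[col])) for row in rows]) for col in columns]
    lines = ["  ".join(col.ljust(w) for col, w in zip(columns, widths))]
    for row in rows:
        lines.append("  ".join(str(row[col]).ljust(w) for col, w in zip(columns, widths)))
    return "\n".join(lines)


def render(rows, columns, fmt="table"):
    """Dispatch on the output format name."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "markdown":
        return to_markdown(rows, columns)
    if fmt == "table":
        return to_table(rows, columns)
    raise ValueError(f"unknown format: {fmt}")


rows = [{"Workload": "simple_tool_call", "Cold Start": "117 ms"}]
print(render(rows, ["Workload", "Cold Start"], "markdown"))
```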
CLI Reference
| Flag | Default | Description |
|---|---|---|
| --model | gemini-2.5-flash | LLM model identifier |
| --runs | 5 | Measurement iterations per workload |
| --warmup | 3 | Warm-up iterations (discarded) |
| --concurrency | 1 | Agent concurrency level |
| --workload | all built-in | Specific workload name or file path |
| --format | table | Output format: table, json, markdown |
| --output | stdout | Write results to file |
| --sweep | off | Test concurrency levels 1,2,4,8,16,32,64 |
| --save-baseline | off | Persist results for regression comparison |
| --check-regression | off | Compare against saved baseline |
| --tolerance | 0.10 | Max allowed degradation (10%) |
| --dry-run | off | Show estimated cost, execute nothing |
| --max-cost-usd | none | Abort if estimated cost exceeds limit |
| --confirm-cost | off | Auto-confirm when cost > $1.00 |
| --external-config | none | Path to external framework config JSON |
| --external-timeout | 300 | Timeout (seconds) for external runners |
| --suite | none | Task quality suite: tau2 or bfcl |
| --experimental | off | Enable experimental workloads |
Built-in Workloads
| Workload | Description | Expected Turns | Tools |
|---|---|---|---|
| simple_tool_call | Single tool invocation (weather lookup) | 2 | 1 |
| multi_step_reasoning | Sequential tool chain (search → details → shipping) | 4 | 3 |
| parallel_tool_invocation | Concurrent tool calls (stock price + news + rating) | 2 | 3 |
| multi_agent_delegation* | Coordinator delegates to specialist agents | 5 | 2 |
*Requires --experimental flag.
External Benchmark Protocol (EBP)
Competitor frameworks are benchmarked via subprocess. Each harness receives:
- BENCH_START_EPOCH_NS environment variable (monotonic nanoseconds at spawn)
- Workload JSON file path as the last CLI argument
Each harness must output exactly one JSON object on stdout; the reference harnesses in harnesses/ show the expected fields.
Writing a Harness
See harnesses/ for reference implementations:
- bench_gemini_sdk.py: Raw Google Gemini Python SDK (no framework)
- bench_langgraph.py: LangGraph ReAct agent
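For a new framework, a harness that follows the EBP contract could start from a skeleton like the one below. This is a sketch, not a copy of the reference harnesses: the output keys (`workload`, `start_epoch_ns`, `first_llm_call_epoch_ns`) and the workload's `name` key are placeholders for the real schema, and `time.time_ns()` is assumed to be comparable with the clock behind BENCH_START_EPOCH_NS.

```python
import json
import sys
import time


def load_inputs(argv, environ):
    """Read the two inputs every harness receives."""
    start_ns = int(environ["BENCH_START_EPOCH_NS"])
    with open(argv[-1], encoding="utf-8") as fh:  # workload path is the last argument
        workload = json.load(fh)
    return start_ns, workload


def run_workload(workload):
    """Placeholder for the framework under test; returns the first-call timestamp."""
    first_call_ns = time.time_ns()  # record immediately before the first LLM request
    # ... invoke the framework's agent with the workload's prompt and tools here ...
    return first_call_ns


def main(argv, environ, out=sys.stdout):
    start_ns, workload = load_inputs(argv, environ)
    first_call_ns = run_workload(workload)
    report = {
        "workload": workload.get("name"),          # hypothetical key
        "start_epoch_ns": start_ns,                # hypothetical key
        "first_llm_call_epoch_ns": first_call_ns,  # hypothetical key
    }
    # Exactly one JSON object on stdout; diagnostics belong on stderr.
    out.write(json.dumps(report) + "\n")
    return report

# A real harness would end with: main(sys.argv, os.environ)
```

Keeping all logging on stderr matters here because any stray line on stdout would break the runner's JSON parse.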
Configure them in harnesses/external-frameworks.json, then pass that file with --external-config.
Architecture
adk-bench/
├── src/
│   ├── lib.rs               # Public exports
│   ├── config.rs            # BenchConfig, CLI flag mapping
│   ├── runner.rs            # BenchRunner orchestrator
│   ├── workload.rs          # Workload schema, built-in workloads
│   ├── metrics.rs           # DurationStats, BenchmarkResult, MetricCollector
│   ├── memory.rs            # Platform-specific RSS sampling
│   ├── instrumented_llm.rs  # InstrumentedLlm wrapper (temp=0, timing capture)
│   ├── external.rs          # ExternalRunner, EBP protocol
│   ├── formatter.rs         # JSON/table/markdown output
│   ├── error.rs             # BenchError enum
│   └── adapters/
│       ├── mod.rs           # TaskQualityAdapter trait
│       ├── tau2.rs          # τ²-bench adapter (feature: tau2)
│       └── bfcl.rs          # BFCL adapter (feature: bfcl)
├── harnesses/
│   ├── bench_gemini_sdk.py  # Gemini Python SDK EBP harness
│   ├── bench_langgraph.py   # LangGraph EBP harness
│   ├── external-frameworks.json
│   └── simple_tool_call.json  # Workload file for external harnesses
└── tests/
Feature Flags
| Feature | Description |
|---|---|
| tau2 | τ²-bench task quality adapter |
| bfcl | Berkeley Function Calling Leaderboard adapter |
Design Principles
- Real LLM calls: No mocks. Deterministic config (temperature=0, top_p=1.0, seed=42) for reproducibility.
- Overhead isolation: InstrumentedLlm captures per-call timing; framework overhead = total_turn - llm_round_trip.
- Apples-to-apples comparison: All frameworks get the same workload, same model, same BENCH_START_EPOCH_NS clock source.
- Cost awareness: --dry-run, --max-cost-usd, --confirm-cost prevent surprise bills.
- CI-ready: --check-regression exits with code 2 on regressions, integrates with any CI system.
- Platform-specific memory: Linux /proc/self/statm (authoritative), macOS mach_task_basic_info (informational).
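On Linux, the RSS figure comes from the second field of /proc/self/statm, which counts resident pages. Reading it can be sketched as follows; the function names are hypothetical, and the crate's own sampler lives in memory.rs.

```python
import os


def rss_bytes_from_statm(statm_text, page_size):
    """Parse /proc/self/statm content; the second field is resident pages."""
    fields = statm_text.split()
    return int(fields[1]) * page_size


def current_rss_bytes():
    """Return this process's current RSS on Linux, or None where /proc is unavailable."""
    try:
        with open("/proc/self/statm", encoding="ascii") as fh:
            text = fh.read()
    except OSError:
        return None
    return rss_bytes_from_statm(text, os.sysconf("SC_PAGE_SIZE"))


sample = current_rss_bytes()
print("RSS unavailable on this platform" if sample is None else f"{sample / 1e6:.1f} MB")
```

This reads the current value, so a peak figure requires sampling repeatedly during the run and keeping the maximum.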
Environment
Requires GOOGLE_API_KEY for Gemini models. Export it in your shell before running cargo adk bench.
For external Python harnesses, ensure each harness's dependencies are installed in the Python environment that runs it.
Measurement Notes
- Cold start is measured from process spawn (or BENCH_START_EPOCH_NS) to the first generate_content call timestamp.
- Agent loop overhead is computed per-turn by subtracting the LLM round-trip from total turn wall-clock time.
- CV > 20% warning is emitted when overhead measurements are unstable; increase --runs or reduce system load.
- Linux is authoritative for published cross-framework memory comparisons due to consistent /proc/self/statm RSS reporting.
- Results measured on Apple M-series, macOS, June 2026. Your numbers will vary by hardware and network.
License
Apache 2.0