adk-bench 1.0.0

Benchmarking framework for ADK-Rust agent performance measurement
Documentation

adk-bench

A comprehensive benchmarking framework for ADK-Rust that measures framework-level runtime performance using real LLM APIs and supports cross-framework comparison.

Results

Framework Comparison: simple_tool_call

All frameworks execute the same workload (weather tool call) against gemini-2.5-flash with identical prompts and deterministic config (temperature=0).

Framework Cold Start Agent Loop Overhead (mean) Agent Loop Overhead (P95) Peak RSS
ADK-Rust 109 ms 568 μs 615 μs ~15 MB
Gemini Python SDK 501 ms 253 μs 334 μs 69.7 MB
LangGraph 502 ms 1,228 ms 1,228 ms 92.7 MB

Full ADK-Rust Suite (3 runs, 1 warmup)

Workload Cold Start (mean) Loop Overhead (mean) Loop Overhead (P95) CV
simple_tool_call 117 ms 368 μs 475 μs 20.6%
multi_step_reasoning 129 ms 38.6 ms 56.4 ms 41.3%
parallel_tool_invocation 117 ms 159.6 ms 215.5 ms 29.2%

Note: Multi-step and parallel workloads include simulated tool latency (10-25ms per tool call) in the overhead measurement. The simple_tool_call workload best isolates pure framework overhead.

Key Takeaways

  • 4.6× faster cold start — Rust binary startup vs Python interpreter (109ms vs 501ms)
  • Sub-millisecond framework overhead — ADK-Rust adds ~568μs per agent turn beyond LLM latency
  • 4–6× lower memory — ~15MB RSS vs 70-93MB for Python frameworks
  • Deterministic measurement — temperature=0, fixed seed, structured output for reproducibility

How It Works

┌─────────────────────────────────────────────────────────────────┐
│ cargo adk bench                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  BenchRunner                                                      │
│  ├── Warm-up phase (iterations discarded)                        │
│  ├── Measurement phase                                           │
│  │   ├── InstrumentedLlm → Real API (temp=0, seed=42)          │
│  │   ├── Record: request_sent → response_complete                │
│  │   └── Overhead = total_turn - llm_round_trip                  │
│  ├── Concurrency sweep (1, 2, 4, 8, 16, 32, 64)                │
│  ├── Memory sampling (RSS via platform APIs)                     │
│  └── Regression detection (baseline compare)                     │
│                                                                   │
│  ExternalRunner (EBP Protocol)                                    │
│  ├── Spawn subprocess with BENCH_START_EPOCH_NS                  │
│  ├── Pass workload JSON as last arg                              │
│  ├── Parse EBP JSON from stdout                                  │
│  └── Compute cold_start from external clock                      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘

Metrics Collected

Metric Description
Cold Start Process launch → first LLM API call
Agent Loop Overhead Per-turn framework processing time (total turn minus LLM round-trip)
Concurrent Throughput Agents completed per second at N concurrency
Memory Footprint Peak RSS via /proc/self/statm (Linux) or mach_task_basic_info (macOS)
Tool Invocation Latency Deserialization + schema validation + execution dispatch
Token Overhead Framework-injected tokens beyond user content

Usage

Basic Run

# Run all built-in workloads with default settings
cargo adk bench --confirm-cost

# Single workload, minimal cost
cargo adk bench --workload simple_tool_call --runs 2 --warmup 1 --confirm-cost

Cost Control

# See estimated cost without making API calls
cargo adk bench --dry-run

# Set a hard cost limit
cargo adk bench --max-cost-usd 5.00 --confirm-cost

# Minimal run for quick validation
cargo adk bench --workload simple_tool_call --runs 1 --warmup 0 --confirm-cost

Framework Comparison

# Compare ADK-Rust against Python frameworks
cargo adk bench --workload simple_tool_call --runs 3 \
  --external-config adk-bench/harnesses/external-frameworks.json \
  --format markdown --confirm-cost

Concurrency Sweep

# Test throughput at multiple concurrency levels
cargo adk bench --workload simple_tool_call --sweep --runs 3 --confirm-cost

Regression Detection (CI)

# Save a baseline
cargo adk bench --save-baseline --confirm-cost

# Check for regressions (exit code 2 if regressed)
cargo adk bench --check-regression --tolerance 0.10 --confirm-cost

Output Formats

# JSON (machine-readable, all raw metrics)
cargo adk bench --format json --output results.json --confirm-cost

# Markdown (README-ready comparison table)
cargo adk bench --format markdown --confirm-cost

# Table (terminal display, default)
cargo adk bench --format table --confirm-cost

CLI Reference

Flag Default Description
--model gemini-2.5-flash LLM model identifier
--runs 5 Measurement iterations per workload
--warmup 3 Warm-up iterations (discarded)
--concurrency 1 Agent concurrency level
--workload all built-in Specific workload name or file path
--format table Output format: table, json, markdown
--output stdout Write results to file
--sweep off Test concurrency levels 1,2,4,8,16,32,64
--save-baseline off Persist results for regression comparison
--check-regression off Compare against saved baseline
--tolerance 0.10 Max allowed degradation (10%)
--dry-run off Show estimated cost, execute nothing
--max-cost-usd none Abort if estimated cost exceeds limit
--confirm-cost off Auto-confirm when cost > $1.00
--external-config none Path to external framework config JSON
--external-timeout 300 Timeout (seconds) for external runners
--suite none Task quality suite: tau2 or bfcl
--experimental off Enable experimental workloads

Built-in Workloads

Workload Description Expected Turns Tools
simple_tool_call Single tool invocation (weather lookup) 2 1
multi_step_reasoning Sequential tool chain (search → details → shipping) 4 3
parallel_tool_invocation Concurrent tool calls (stock price + news + rating) 2 3
multi_agent_delegation* Coordinator delegates to specialist agents 5 2

*Requires --experimental flag.

External Benchmark Protocol (EBP)

Competitor frameworks are benchmarked via subprocess. Each harness receives:

  • BENCH_START_EPOCH_NS environment variable (monotonic nanoseconds at spawn)
  • Workload JSON file path as the last CLI argument

And must output exactly one JSON object on stdout:

{
  "framework": "langgraph",
  "cold_start_us": 45000,
  "first_llm_call_epoch_ns": 1705312800000045000,
  "loop_overhead": {
    "min_us": 120,
    "max_us": 890,
    "mean_us": 340,
    "median_us": 310,
    "p95_us": 780,
    "p99_us": 870,
    "count": 10
  },
  "throughput_agents_per_sec": 12.5,
  "peak_rss_bytes": 52428800,
  "token_overhead": {
    "total_tokens": 1200,
    "user_content_tokens": 950,
    "overhead_tokens": 250
  }
}

Writing a Harness

See harnesses/ for reference implementations:

  • bench_gemini_sdk.py — Raw Google Gemini Python SDK (no framework)
  • bench_langgraph.py — LangGraph ReAct agent

Configure them in harnesses/external-frameworks.json:

{
  "frameworks": [
    {
      "name": "my-framework",
      "command": "python3",
      "args": ["/path/to/my_harness.py"],
      "env": [],
      "workingDir": null
    }
  ]
}

Architecture

adk-bench/
├── src/
│   ├── lib.rs                  # Public exports
│   ├── config.rs               # BenchConfig, CLI flag mapping
│   ├── runner.rs               # BenchRunner orchestrator
│   ├── workload.rs             # Workload schema, built-in workloads
│   ├── metrics.rs              # DurationStats, BenchmarkResult, MetricCollector
│   ├── memory.rs               # Platform-specific RSS sampling
│   ├── instrumented_llm.rs     # InstrumentedLlm wrapper (temp=0, timing capture)
│   ├── external.rs             # ExternalRunner, EBP protocol
│   ├── formatter.rs            # JSON/table/markdown output
│   ├── error.rs                # BenchError enum
│   └── adapters/
│       ├── mod.rs              # TaskQualityAdapter trait
│       ├── tau2.rs             # τ²-bench adapter (feature: tau2)
│       └── bfcl.rs             # BFCL adapter (feature: bfcl)
├── harnesses/
│   ├── bench_gemini_sdk.py     # Gemini Python SDK EBP harness
│   ├── bench_langgraph.py      # LangGraph EBP harness
│   ├── external-frameworks.json
│   └── simple_tool_call.json   # Workload file for external harnesses
└── tests/

Feature Flags

Feature Description
tau2 τ²-bench task quality adapter
bfcl Berkeley Function Calling Leaderboard adapter

Design Principles

  • Real LLM calls — No mocks. Deterministic config (temperature=0, top_p=1.0, seed=42) for reproducibility.
  • Overhead isolationInstrumentedLlm captures per-call timing; framework overhead = total_turn - llm_round_trip.
  • Apples-to-apples comparison — All frameworks get the same workload, same model, same BENCH_START_EPOCH_NS clock source.
  • Cost awareness--dry-run, --max-cost-usd, --confirm-cost prevent surprise bills.
  • CI-ready--check-regression exits with code 2 on regressions, integrates with any CI system.
  • Platform-specific memory — Linux /proc/self/statm (authoritative), macOS mach_task_basic_info (informational).

Environment

Requires GOOGLE_API_KEY for Gemini models. Set it before running:

export GOOGLE_API_KEY="your-api-key"

For external Python harnesses, ensure dependencies are installed:

pip install google-generativeai langgraph langchain-google-genai langchain-core

Measurement Notes

  • Cold start is measured from process spawn (or BENCH_START_EPOCH_NS) to the first generate_content call timestamp.
  • Agent loop overhead is computed per-turn by subtracting the LLM round-trip from total turn wall-clock time.
  • CV > 20% warning is emitted when overhead measurements are unstable — increase --runs or reduce system load.
  • Linux is authoritative for published cross-framework memory comparisons due to consistent /proc/self/statm RSS reporting.
  • Results measured on Apple M-series, macOS, June 2026. Your numbers will vary by hardware and network.

License

Apache 2.0