agentcarousel 0.2.3

Evaluate agents and skills with YAML fixtures, run cases (mock or live), and keep run rows in SQLite for reports and evidence export.

agentcarousel

agentcarousel is a Rust CLI (and library) for working with fixture files: schema checks, scenario runs (mock or live), and SQLite history.

Evaluate agent behavior and skills with reproducible fixtures, scored checks, and exportable evidence.

Build

cargo build

Configuration

The CLI reads agentcarousel.toml from the repository root or
~/.config/agentcarousel/config.toml. Use --config <path> to override.
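
To see which file a given invocation would likely pick up, the lookup can be sketched as a small shell helper. This is an illustration only: the function name resolve_agentcarousel_config is hypothetical, and the precedence shown (explicit path, then repository root, then the user config directory) is an assumption, since the text above does not state the order.

```shell
# Hypothetical helper: prints the config path the CLI would plausibly read.
# Assumed precedence (not confirmed by the source):
#   1. an explicit path (what you would pass to --config)
#   2. agentcarousel.toml in the repository root
#   3. ~/.config/agentcarousel/config.toml
resolve_agentcarousel_config() {
  explicit="${1:-}"      # may be empty
  repo_root="${2:-.}"    # defaults to the current directory
  if [ -n "$explicit" ]; then
    printf '%s\n' "$explicit"
  elif [ -f "$repo_root/agentcarousel.toml" ]; then
    printf '%s\n' "$repo_root/agentcarousel.toml"
  elif [ -f "$HOME/.config/agentcarousel/config.toml" ]; then
    printf '%s\n' "$HOME/.config/agentcarousel/config.toml"
  else
    return 1             # no config file found
  fi
}
```

Calling resolve_agentcarousel_config "" . from a checkout prints the repository-level file when it exists, and falls back to the per-user file otherwise.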

Run history defaults to:

  • macOS: ~/Library/Application Support/agentcarousel/history.db
  • Linux: ~/.local/share/agentcarousel/history.db

Set AGENTCAROUSEL_HISTORY_DB to override this location.
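
The helper below is a minimal sketch of where run history would end up on a given machine. The name history_db_path is hypothetical, and it assumes that a non-empty AGENTCAROUSEL_HISTORY_DB takes priority over the per-platform defaults listed above; platforms other than macOS and Linux are not covered by the source and are treated as unknown.

```shell
# Hypothetical helper mirroring the documented defaults.
# Assumption: AGENTCAROUSEL_HISTORY_DB, when set, wins over the defaults.
history_db_path() {
  os="${1:-$(uname -s)}"   # pass "Darwin" or "Linux" to override detection
  if [ -n "${AGENTCAROUSEL_HISTORY_DB:-}" ]; then
    printf '%s\n' "$AGENTCAROUSEL_HISTORY_DB"
    return 0
  fi
  case "$os" in
    Darwin) printf '%s\n' "$HOME/Library/Application Support/agentcarousel/history.db" ;;
    Linux)  printf '%s\n' "$HOME/.local/share/agentcarousel/history.db" ;;
    *)      return 1 ;;    # default not documented for this platform
  esac
}
```

A per-project history can then be selected with export AGENTCAROUSEL_HISTORY_DB="$PWD/.agentcarousel/history.db" before running the CLI; that directory name is only an example.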

Live generation and judge evaluation are supported with Gemini, OpenAI, Anthropic, and OpenRouter models.

Quickstart (5 minutes)

# 1) Build
cargo build

# 2) Export a provider key (examples)
export GEMINI_API_KEY=your_key_here
# or export OPENAI_API_KEY=your_key_here
# or export OPENROUTER_API_KEY=your_key_here

# 3) Run a live evaluation (Gemini example)
cargo run -p agentcarousel -- eval --execution-mode live \
  --model gemini-2.5-flash \
  --judge --judge-model gemini-2.5-flash

Provider recipes

# Budget: OpenRouter free tier generator + Gemini judge
export OPENROUTER_API_KEY=your_key_here
export GEMINI_API_KEY=your_key_here
cargo run -p agentcarousel -- eval --execution-mode live \
  --model openrouter/free \
  --judge --judge-model gemini-2.5-flash

# Budget: OpenRouter free tier generator + OpenRouter judge
export OPENROUTER_API_KEY=your_key_here
cargo run -p agentcarousel -- eval --execution-mode live \
  --model openrouter/free \
  --judge --judge-model nvidia/nemotron-3-super-120b-a12b:free

# Balanced: OpenAI generator + OpenAI judge
export OPENAI_API_KEY=your_key_here
cargo run -p agentcarousel -- eval --execution-mode live \
  --model gpt-4o-mini \
  --judge --judge-model gpt-4o-mini

# Premium: Gemini generator + Gemini judge
export GEMINI_API_KEY=your_key_here
cargo run -p agentcarousel -- eval --execution-mode live \
  --model gemini-2.5-flash \
  --judge --judge-model gemini-2.5-flash

For more examples and troubleshooting, see [docs/quickstart.md](docs/quickstart.md).
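
Each recipe above depends on one or two exported keys, so a preflight check can save a failed live run. The function below is a sketch, not part of agentcarousel: missing_keys is a hypothetical name, and it only reports which of the variable names you pass are unset or empty.

```shell
# Hypothetical preflight: print the name of every listed variable
# that is unset or empty, one per line. Prints nothing when all are set.
missing_keys() {
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
    eval "value=\${$name:-}"
    [ -n "$value" ] || printf '%s\n' "$name"
  done
}
```

For the first budget recipe, missing_keys OPENROUTER_API_KEY GEMINI_API_KEY should print nothing before you start the eval command; any name it prints still needs an export.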

Common commands

# Validate fixtures against schema (paths required)
cargo run -p agentcarousel -- validate fixtures/examples/example-skill.yaml

# Run tests from fixture paths (defaults to fixtures/)
cargo run -p agentcarousel -- test

# Evaluation pass (see docs/evaluator-contract.md)
cargo run -p agentcarousel -- eval

# Report on stored runs
cargo run -p agentcarousel -- report list

# Scaffold a new fixture YAML
cargo run -p agentcarousel -- init --skill my-skill-name

# Bundle pack/verify (M3)
cargo run -p agentcarousel -- bundle pack my-bundle --out my-bundle.tar.gz
cargo run -p agentcarousel -- bundle verify my-bundle.tar.gz

# Registry workflow (single publish command)
cargo run -p agentcarousel -- publish fixtures/bundles/terraform-sentinel-scaffold --url "https://api.agentcarousel.com"

# Publish all matching local runs for that bundle (newest first)
cargo run -p agentcarousel -- publish fixtures/bundles/terraform-sentinel-scaffold --url "https://api.agentcarousel.com" --all-runs --limit 5

# Export an evidence pack for a run id
cargo run -p agentcarousel -- export <RUN_ID>

# Check registry trust state (online-first)
cargo run -p agentcarousel -- trust-check terraform-sentinel-scaffold@1.0.0 --url "https://api.agentcarousel.com"

# Optional offline minisign verification with local attestation
cargo run -p agentcarousel -- trust-check terraform-sentinel-scaffold@1.0.0 \
  --url "https://api.agentcarousel.com" \
  --attestation ./attestation-terraform-sentinel-scaffold-1.0.0.json \
  --minisign-pubkey ./agentcarousel-minisign.pub
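
When the same script has to run both with and without a local attestation, the optional flags can be added conditionally. The sketch below only assembles the argument list; trust_check_args is a hypothetical name, and passing --attestation and --minisign-pubkey only as a pair is an assumption drawn from the example above, not a documented requirement.

```shell
# Hypothetical helper: print trust-check arguments, one per line.
# The offline verification flags are appended only when both files exist
# (assumption: the two flags are meant to be used together).
trust_check_args() {
  bundle_ref="${1:-}"; url="${2:-}"; attestation="${3:-}"; pubkey="${4:-}"
  set -- trust-check "$bundle_ref" --url "$url"
  if [ -f "$attestation" ] && [ -f "$pubkey" ]; then
    set -- "$@" --attestation "$attestation" --minisign-pubkey "$pubkey"
  fi
  printf '%s\n' "$@"
}
```

Because the output is one argument per line, it is easy to inspect before handing the same values to cargo run -p agentcarousel -- in a script.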

Internal modules

CLI, core, fixtures, runner, evaluators, and reporters live as submodules under crates/agentcarousel/src/ in one Cargo package.

Documentation

See the GitHub repository.

ATF / trust: AgentCarousel maps to the Agentic Trust Framework as an implementation of evidence and CI gates.