agentcarousel 0.6.1

Unit tests for AI agents. Run behavioral tests in CI, score with an LLM judge, and export signed evidence your auditors accept.

AgentCarousel

Unit tests for AI agents. The only AI testing tool that produces evidence your auditors accept — run behavioral tests in CI, score with an LLM judge, gate on regressions, and export signed bundles ready for procurement teams and government regulators.

License: MIT

AgentCarousel delivers a repeatable, automated way to assess AI agent efficacy and behavior — establishing the trust required before deployment. Tests run deterministically in CI, semantic scoring comes from an LLM-as-a-judge, and results can be certified by a domain expert with a signed attestation.

Why agentcarousel

  • Behavioral certainty before deployment: Declarative YAML fixtures pin what your agent should and shouldn't say. Same inputs, same outputs, every time, without touching a live API.
  • Evidence that stands up to scrutiny: Every run exports a signed bundle (.tar.gz + minisign attestation) that a domain expert can certify, ready for auditors, procurement teams, and government regulators.
  • Semantic scoring, not just pattern matching: An LLM-as-a-judge evaluates outputs with contextual understanding, catching regressions that pattern matching alone would miss.
  • Built for regulated environments: Risk tier, data handling, and certification track are first-class fixture fields. Integrates into CI and produces governance artifacts for your compliance program.

Install

# Linux / macOS — slim binary (no dashboard)
curl -fsSL https://install.agentcarousel.com | sh

# Linux / macOS — full binary (includes web dashboard UI)
curl -fsSL https://install.agentcarousel.com | sh -s -- --feature dashboard

# Homebrew (macOS)
brew tap agentcarousel/agentcarousel && brew install agentcarousel

# Cargo (Rust)
cargo install agentcarousel

Upgrade an existing installation to the full variant at any time:

agc update --feature dashboard

Quickstart

# 1. Scaffold a skill fixture
agc init --skill my-skill

# 2. Generate cases with an LLM (no hand-writing required)
agc generate --extend fixtures/my-skill/ --count 8

# 3. Run offline (mock mode, no API keys)
agc test fixtures/my-skill/

# 4. Evaluate with live generation and an LLM judge
agc eval fixtures/my-skill/ --execution-mode live --judge --model gemini-2.5-flash

# 5. Export a signed evidence bundle
agc export -l

See fixtures/regex-builder/ for a complete fixture with all cases, golden outputs, and bundle manifest.

Generate Fixtures

agc generate scaffolds validated YAML fixture cases using your configured generator LLM — no hand-writing required.

# From a skill name and description
agc generate --skill customer-support \
             --description "handles refund and cancellation requests" \
             --count 8

# From an existing system prompt file
agc generate --from-prompt fixtures/customer-support/prompt.md --count 10

# Extend an existing fixture (deduplicates against existing case IDs)
agc generate --extend fixtures/customer-support/ --count 5

# Preview without writing
agc generate --skill my-skill --description "..." --dry-run

# Machine-readable output (for agent workflows)
agc generate --skill my-skill --description "..." --dry-run --json

Generated cases are validated against the fixture schema before being written. If the LLM output fails validation, the command retries once with the errors appended to the prompt. The meta-prompt lives at templates/generate-prompt.md — teams can customize it to specify what "good coverage" means for their domain.
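
The validate-then-retry-once behavior described above can be pictured with a small shell sketch. This is an illustration only: generate_cases and validate_cases are hypothetical stand-ins (not agc internals), and agc generate already performs this loop for you.

```shell
#!/usr/bin/env bash
# Sketch of "validate, then retry once with the errors appended to the prompt".
# generate_cases and validate_cases are hypothetical placeholders, not agc code.

generate_cases() {   # $1 = prompt text; prints a candidate fixture
  printf 'cases for: %s\n' "$1"
}

validate_cases() {   # $1 = candidate; prints errors and returns 1 on failure
  case "$1" in
    *"Fix these errors"*) return 0 ;;            # pretend the retry is valid
    *) echo "missing field: id"; return 1 ;;
  esac
}

generate_with_retry() {
  local prompt="$1" output errors
  output="$(generate_cases "$prompt")"
  if errors="$(validate_cases "$output")"; then
    printf '%s\n' "$output"
    return 0
  fi
  # Single retry: append the validation errors to the prompt.
  prompt="${prompt}
Fix these errors:
${errors}"
  output="$(generate_cases "$prompt")"
  if validate_cases "$output" >/dev/null; then
    printf '%s\n' "$output"
    return 0
  fi
  return 1   # still invalid after one retry: give up
}
```

Capping the loop at one retry keeps the cost of a bad generation bounded; a fixture that still fails validation is better fixed by editing the prompt or the meta-prompt.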

Typical workflow:

agc init --skill customer-support       # scaffold directory structure
# edit fixtures/customer-support/prompt.md
agc generate --extend fixtures/customer-support/ --count 8
agc validate fixtures/customer-support/

Live Eval with LLM-as-a-Judge

# Generator LLM key (the model being tested)
export GEMINI_API_KEY=your_key        # or OPENAI_API_KEY / OPENROUTER_API_KEY

# Judge LLM key (the model scoring outputs)
export ANTHROPIC_API_KEY=your_key     # or bring your own provider

# Run skill fixtures against live APIs with LLM judge
agc eval fixtures/regex-builder/ \
  --execution-mode live \
  --evaluator all --judge \
  --model gemini-2.5-flash \
  --judge-model claude-haiku-4-5-20251001 \
  --runs 1

Execution modes: --execution-mode live hits real LLM APIs. Omit it (or pass mock) for deterministic offline runs.

Evaluators: --evaluator all honors each case's declared evaluator. --evaluator judge routes every case through the LLM judge regardless. --evaluator mock skips LLM calls entirely.

Filters: --filter matches on skill/case-id; --filter-tags accepts comma-separated tags (e.g. database, safety).
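
As a rough mental model of the three --evaluator values, the sketch below maps the flag plus a case's declared evaluator to the evaluator that would run. resolve_evaluator is a hypothetical helper and "rules" is a made-up evaluator name used only for illustration; neither comes from agc itself.

```shell
# Sketch of how --evaluator could resolve per case.
# resolve_evaluator is hypothetical, not an agc function.
resolve_evaluator() {   # $1 = --evaluator value, $2 = evaluator declared by the case
  case "$1" in
    all)   printf '%s\n' "$2" ;;   # honor the case's own declaration
    judge) echo "judge" ;;         # force the LLM judge for every case
    mock)  echo "mock" ;;          # no LLM calls at all
    *)     echo "unknown evaluator: $1" >&2; return 2 ;;
  esac
}

resolve_evaluator all rules     # prints: rules
resolve_evaluator judge rules   # prints: judge
```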

Multi-Model Comparison

agc carousel runs the same fixture suite against multiple models in parallel and prints a ranked comparison table — pass rate, effectiveness score, and latency p50 per model. Every model's run is saved to history so agc compare and the dashboard compare view work immediately.

# Rank three models head-to-head
agc carousel \
  --models gpt-4o,gemini-2.5-flash,claude-sonnet-4-6 \
  fixtures/my-skill/

# With judge scoring
agc carousel \
  --models gpt-4o,gemini-2.5-flash \
  fixtures/ \
  --evaluator all --judge \
  --judge-model claude-haiku-4-5-20251001

# JSON output for downstream tooling
agc carousel --models gpt-4o,gemini-2.5-flash fixtures/ --json

A/B Prompt Comparison

agc ab runs the same fixture against two prompts concurrently and produces a head-to-head comparison: pass rate, effectiveness score, and per-case winners.

# Compare two prompt variants
agc ab \
  --a fixtures/v1/prompt.md \
  --b fixtures/v2/prompt.md \
  fixtures/my-skill/ \
  --execution-mode live \
  --model gemini-2.5-flash

# With judge scoring
agc ab \
  --a prompts/old.md --b prompts/new.md \
  fixtures/ \
  --evaluator all --judge \
  --judge-model claude-haiku-4-5-20251001

# JSON output
agc ab --a p1.md --b p2.md fixtures/ --json

The --threshold flag controls the effectiveness delta required to declare a winner (default 0.05). Both runs are saved to history.
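
One way to read the --threshold rule is sketched below. ab_winner is a hypothetical helper, and it assumes a winner is declared when the effectiveness gap is at least the threshold; whether agc treats the boundary as inclusive is not stated here.

```shell
# Sketch of the A/B winner rule. ab_winner is hypothetical, not an agc function.
ab_winner() {   # $1 = effectiveness of A, $2 = effectiveness of B, $3 = threshold
  awk -v a="$1" -v b="$2" -v t="${3:-0.05}" 'BEGIN {
    d = b - a
    if (d >= t)       print "B"     # B leads by at least the threshold
    else if (-d >= t) print "A"     # A leads by at least the threshold
    else              print "tie"   # gap too small to call
  }'
}

ab_winner 0.70 0.80        # prints: B
ab_winner 0.80 0.82        # prints: tie
ab_winner 0.80 0.82 0.01   # prints: B
```

A larger threshold makes the comparison more conservative: small effectiveness differences are reported as a tie rather than a win.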

CI Regression Gate

agc compare compares two eval runs and exits 1 when effectiveness regresses beyond a threshold. When at least 5 matched cases carry effectiveness scores, a Mann-Whitney U test gates the exit: the regression only triggers when the delta exceeds --threshold and the p-value falls below --significance. This prevents single-run noise from failing CI.
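
The gating rule above can be summarized in a few lines of shell. This is a sketch under stated assumptions, not the agc implementation: should_fail_gate is a hypothetical helper, it takes the p-value as an input instead of computing the Mann-Whitney U test, the 0.05 defaults are placeholders, and the threshold-only behavior below 5 matched cases is one reading of the description.

```shell
# Sketch of the regression gate. should_fail_gate is hypothetical.
# Returns 0 when the gate should fail CI, 1 otherwise.
should_fail_gate() {
  # $1 = baseline effectiveness   $2 = current effectiveness
  # $3 = matched cases with effectiveness scores   $4 = p-value
  # $5 = threshold (assumed 0.05) $6 = significance (assumed 0.05)
  awk -v base="$1" -v cur="$2" -v n="$3" -v p="$4" \
      -v t="${5:-0.05}" -v s="${6:-0.05}" 'BEGIN {
    drop = base - cur
    if (drop <= t)        exit 1   # delta within threshold: no regression
    if (n >= 5 && p >= s) exit 1   # enough samples, but not significant
    exit 0                         # regression: fail the gate
  }'
}

should_fail_gate 0.90 0.70 10 0.01 && echo "regression"       # prints: regression
should_fail_gate 0.90 0.70 10 0.30 || echo "noise, gate passes" # prints: noise, gate passes
```

Requiring both conditions means a large drop that could plausibly be run-to-run variance does not block a merge, while a small but statistically clear drop still has to clear the threshold first.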

# Compare the latest run to an explicit baseline
agc compare -l --baseline <run-id> --threshold 0.05

# Auto-baseline: finds previous run for the same skill
agc compare -l

# Tighten the significance gate for high-stakes checks
agc compare -l --baseline <run-id> --threshold 0.05 --significance 0.01

# Tag a run as a named baseline for CI reference
agc compare tag <run-id> --name prod-baseline

GitHub Actions example:

- name: Eval
  run: agc eval fixtures/ --judge --runs 5

- name: Regression gate
  run: agc compare -l --baseline ${{ vars.BASELINE_RUN_ID }} --threshold 0.05

The JSON output includes p_value, significant, samples_baseline, and samples_current fields for downstream analysis.

Exit codes: 0 = no regression, 1 = regression exceeds threshold, 4 = runtime error.

Dashboard

agc dashboard serves a local web UI from the binary. It is available in the full binary or in builds with the dashboard feature flag enabled. After starting it, open http://localhost:7421.

agc dashboard                        # http://localhost:7421
agc dashboard --port 8080            # custom port
agc dashboard --db path/to/history.db

Reports

# List recent runs
agc report list

# Inspect a run
agc report show <RUN-ID>

# Export as a signed evidence bundle
agc export <RUN-ID>
agc export -l   # latest run

Agent Integration

Every command emits structured JSON when --json is passed or stdout is not a TTY (piped to a file, another process, or an AI coding agent):

# Parse eval results in a pipeline
agc eval fixtures/ --json | jq '.data.summary.pass_rate'

# Machine-readable validate output
agc validate fixtures/ --json | jq '.data.atf_summary'

# Generate fixtures from an agent script
agc generate --extend fixtures/my-skill/ --count 5 --json
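
Building on the jq example above, a script could gate on a minimum pass rate. The sketch assumes pass_rate is a fraction between 0 and 1 (not confirmed here); meets_pass_rate is a hypothetical helper and 0.9 is an arbitrary example floor.

```shell
# Sketch of a pass-rate floor. meets_pass_rate is hypothetical, not an agc command.
meets_pass_rate() {   # $1 = observed pass rate, $2 = minimum required
  # Adding 0 forces numeric comparison, so "null" or an empty value counts as 0.
  awk -v r="$1" -v m="$2" 'BEGIN { exit !((r + 0) >= (m + 0)) }'
}

# Usage in a pipeline (requires agc and jq on PATH):
#   rate="$(agc eval fixtures/ --json | jq -r '.data.summary.pass_rate')"
#   meets_pass_rate "$rate" 0.9 || exit 1
```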

Exit codes (consistent across all commands):

Code  Meaning
0     Success
1     Failure (tests failed, regression detected)
2     Invalid arguments
3     Config error
4     Runtime error (IO, network, DB)
5     Not found
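
Because the codes are shared across commands, a wrapper script can tell a genuine test failure apart from an infrastructure problem. The helper below is a hypothetical convenience built from the table above, not something agc ships.

```shell
# Sketch: translate an agc exit code into a log-friendly message.
# describe_exit is hypothetical; the meanings are copied from the table above.
describe_exit() {
  case "$1" in
    0) echo "success" ;;
    1) echo "failure (tests failed or regression detected)" ;;
    2) echo "invalid arguments" ;;
    3) echo "config error" ;;
    4) echo "runtime error (IO, network, DB)" ;;
    5) echo "not found" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage (requires agc on PATH):
#   agc compare -l; code=$?
#   echo "agc compare: $(describe_exit "$code")"
```

In CI it can be useful to treat code 1 as a blocking failure while retrying or alerting separately on code 4, since the latter says nothing about the agent's behavior.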

Configuration

Copy agentcarousel.example.toml to agentcarousel.toml. All configuration options are documented in the example file.

Bundles

A bundle is a signed, distributable archive of a skill's fixture, cases, and evidence.

# Pack a bundle
agc bundle pack fixtures/regex-builder

# Verify bundle integrity
agc bundle verify fixtures/customer-support
agc bundle verify my-bundle.tar.gz

# Pull from registry
agc bundle pull customer-support-1.0.0 --url "https://api.agentcarousel.com"

# Publish to registry
agc publish fixtures/customer-support --url "https://api.agentcarousel.com"

# Publish multiple runs
agc publish fixtures/customer-support \
  --url "https://api.agentcarousel.com" \
  --all-runs --limit 5

Trust Checks

Trust checks query a skill's registry state for use in CI gates and governed workflows — verify a deployed agent is certified and untampered before it runs.

# Check trust state from registry
agc trust-check customer-support@1.0.0 \
  --url "https://api.agentcarousel.com"

# Verify with local attestation
agc trust-check customer-support@1.0.0 \
  --url "https://api.agentcarousel.com" \
  --attestation ./attestation-customer-support-1.0.0.json \
  --minisign-pubkey ./your-minisign.pub

Contributions

For fixture contributions, open an issue before implementation.