AgentCarousel
Unit tests for AI agents. The only AI testing tool that produces evidence your auditors accept — run behavioral tests in CI, score with an LLM judge, gate on regressions, and export signed bundles ready for procurement teams and government regulators.
AgentCarousel delivers a repeatable, automated way to assess AI agent efficacy and behavior — establishing the trust required before deployment. Tests run deterministically in CI, semantic scoring comes from an LLM-as-a-judge, and results can be certified by a domain expert with a signed attestation.
Why AgentCarousel
- Behavioral certainty before deployment: Declarative YAML fixtures pin what your agent should and shouldn't say. Same inputs, same outputs, every time, without touching a live API.
- Evidence that stands up to scrutiny: Every run exports a signed bundle (.tar.gz + minisign attestation) that a domain expert can certify, ready for auditors, procurement teams, and government regulators.
- Semantic scoring, not just pattern matching: An LLM-as-a-judge evaluates outputs with contextual understanding, catching regressions that pattern matching alone would miss.
- Built for regulated environments: Risk tier, data handling, and certification track are first-class fixtures. Integrates into CI and produces governance artifacts for your compliance program.
Install
# Linux / macOS — slim binary (no dashboard)
# Linux / macOS — full binary (includes web dashboard UI)
# Homebrew (macOS)
# Cargo (Rust)
Two binary variants are available on every release:
| Variant | Asset suffix | Includes |
|---|---|---|
| Slim (default) | (none) | All commands except dashboard |
| Full | -full | Everything, including agc dashboard |
Upgrade an existing installation to the full variant at any time:
Quickstart
# 1. Scaffold a skill fixture
# 2. Generate cases with an LLM (no hand-writing required)
# 3. Run offline (mock mode, no API keys)
# 4. Evaluate with live generation and an LLM judge
# 5. Export a signed evidence bundle
See fixtures/regex-builder/ for a complete fixture with all cases, golden outputs, and bundle manifest.
Generate Fixtures
agc generate scaffolds validated YAML fixture cases using your configured generator LLM — no hand-writing required.
# From a skill name and description
# From an existing system prompt file
# Extend an existing fixture (deduplicates against existing case IDs)
# Preview without writing
# Machine-readable output (for agent workflows)
Generated cases are validated against the fixture schema before being written. If the LLM output fails validation, the command retries once with the errors appended to the prompt. The meta-prompt lives at templates/generate-prompt.md — teams can customize it to specify what "good coverage" means for their domain.
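The validate-then-retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not AgentCarousel's actual implementation: `generate_with_retry`, `generate`, and `validate` are hypothetical names, and the exact wording appended to the prompt is an assumption.

```python
from typing import Callable, List


def generate_with_retry(
    prompt: str,
    generate: Callable[[str], str],        # hypothetical: calls the generator LLM
    validate: Callable[[str], List[str]],  # hypothetical: returns schema errors
    max_retries: int = 1,
) -> str:
    """Return the first output that validates, retrying with errors appended."""
    current_prompt = prompt
    for _ in range(max_retries + 1):
        output = generate(current_prompt)
        errors = validate(output)
        if not errors:
            return output
        # Append the validation errors so the next attempt can correct them.
        error_lines = "\n".join(f"- {e}" for e in errors)
        current_prompt = f"{prompt}\n\nValidation errors from the previous attempt:\n{error_lines}"
    raise ValueError(f"output failed validation after {max_retries + 1} attempts: {errors}")


# Demo with stand-in functions: the first attempt is invalid, the second is valid.
calls: List[str] = []


def fake_generate(p: str) -> str:
    calls.append(p)
    return "cases: []" if len(calls) > 1 else "not yaml"


def fake_validate(out: str) -> List[str]:
    return [] if out.startswith("cases:") else ["missing top-level 'cases' key"]


result = generate_with_retry("Write cases", fake_generate, fake_validate)
```

With `max_retries=1` the generator is called at most twice, which matches the single retry described above.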
Typical workflow:
# edit fixtures/customer-support/prompt.md
Live Eval with LLM-as-a-Judge
# Generator LLM key (the model being tested)
# or OPENAI_API_KEY / OPENROUTER_API_KEY
# Judge LLM key (the model scoring outputs)
# or bring your own provider
# Run skill fixtures against live APIs with LLM judge
Execution modes: --execution-mode live hits real LLM APIs. Omit it (or pass mock) for deterministic offline runs.
Evaluators: --evaluator all honors each case's declared evaluator. --evaluator judge routes every case through the LLM judge regardless. --evaluator mock skips LLM calls entirely.
Filters: --filter matches on skill/case-id; --filter-tags accepts comma-separated tags (e.g. database,safety).
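The evaluator routing described above amounts to a small decision rule. The sketch below illustrates that rule only; `resolve_evaluator` is a hypothetical helper, and the declared evaluator name `"rules"` in the usage line is made up for the example.

```python
def resolve_evaluator(flag: str, declared: str) -> str:
    """Pick the evaluator for one case given the --evaluator value."""
    if flag == "all":
        # Honor whatever evaluator the case itself declares.
        return declared
    if flag in ("judge", "mock"):
        # Override the case's declaration for every case.
        return flag
    raise ValueError(f"unsupported evaluator flag: {flag!r}")


# "rules" is a hypothetical per-case evaluator name.
chosen = resolve_evaluator("all", "rules")
```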
CI Regression Gate
agc compare compares two eval runs and exits 1 when effectiveness regresses beyond a threshold — drop it into any CI pipeline as a binary pass/fail gate.
# Compare the latest run to an explicit baseline
# Auto-baseline: finds previous run for the same skill
# Tag a run as a named baseline for CI reference
# JSON output for downstream tooling
GitHub Actions example:
- name: Eval
  run: agc eval fixtures/ --judge --runs 3
- name: Regression gate
  run: agc compare -l --baseline ${{ vars.BASELINE_RUN_ID }} --threshold 0.05
Exit codes: 0 = no regression, 1 = regression exceeds threshold, 2 = error.
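To make the gate concrete, here is a rough sketch of a threshold check that maps to the 0 and 1 exit codes. It assumes the threshold is an absolute drop in mean effectiveness and that the comparison is strict; how agc compare actually aggregates and compares scores is not specified here, and `regression_exit_code` is a hypothetical name.

```python
from statistics import mean
from typing import Sequence


def regression_exit_code(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    threshold: float = 0.05,
) -> int:
    """Return 1 if mean effectiveness dropped by more than threshold, else 0."""
    # Assumption: regression is measured as an absolute drop in the mean score.
    drop = mean(baseline_scores) - mean(candidate_scores)
    return 1 if drop > threshold else 0


unchanged = regression_exit_code([0.8, 0.9], [0.8, 0.9])
regressed = regression_exit_code([0.9, 0.9], [0.7, 0.8])
improved = regression_exit_code([0.6, 0.7], [0.9, 0.9])
```

An improvement never trips the gate in this sketch, because only a positive drop beyond the threshold yields exit code 1.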
Dashboard
agc dashboard serves a local web UI from the binary — zero config. Open http://localhost:7421 after starting it. Available in the full binary variant.
Pages:
- `/`: Run history index with headline metrics (total runs, pass rate, mean effectiveness) and trend sparklines
- `/runs/:id`: Run detail with per-case effectiveness, inline expansion with trace steps, rubric scores, and judge rationale
- `/compare?a=:id&b=:id`: Side-by-side run comparison with delta badges and regression highlighting; deep-linkable URL
- `/review?run=:id`: Judge review screen for annotating each LLM judge call as ✓ correct / ✗ wrong / ~ borderline; annotations persist to `reviews.jsonl` and are included in `agc export` evidence bundles
Install the full variant to get dashboard access:
# or upgrade in-place:
Reports
# List recent runs
# Inspect a run
# Export as a signed evidence bundle
Agent Integration
Every command emits structured JSON when --json is passed or stdout is not a TTY (piped to a file, another process, or an AI coding agent):
# Parse eval results in a pipeline
# Machine-readable validate output
# Generate fixtures from an agent script
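The rule for when JSON is emitted (flag passed, or stdout is not a TTY) is a common CLI pattern. The sketch below shows the general technique in Python; it is not taken from AgentCarousel's source, and `wants_json` and `FakeTerminal` are hypothetical names used only for illustration.

```python
import io
import sys


def wants_json(json_flag: bool, stream=None) -> bool:
    """Return True when structured JSON output should be emitted."""
    stream = sys.stdout if stream is None else stream
    # Piped or redirected output is not a TTY, so it gets JSON by default.
    return json_flag or not stream.isatty()


class FakeTerminal(io.StringIO):
    """Stand-in for an interactive terminal, used only for the demo below."""

    def isatty(self) -> bool:
        return True


piped = wants_json(False, io.StringIO())          # output captured by a pipe
interactive = wants_json(False, FakeTerminal())   # human at a terminal
forced = wants_json(True, FakeTerminal())         # explicit --json
```

Under this rule a script that pipes the output gets JSON without passing any flag, while a person at a terminal gets human-readable output unless they ask for JSON.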
Success envelope:
Error envelope:
Exit codes (consistent across all commands):
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Failure (tests failed, regression detected) |
| 2 | Invalid arguments |
| 3 | Config error |
| 4 | Runtime error (IO, network, DB) |
| 5 | Not found |
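A calling script might mirror the table above with a small enum so that failures are reported by name. This is an illustrative sketch: the numeric values come from the table, while `ExitCode` and `describe` are hypothetical names.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the table above."""

    SUCCESS = 0
    FAILURE = 1            # tests failed, regression detected
    INVALID_ARGUMENTS = 2
    CONFIG_ERROR = 3
    RUNTIME_ERROR = 4      # IO, network, DB
    NOT_FOUND = 5


def describe(code: int) -> str:
    """Return a readable label for an exit code, including unknown ones."""
    try:
        return ExitCode(code).name.replace("_", " ").lower()
    except ValueError:
        return f"unknown exit code {code}"


label = describe(4)
```

A CI wrapper could, for example, treat code 1 as a genuine test or regression failure and codes 2 through 5 as setup or infrastructure problems worth reporting separately.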
Configuration
Copy agentcarousel.example.toml to agentcarousel.toml. All configuration options are documented in the example file.
Bundles
A bundle is a signed, distributable archive of a skill's fixture, cases, and evidence.
# Pack a bundle
# Verify bundle integrity
# Pull from registry
# Publish to registry
# Publish multiple runs
Trust Checks
Trust checks query a skill's registry state for use in CI gates and governed workflows — verify a deployed agent is certified and untampered before it runs.
# Check trust state from registry
# Verify with local attestation
Contributions
- Start here: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
For fixture contributions, open an issue before implementation.