AgentCarousel
Unit tests for AI agents. Run behavioral tests in CI, score outputs with an LLM judge, gate on regressions, rank models by cost and performance, generate compliance metrics for auditors, and export signed, auditable bundles.
Install · Quickstart · LLM-as-a-Judge Eval · Multi-Model Comparisons · A/B Tests · Compliance Metrics · Reports & Export
Why agentcarousel
- Behavioral certainty before deployment: Declarative YAML fixtures pin what your agent should and shouldn't say. Same inputs, same outputs, every time, without touching a live API.
- Evidence that stands up to scrutiny: Every run exports a signed bundle (.tar.gz + minisign attestation) that a domain expert can certify, ready for auditors, procurement teams, and government regulators.
- Semantic scoring, not just pattern matching: An LLM-as-a-judge evaluates outputs with contextual understanding, catching regressions that keyword checks miss.
- Built for regulated environments: Risk tier, data handling, and certification track are first-class fixture fields. It integrates into CI and produces governance artifacts for your compliance program.
Install
# Linux / macOS — slim binary (no dashboard)
|
# Linux / macOS — full binary (includes web dashboard UI)
|
# Homebrew (macOS)
&&
# Cargo (Rust)
Quickstart
# 1. Scaffold a skill fixture
# 2. Generate cases with an LLM (no hand-writing required)
# 3. Run offline (mock mode, no API keys)
# 4. Evaluate with live generation and an LLM judge
# 5. Export a signed evidence bundle
See fixtures/regex-builder/ for a complete fixture with all cases, golden outputs, and bundle manifest.
Generate Fixtures
agc generate scaffolds validated YAML fixture cases using your configured generator LLM. No manual writing required.
# From a skill name and description
# From an existing system prompt file
# Extend an existing fixture (deduplicates against existing case IDs)
Typical fixture workflow:
# edit fixtures/customer-support/prompt.md
Live Eval with LLM-as-a-Judge
# Generator LLM key (the model being tested)
# or OPENAI_API_KEY / OPENROUTER_API_KEY
# Judge LLM key (the model scoring outputs)
# or bring your own provider
# Run skill fixtures against live APIs with LLM judge
Execution modes: --execution-mode live hits real LLM APIs. Omit it (or pass mock) for deterministic offline runs.
Evaluators: --evaluator all honors each case's declared evaluator. --evaluator judge routes every case through the LLM judge regardless. Omit --evaluator (or pass rules) for assertion-based scoring with no LLM judge calls.
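The --evaluator rules above amount to a small per-case decision. Here is a minimal sketch in Python, assuming each case declares its evaluator as a plain string. The function name and the declared-evaluator field are illustrative and are not taken from agentcarousel's internals.

```python
from typing import Optional


def effective_evaluator(flag: Optional[str], declared: str) -> str:
    """Return the evaluator applied to one case.

    flag     -- value passed to --evaluator, or None when the flag is omitted
    declared -- evaluator named by the case itself (hypothetical field)
    """
    if flag is None or flag == "rules":
        return "rules"      # assertion-based scoring, no LLM judge calls
    if flag == "judge":
        return "judge"      # every case goes through the LLM judge
    if flag == "all":
        return declared     # honor whatever the case declares
    raise ValueError(f"unknown evaluator flag: {flag!r}")


# Two illustrative cases, one declaring each evaluator.
cases = {"happy-path": "rules", "tone-check": "judge"}
for flag in (None, "judge", "all"):
    resolved = {cid: effective_evaluator(flag, d) for cid, d in cases.items()}
    print(flag, resolved)
```

The practical consequence is that only --evaluator judge and --evaluator all can incur judge token costs.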
Filters: --filter matches on skill/case-id; --filter-tags accepts comma-separated tags (e.g. database, safety).
Token and cost tracking: After each run, agc eval prints token and USD cost totals (generator and judge separately) when live API data is available. The same values are saved to run.json and report.md for audit trails.
Multi-Model Comparison
agc carousel runs the same fixture suite against multiple models in parallel and prints a ranked comparison table showing pass rate, effectiveness score, latency p50, token usage, and cost per model. Every model's run is saved to history so agc compare and the dashboard compare view work immediately.
# Rank three models head-to-head
# With judge scoring
# OpenRouter models (free and open-weight)
Recommended workflow for the most complete ranking:
# 1. Record and promote golden outputs for all your fixtures
# 2. Rules-based baseline across models
# 3. Judge scoring for deeper signal
# 4. Drill into the top two
A/B Prompt Comparison
agc ab runs the same fixture against two prompts concurrently and produces a head-to-head comparison: pass rate, effectiveness score, per-case winners, and cost per variant.
# Compare two prompt variants (mock, no API key needed)
# Live generation
# With judge scoring
CI Regression Gate
agc compare exits 1 when effectiveness drops more than --threshold below the baseline. With ≥5 scored cases, the drop must also be statistically significant under a Mann-Whitney U test (p < --significance), so one noisy run does not fail CI.
# Compare latest run to the previous run for the same skill (auto-baseline)
# Or pin a named local baseline
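To make the gate logic concrete, here is a rough Python sketch of a threshold check combined with a Mann-Whitney U test. It is not agentcarousel's implementation. It assumes the drop is the absolute difference in mean effectiveness, that the test is one-sided with a normal approximation, and that the significance level defaults to 0.05. All function names are hypothetical.

```python
import math
from statistics import mean
from typing import Sequence


def mann_whitney_p(baseline: Sequence[float], current: Sequence[float]) -> float:
    """One-sided p-value that `current` scores are lower than `baseline`.

    Uses average ranks for ties, a tie-corrected variance, and a
    continuity-corrected normal approximation.
    """
    n1, n2 = len(baseline), len(current)
    combined = sorted([(v, 0) for v in baseline] + [(v, 1) for v in current])
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1
    rank_sum_current = sum(r for r, (_, g) in zip(ranks, combined) if g == 1)
    u_current = rank_sum_current - n2 * (n2 + 1) / 2
    n = n1 + n2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0                          # all scores identical
    z = (u_current - mu + 0.5) / math.sqrt(var)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def is_regression(baseline, current, threshold=0.05, significance=0.05) -> bool:
    """Assumed gate: the drop exceeds the threshold and, with >=5 cases, is significant."""
    if mean(baseline) - mean(current) <= threshold:
        return False
    if min(len(baseline), len(current)) >= 5:
        return mann_whitney_p(baseline, current) < significance
    return True


baseline = [0.90, 0.92, 0.95, 0.91, 0.93]
print(is_regression(baseline, [0.60, 0.62, 0.65, 0.61, 0.63]))  # consistent drop
print(is_regression(baseline, [0.90, 0.92, 0.95, 0.91, 0.60]))  # one outlier case
```

In the second call the mean drops by more than the threshold, but only one case moved, so the rank test does not reach significance and the sketch does not flag a regression.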
GitHub Actions (Regression Gate):
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Persist the history DB across runs so agc compare has a previous run to diff against
      - uses: actions/cache@v4
        with:
          path: .agentcarousel.db
          key: agentcarousel-${{ github.ref }}
          restore-keys: agentcarousel-
      - run: curl -fsSL https://install.agentcarousel.com | sh
      # Run 3 times per case to reduce variance from model non-determinism
      - run: agc eval fixtures/ --execution-mode live --evaluator all --judge --runs 3
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      # Exit 1 if effectiveness drops >5% vs the previous run for the same skill
      - run: agc compare -l --threshold 0.05
Exit codes: 0 pass, 1 regression, 4 runtime error, 5 run/baseline not found.
Dashboard
agc dashboard serves a local web UI from the binary. Start it, then open http://localhost:7421. The dashboard is available in the full binary, or in builds with the dashboard feature flag enabled.
Compliance Metrics
agc metrics generates a compliance-ready report with four cross-domain measurements. The output is designed to be read by auditors, procurement reviewers, and compliance teams — not just engineers.
# Metrics from run history + auto-discovered fixtures
# Point directly at fixture files
# Export machine-readable JSON for an evidence bundle
# Widen the history window
What each metric measures:
| Metric | What it tells you | Source |
|---|---|---|
| Prompt Injection Resistance | How reliably the agent blocks adversarial injection attempts (0–100) | Run history |
| Behavioral Stability | Whether effectiveness scores are drifting over time — stable, improving, or degrading | Run history |
| Test Coverage Completeness | What fraction of the risk taxonomy (happy path, edge cases, adversarial, error handling, etc.) the fixture suite covers | Fixture files |
| Score Accuracy (Calibration) | Whether the automated judge scores actually predict pass/fail outcomes | Judged run history |
How --skill and --fixture interact:
- --skill my-skill automatically loads fixture files from fixtures/my-skill/ (errors if the directory doesn't exist)
- --fixture fixtures/my-skill/ reads the skill name from the fixture files and filters run history automatically
- Providing both flags validates that they agree; mismatched skill names produce a clear error
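The three rules above can be sketched as one resolution function. This Python version is illustrative only. The read_skill_name callable stands in for however agc parses a skill name out of fixture files, which is not documented here, and the case where neither flag is given (auto-discovery) is not modeled.

```python
from pathlib import Path
from typing import Callable, Optional, Tuple


def resolve_skill(
    skill: Optional[str],
    fixture: Optional[Path],
    read_skill_name: Callable[[Path], str],
    fixtures_root: Path = Path("fixtures"),
) -> Tuple[Optional[str], Optional[Path]]:
    """Return (skill name, fixture path) for a --skill / --fixture combination."""
    if skill is None and fixture is None:
        return None, None                      # auto-discovery, not modeled here
    if fixture is not None:
        name = read_skill_name(fixture)        # skill name comes from the fixture
        if skill is not None and skill != name:
            raise ValueError(
                f"--skill {skill!r} does not match fixture skill {name!r}"
            )
        return name, fixture
    path = fixtures_root / skill               # --skill alone implies fixtures/<skill>/
    if not path.is_dir():
        raise FileNotFoundError(f"no fixture directory at {path}")
    return skill, path


# Hypothetical reader: treat the directory name as the skill name.
def name_from_dir(p: Path) -> str:
    return p.name


print(resolve_skill("my-skill", Path("fixtures/my-skill"), name_from_dir))
```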
Reports
# List recent runs
# Inspect a run (add -v / --verbose for full case details)
# Export as a signed evidence bundle (always includes full detail)
Every exported tarball includes:
- run.json: full run record with all case results
- metrics.json: compliance metrics scoped to the run's skill
- report.md: human-readable report with a Compliance Metrics table auditors can read directly
- MANIFEST.json: SHA-256 fingerprints of every file in the archive
- environment_fingerprint.json and fixture_bundle.lock for reproducibility
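A reviewer can recheck the SHA-256 fingerprints independently after extracting the tarball. The sketch below assumes MANIFEST.json is a flat JSON object mapping relative file names to hex digests. The real schema may differ, so treat the parsing as a placeholder. It does not verify the minisign signature.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bundle_dir: Path) -> List[str]:
    """Return the names of files that are missing or whose digest differs.

    Assumes MANIFEST.json looks like {"run.json": "<hex digest>", ...}.
    """
    manifest = json.loads((bundle_dir / "MANIFEST.json").read_text())
    mismatched = []
    for name, expected in manifest.items():
        path = bundle_dir / name
        if not path.is_file() or sha256_of(path) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list from verify_manifest(Path("extracted-bundle")) means every listed file matches its recorded fingerprint. The directory name here is only an example.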
Exit Codes
Exit codes (consistent across all commands):
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Failure (tests failed, regression detected) |
| 2 | Invalid arguments |
| 3 | Config error |
| 4 | Runtime error (IO, network, DB) |
| 5 | Not found |
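Because the codes are consistent across commands, a wrapper script can tell a genuine test failure from an infrastructure problem. Here is a small Python sketch. The enum member names and the three-way classification are illustrative choices. Only the numeric values come from the table above.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    FAILURE = 1            # tests failed, regression detected
    INVALID_ARGUMENTS = 2
    CONFIG_ERROR = 3
    RUNTIME_ERROR = 4      # IO, network, DB
    NOT_FOUND = 5


def classify(returncode: int) -> str:
    """Map a process return code to 'pass', 'fail', 'error', or 'unknown'."""
    try:
        code = ExitCode(returncode)
    except ValueError:
        return "unknown"   # not one of the documented codes
    if code is ExitCode.SUCCESS:
        return "pass"
    if code is ExitCode.FAILURE:
        return "fail"
    return "error"         # codes 2-5: the run itself did not complete normally


for rc in (0, 1, 4, 5, 42):
    print(rc, classify(rc))
```

In practice the return code would come from something like subprocess.run(...).returncode. A pipeline might retry on "error" but block a merge on "fail".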
Configuration
Copy agentcarousel.example.toml to agentcarousel.toml. All configuration options are documented in the example file.
Bundles
A bundle is a signed, distributable archive of a skill's fixture, cases, and evidence.
# Pack a bundle
# Verify bundle integrity
# Pull from registry (set AGENTCAROUSEL_API_TOKEN first)
# Publish to registry
# Publish multiple runs
Trust Checks
Trust checks query an agent's registry state and confirm a deployed agent is certified and untampered. Use them in CI gates and governed workflows before the agent runs.
# Check trust state from registry
# Verify with local attestation
Contributions
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
For fixture contributions, open an issue before implementation.