AgentCarousel

Unit tests for AI agents. Define behavior in YAML, run tests offline, and export signed evidence bundles your reviewers will accept.

Why agentcarousel

Deterministic by default - Offline runs use mocks, so the same inputs produce the same outputs every time.
Built for evidence - Every run produces a signed artifact (.tar.gz + minisign attestation) you can hand to an auditor, a reviewer, or your customer's security team.
Live evals when you want them - Plug in OpenAI, Anthropic, Gemini, or OpenRouter as generator and judge, then diff runs to catch regressions.
Compliance-aware fixtures - Risk tier, data handling, and certification track: the metadata your governance program already tracks, baked into the test format.

Install

# Linux install; for Windows, download the .zip from Releases
curl -fsSL https://install.agentcarousel.com | sh

# Homebrew (macOS)
brew tap agentcarousel/agentcarousel && brew install agentcarousel

# Cargo (Rust)
cargo install agentcarousel

Quickstart

# Scaffold a skill or agent fixture
agentcarousel init --skill my-skill
agentcarousel init --agent my-agent

# Run it offline (no API keys needed)
agentcarousel test fixtures/skills/my-skill.yaml --offline true

# Validate fixture schema and rules
agentcarousel validate fixtures/skills/customer-support.yaml

# Evaluate (mock by default)
agentcarousel eval fixtures/skills/customer-support.yaml

# Export evidence bundle
agentcarousel export <RUN-ID>

Live Eval with LLM-as-a-judge

export GEMINI_API_KEY=gemini_key
export OPENROUTER_API_KEY=or_key
export ANTHROPIC_API_KEY=claude_api_key
export OPENAI_API_KEY=openai_key

# Run all cases; judge-backed cases use the judge
agentcarousel eval fixtures/ \
  --execution-mode live \
  --evaluator all --judge \
  --model gemini-2.5-flash \
  --judge-model claude-haiku-4-5-20251001 \
  --runs 1

# Narrow to specific cases by id glob or tag
agentcarousel eval fixtures/ \
  --evaluator all --judge \
  --filter "customer-support/judge-*" \
  --filter-tags certification

--evaluator all uses each case's declared evaluator; --evaluator judge forces every case through the judge regardless. Use --filter (glob on skill/case-id) or --filter-tags (comma-separated) to scope runs.

Reports

# List recent runs (newest first)
agentcarousel report list

# Show a run (human-readable, same formatting as eval/test output)
agentcarousel report show <RUN-ID>

# Also accepts a path to run.json or an evidence directory
agentcarousel report show ./evidence/my-export/

# Diff two runs to surface regressions
agentcarousel report diff <RUN-ID-A> <RUN-ID-B>

# JSON output for scripting
agentcarousel report list --json
agentcarousel report show <RUN-ID> --json

Configuration (agentcarousel.toml)

Copy agentcarousel.example.toml to agentcarousel.toml and customize as needed.

Per-case effectiveness thresholds override the global --effectiveness-threshold flag via the evaluator_config.effectiveness_threshold field in YAML.
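
As a minimal sketch of a per-case override: only the evaluator_config.effectiveness_threshold field comes from the sentence above. The surrounding keys (cases, id, evaluator), the case ids, the evaluator names, and the 0-to-1 threshold scale are hypothetical placeholders, so check them against the fixture that agentcarousel init generates before relying on this layout.

```yaml
# Hypothetical fixture excerpt. Only evaluator_config.effectiveness_threshold
# is taken from this README; every other key and value is an illustrative assumption.
cases:
  - id: judge-refund-policy          # hypothetical case id
    evaluator: judge                 # assumed key naming the case's evaluator
    evaluator_config:
      effectiveness_threshold: 0.9   # assumed 0-1 scale; overrides the global flag for this case
  - id: greeting-smoke-test          # hypothetical case id
    evaluator: rules                 # assumed evaluator name
    # No evaluator_config here, so the global --effectiveness-threshold value applies.
```

Under these assumptions, a run started with --effectiveness-threshold 0.7 would hold the first case to 0.9 and the second to 0.7.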

Bundle workflows

# Create a distributable bundle archive
agentcarousel bundle pack fixtures/bundles/my-bundle --out my-bundle.tar.gz

# Verify bundle integrity and structure
agentcarousel bundle verify fixtures/bundles/customer-support
agentcarousel bundle verify my-bundle.tar.gz

# Pull bundle manifest + artifacts from the registry
agentcarousel bundle pull customer-support-1.0.0 --url "https://api.agentcarousel.com"

Publish to registry

# Publish bundle + evidence in one flow
agentcarousel publish fixtures/bundles/customer-support \
  --url "https://api.agentcarousel.com"

# Publish multiple matching local runs (newest first)
agentcarousel publish fixtures/bundles/customer-support \
  --url "https://api.agentcarousel.com" \
  --all-runs --limit 5

Trust checks

# Registry trust-state check
agentcarousel trust-check customer-support@1.0.0 \
  --url "https://api.agentcarousel.com"

# Optional offline attestation verification
agentcarousel trust-check customer-support@1.0.0 \
  --url "https://api.agentcarousel.com" \
  --attestation ./attestation-customer-support-1.0.0.json \
  --minisign-pubkey ./your-minisign.pub

Contributions

Start here: CONTRIBUTING.md
Security policy: SECURITY.md
Changelog: CHANGELOG.md

For fixture contributions, open an issue before implementation.

agentcarousel 0.5.0