AgentCarousel
Unit tests for AI agents. Define behavior in YAML, run offline tests, export signed evidence bundles your reviewers will accept.
Why AgentCarousel
- Deterministic by default - Offline runs with mocks mean same inputs → same outputs, every time.
- Built for evidence - Every run produces a signed artifact (.tar.gz + minisign attestation) you can hand to an auditor, a reviewer, or your customer's security team.
- Live evals when you want them - Plug in OpenAI, Anthropic, Gemini, or OpenRouter as generator and judge, then diff runs to catch regressions.
- Compliance-aware fixtures - Risk tier, data handling, and certification track: the metadata your governance program already tracks, baked into the test format.
Install
# Linux install; for Windows, download .zip from Releases
|
# Homebrew (macOS)
&&
# Cargo (Rust)
Quickstart
# Scaffold a skill or agent fixture
# Run it offline (no API keys needed)
# Validate fixture schema and rules
# Evaluate (mock by default)
# Export evidence bundle
Live Eval with LLM-as-a-judge
# Run all cases; judge-backed cases use the judge
# Narrow to specific cases by id glob or tag
--evaluator all uses each case's declared evaluator; --evaluator judge forces every case through the judge regardless. Use --filter (glob on skill/case-id) or --filter-tags (comma-separated) to scope runs.
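The selection rules described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not agentcarousel's implementation: the case dictionaries, the `rules` evaluator name, and the helper names `select_cases` and `resolve_evaluator` are hypothetical, and the sketch assumes a case matches `--filter-tags` when it carries any one of the listed tags.

```python
from fnmatch import fnmatchcase


def select_cases(cases, filter_glob=None, filter_tags=None):
    """Keep cases whose id matches the glob and that carry at least one wanted tag.

    Assumption: tag matching is "any of", and both filters must pass when both are given.
    """
    wanted = set()
    if filter_tags:
        # --filter-tags is comma-separated; tolerate stray whitespace.
        wanted = {t.strip() for t in filter_tags.split(",") if t.strip()}

    selected = []
    for case in cases:
        if filter_glob and not fnmatchcase(case["id"], filter_glob):
            continue
        if wanted and not (wanted & set(case.get("tags", []))):
            continue
        selected.append(case)
    return selected


def resolve_evaluator(case, evaluator_flag):
    """'all' defers to the case's declared evaluator; any other value is forced."""
    if evaluator_flag == "all":
        return case["evaluator"]
    return evaluator_flag


# Hypothetical cases, identified as skill/case-id.
CASES = [
    {"id": "summarizer/happy-path", "tags": ["smoke"], "evaluator": "rules"},
    {"id": "summarizer/tone", "tags": ["quality"], "evaluator": "judge"},
    {"id": "router/basic", "tags": ["smoke"], "evaluator": "rules"},
]
```

With these sample cases, `select_cases(CASES, filter_glob="summarizer/*")` keeps the two summarizer cases, and `resolve_evaluator(CASES[0], "judge")` returns `"judge"` even though the case declares `rules`.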
Reports
# List recent runs (newest first)
# Show a run (human-readable, same formatting as eval/test output)
# Also accepts a path to run.json or an evidence directory
# Diff two runs to surface regressions
# JSON output for scripting
Configuration (agentcarousel.toml)
Copy agentcarousel.example.toml to agentcarousel.toml and customize as needed.
Per-case effectiveness thresholds override the global --effectiveness-threshold flag via the evaluator_config.effectiveness_threshold field in YAML.
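The precedence rule above amounts to "use the per-case value if present, otherwise fall back to the global flag". A minimal Python sketch follows, assuming a case is loaded from YAML into a dictionary and that a case passes when its score is greater than or equal to the threshold. The helper names and the pass comparison are assumptions for illustration, not the tool's actual code.

```python
def effective_threshold(case, global_threshold):
    """Return the per-case threshold if set, else the global one.

    `case` is assumed to be the parsed YAML for one case; only the
    evaluator_config.effectiveness_threshold field comes from the docs.
    """
    config = case.get("evaluator_config") or {}
    per_case = config.get("effectiveness_threshold")
    # Compare against None so an explicit 0.0 still counts as an override.
    return global_threshold if per_case is None else per_case


def passes(score, case, global_threshold):
    """Assumed pass rule: score must meet or exceed the effective threshold."""
    return score >= effective_threshold(case, global_threshold)


# Hypothetical cases: one overrides the threshold, one inherits the global flag.
strict_case = {"id": "summarizer/tone", "evaluator_config": {"effectiveness_threshold": 0.9}}
default_case = {"id": "summarizer/happy-path"}
```

Here a score of 0.8 with a global threshold of 0.7 passes `default_case` but fails `strict_case`, because the per-case value of 0.9 takes precedence.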
Bundle workflows
# Create a distributable bundle archive
# Verify bundle integrity and structure
# Pull bundle manifest + artifacts from the registry
Publish to registry
# Publish bundle + evidence in one flow
# Publish multiple matching local runs (newest first)
Trust checks
# Registry trust-state check
# Optional offline attestation verification
Shell completions
# Zsh
# Bash
# Fish
Contributions
- Start here: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
For fixture contributions, open an issue before implementation.