AgentCarousel
Unit tests for AI agents. Determine trust before you deploy: run behavioral tests in CI, score with an LLM judge, and export signed evidence your auditors accept.
AgentCarousel delivers a repeatable, automated way to assess AI agent efficacy and behavior — establishing the trust required before deployment. Tests run deterministically in CI, semantic scoring comes from an LLM-as-a-judge, and results can be certified by a domain expert with a signed attestation.
Why AgentCarousel
- Behavioral certainty before deployment: Declarative YAML fixtures pin what your agent should and shouldn't say. Same inputs, same outputs, every time, without touching a live API.
- Evidence that stands up to scrutiny: Every run exports a signed bundle (.tar.gz + minisign attestation) that a domain expert can certify, ready for auditors, procurement teams, and government regulators.
- Semantic scoring, not just pattern matching: An LLM-as-a-judge evaluates outputs with contextual understanding, catching regressions that pattern matching would miss.
- Built for regulated environments: Risk tier, data handling, and certification track are first-class fixtures. Integrates into CI and produces governance artifacts for your compliance program.
Install
# Linux install; for Windows, download .zip from Releases
# Homebrew (macOS)
# Cargo (Rust)
Quickstart
# Scaffold a skill fixture
# Run (mock mode by default, no API keys needed)
# Validate fixture schema and rules
# Evaluate fixtures
# Export the last evaluation as an evidence tarball
See fixtures/regex-builder/ for the full fixture with all cases, golden outputs, and bundle manifest.
Live Eval with LLM-as-a-Judge
# Generator LLM key (the model being tested)
# or OPENAI_API_KEY / OPENROUTER_API_KEY
# Judge LLM key (the model acting as judge)
# or bring your own provider
# Run skill fixtures for regex-builder against live APIs with LLM judge
Execution modes: --execution-mode live hits real LLM APIs. Omit it (or pass mock) for deterministic offline runs.
Evaluators: --evaluator all honors each case's declared evaluator. --evaluator judge routes every case through the LLM judge regardless of what the case declares. --evaluator mock skips LLM calls entirely.
Filters: --filter matches on skill/case-id; --filter-tags accepts comma-separated tags (e.g. database, safety).
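The evaluator rules above amount to a small dispatch table. The following is a minimal sketch of that logic only, not AgentCarousel's implementation: the helper resolve_evaluator and the declared evaluator name "rules" are hypothetical, chosen purely for illustration.

```shell
# Hypothetical helper (not part of the agentcarousel CLI).
# $1: value passed to --evaluator (all | judge | mock)
# $2: evaluator declared by the fixture case ("rules" is an assumed name)
resolve_evaluator() {
  case "$1" in
    all)   printf '%s\n' "$2" ;;   # honor the case's declared evaluator
    judge) printf 'judge\n' ;;     # route every case through the LLM judge
    mock)  printf 'mock\n' ;;      # skip LLM calls entirely
    *)     echo "unknown evaluator: $1" >&2; return 1 ;;
  esac
}

resolve_evaluator all rules     # prints: rules
resolve_evaluator judge rules   # prints: judge
resolve_evaluator mock rules    # prints: mock
```

Only the all setting lets results depend on what each case declares; judge and mock override the declaration for the whole run.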
Reports
# List recent runs
# Inspect a run
# Export as a signed evidence bundle
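Because the evidence bundle is a .tar.gz with a minisign attestation, its signature can in principle also be checked independently with the minisign CLI. A minimal sketch follows. It assumes the attestation is a detached signature stored next to the tarball as <bundle>.minisig (minisign's default location) and that you hold the signer's public key file. The helper verify_evidence and all file names are hypothetical, and AgentCarousel's actual bundle layout may differ.

```shell
# Hypothetical helper (not part of the agentcarousel CLI).
# $1: path to the exported evidence tarball (placeholder name below)
# $2: path to the signer's minisign public key file
verify_evidence() {
  bundle="$1"
  pubkey="$2"

  # Fail early with a distinct exit code if any input is missing.
  [ -f "$bundle" ]         || { echo "missing bundle: $bundle" >&2; return 2; }
  [ -f "$bundle.minisig" ] || { echo "missing signature: $bundle.minisig" >&2; return 2; }
  [ -f "$pubkey" ]         || { echo "missing public key: $pubkey" >&2; return 2; }

  # -V verifies, -p names the public key file, -m names the signed file.
  # minisign looks for the signature at "$bundle.minisig" by default.
  minisign -V -p "$pubkey" -m "$bundle"
}

# Example call (file names are placeholders):
#   verify_evidence evidence.tar.gz expert.pub
```

A non-zero exit status from this function can fail a CI step, so an unsigned or altered bundle is caught before it is handed to an auditor.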
Configuration
Copy agentcarousel.example.toml to agentcarousel.toml. All configuration options are documented in the example file.
Bundles
A bundle is a signed, distributable archive of a skill's fixture, cases, and evidence.
# Pack a bundle
# Verify bundle integrity
# Pull from registry
# Publish to registry
# Publish multiple runs
Trust Checks
Trust checks query a skill's registry state for use in CI gates and governed workflows — verify a deployed agent is certified and untampered before it runs.
# Check trust state from registry
# Verify with local attestation
Contributions
- Start here: CONTRIBUTING.md
- Security policy: SECURITY.md
- Changelog: CHANGELOG.md
For fixture contributions, open an issue before implementation.