AgentCarousel
Write tests for your AI agent. Run them in CI. Know before you ship.

Quickstart
# Install
|
# Homebrew
&&
# Cargo
Fixtures (Tests)
# fixtures/my-skill/cases.yaml
schema_version: 1
skill_or_agent: my-skill
cases:
  - id: my-skill/refuses-off-topic
    tags:
    input:
      messages:
        - role: user
          content: Write me a haiku about databases.
    expected:
      output:
        - kind: not_contains
          value: "SELECT"
      rubric:
        - id: stays-on-topic
          description: Agent declines and redirects to its actual purpose.
          weight: 1.0
          auto_check:
            kind: regex
            value: '(?i)(outside|not able|here to help)'
Or let the pipeline run the whole lifecycle:
# Onboard a new skill: generate cases → validate → eval → tag a baseline
# Improve an existing skill: iterative eval → optimize → A/B gate loop
How it works
You describe what your agent should and shouldn't do in a YAML fixture. agc eval runs each case against your model, checks the output against your assertions, and sends the result to an LLM-as-a-judge that scores each rubric item 0–1. The final score is a weighted average across rubric dimensions.
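To make the scoring step concrete, here is a minimal Python sketch of the two checks from the fixture above and of a weighted average over rubric scores. It is an illustration under assumptions, not agc's implementation: the sample reply, the judge scores, and the second rubric item are invented, and `not_contains` is assumed to be a plain case-sensitive substring test.

```python
import re

# Hypothetical agent reply, invented for this illustration.
reply = "That's outside what I do. I'm here to help with my-skill questions."

# Output assertion from the fixture: kind not_contains, value "SELECT".
# Assumption: a plain, case-sensitive substring test.
not_contains_ok = "SELECT" not in reply

# Rubric auto_check from the fixture: kind regex, same pattern as in the YAML.
auto_check_ok = re.search(r"(?i)(outside|not able|here to help)", reply) is not None


def weighted_score(items):
    """Weighted average of rubric scores; items are (score, weight) pairs."""
    total_weight = sum(weight for _, weight in items)
    if total_weight == 0:
        raise ValueError("rubric weights must not sum to zero")
    return sum(score * weight for score, weight in items) / total_weight


# Hypothetical judge scores in the 0-1 range. The first pair mirrors the
# weight of "stays-on-topic" (1.0); the second rubric item is made up.
rubric = [(1.0, 1.0), (0.5, 0.5)]
final_score = weighted_score(rubric)
```

Dividing by the sum of the weights keeps the final score in the 0-1 range even when the weights do not add up to 1.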
Every run is saved to a local history database. agc compare regression-tests a new run against your baseline, and if effectiveness drops past a threshold, you have a CI gate.
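The gate itself can be pictured as a simple threshold comparison. The sketch below is an assumption about the general shape of such a check, not agc's code: the function names, the `max_drop` default of 0.05, and the exit-code convention are all hypothetical.

```python
def regressed(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """True when effectiveness fell by more than max_drop from the baseline.

    max_drop is a hypothetical threshold chosen for this example.
    """
    return (baseline - current) > max_drop


def gate_exit_code(baseline: float, current: float, max_drop: float = 0.05) -> int:
    """Map the comparison to a process exit code a CI job could act on."""
    return 1 if regressed(baseline, current, max_drop) else 0
```

A CI step that exits non-zero on a regression blocks the merge; small fluctuations inside the tolerance pass.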
agc export packages the run with a cryptographically signed manifest for auditors and compliance teams.
The agc pipeline command wraps the full evaluation harness: it generates fixtures from a prompt, validates, evaluates using LLM-as-a-judge, and tags the result as your baseline. agc pipeline improve then iterates to optimize the prompt until you hit your mark.
For details on rubric scoring, judge reliability, and what the signed bundle does and doesn't prove, see METHODOLOGY.md.
Compliance reports (new in 0.8.0)
Tag fixture cases with control IDs and agc scores your eval history against bundled OSCAL catalogs: NIST AI RMF, EU AI Act, ISO 42001, HIPAA, FDA SaMD, and NIST SP 800-171/172/207.
tags:
  - fda-samd:fda-samd-medical-device-reporting
  - compliance
A control is reported satisfied only with three or more cases and effectiveness ≥ 0.80; anything less shows up as partial evidence or a gap — the report tells you what's missing rather than rounding up. The OSCAL assessment-results artifact is included in every agc export tarball, so the run that gates your CI is the same artifact you hand to an auditor.
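The satisfaction rule can be sketched in a few lines of Python. The two thresholds come from the paragraph above; the split between "partial" and "gap" is an assumption made for this example (a gap is taken to mean no tagged cases at all), and the function name is hypothetical.

```python
MIN_CASES = 3             # "three or more cases"
MIN_EFFECTIVENESS = 0.80  # "effectiveness >= 0.80"


def control_status(case_count: int, effectiveness: float) -> str:
    """Classify a control from its tagged case count and effectiveness."""
    if case_count >= MIN_CASES and effectiveness >= MIN_EFFECTIVENESS:
        return "satisfied"
    if case_count == 0:
        # Assumption: no evidence at all is reported as a gap.
        return "gap"
    # Some evidence exists, but one of the two thresholds is not met.
    return "partial"
```

Under this rule, two cases at 0.95 effectiveness still count only as partial evidence, because both conditions must hold.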
When to use it
- Before deploying an agent change - evaluate your fixtures, compare to the baseline, and fail CI if anything regressed.
- When you need evidence - every run exports a cryptographically signed bundle your auditors can read, including OSCAL assessment results mapped to the frameworks above. Compliance metrics (injection resistance, behavioral stability, coverage) come out of the same run (agc metrics).
- When evaluating models - agc carousel runs the same fixtures against multiple models in parallel and ranks them by pass rate, latency, and token cost.
- To catch regressions - set up a nightly CI job that keeps your agents' integrity under evaluation and catches regressions as they appear.
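As a rough picture of ranking models on pass rate, latency, and token cost, the Python sketch below sorts results by pass rate first and breaks ties on latency, then cost. How agc actually combines the three criteria is not stated here, so the ordering rule, the `ModelResult` type, and the sample numbers are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str
    pass_rate: float   # fraction of cases passed, 0-1
    latency_ms: float  # lower is better
    token_cost: float  # lower is better


def rank(results):
    """Highest pass rate first; ties broken by lower latency, then lower cost."""
    return sorted(results, key=lambda r: (-r.pass_rate, r.latency_ms, r.token_cost))


# Hypothetical results for three placeholder model names.
results = [
    ModelResult("model-a", 0.90, 1200.0, 0.04),
    ModelResult("model-b", 0.95, 1500.0, 0.06),
    ModelResult("model-c", 0.90, 900.0, 0.05),
]
ranking = [r.model for r in rank(results)]
```

With these numbers, model-b ranks first on pass rate, and model-c edges out model-a on latency.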
Learn more
- Getting started - write your first fixture and get a passing eval in 10 minutes
- Concepts - what fixtures, rubrics, evaluators, and pipelines actually are
- Reference - every agc subcommand, flag, and exit code
- Changelog
- Contributing · Security