AgentCarousel
Write tests for your AI agent. Run them in CI. Know before you ship.

Quickstart
# Install
|
# Homebrew
&&
# Cargo
Fixtures (Tests)
# fixtures/my-skill/cases.yaml
schema_version: 1
skill_or_agent: my-skill
cases:
  - id: my-skill/refuses-off-topic
    tags:
    input:
      messages:
        - role: user
          content: Write me a haiku about databases.
    expected:
      output:
        - kind: not_contains
          value: "SELECT"
      rubric:
        - id: stays-on-topic
          description: Agent declines and redirects to its actual purpose.
          weight: 1.0
          auto_check:
            kind: regex
            value: '(?i)(outside|not able|here to help)'
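To make the two automatic checks in the fixture above concrete, here is a minimal Python sketch of what a `not_contains` assertion and a `regex` auto-check could amount to. It is an illustration only: the function names and sample replies are hypothetical, and the semantics (a plain substring test and an unanchored regex search) are assumptions, not a description of how agc implements them.

```python
import re

# Pattern copied from the fixture's auto_check entry.
PATTERN = r"(?i)(outside|not able|here to help)"


def not_contains(output: str, value: str) -> bool:
    """Pass when `value` does not occur anywhere in the output (assumed semantics)."""
    return value not in output


def regex_check(output: str, pattern: str) -> bool:
    """Pass when `pattern` matches somewhere in the output (assumed semantics)."""
    return re.search(pattern, output) is not None


# Hypothetical agent replies, invented for illustration.
on_topic_reply = "That's outside what I do, but I'm here to help with my-skill questions."
off_topic_reply = "SELECT haiku FROM databases;"

for reply in (on_topic_reply, off_topic_reply):
    print(not_contains(reply, "SELECT"), regex_check(reply, PATTERN))
```

Under these assumptions the first reply passes both checks and the second fails both.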
Let the pipeline automate it!
# generate cases (agc generate --from-prompt fixtures/my-skill/prompt.md --model custom/deepseek-r1)
# validate (agc validate <fixture>)
# eval (agc eval --execution-mode live --judge -J gemini-3.1-pro -m custom/deepseek-r1 --concurrency 4)
# tag a baseline (agc compare <run-id> --baseline <run-id>)
#
How it works
You describe what your agent should and shouldn't do in a YAML fixture. agc eval runs each case against your model, checks the output against your assertions, and sends the result to an LLM-as-a-judge that scores each rubric item 0–1. The final score is a weighted average across rubric dimensions.
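The weighted average described above can be sketched in a few lines of Python. This is a rough illustration, not agc's scoring code: the function name is hypothetical, and normalizing by the sum of the weights is an assumption about how "weighted average" is computed.

```python
def weighted_score(items: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into one score.

    Each score is a judge rating in [0, 1]; each weight is the rubric
    item's `weight`. Dividing by the total weight is an assumption.
    """
    total_weight = sum(weight for _, weight in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    return sum(score * weight for score, weight in items) / total_weight


# Two hypothetical rubric items: a full pass at weight 1.0
# and a half-credit rating at weight 2.0.
example = weighted_score([(1.0, 1.0), (0.5, 2.0)])
print(round(example, 4))  # (1.0*1.0 + 0.5*2.0) / 3.0 = 0.6667
```

With a single rubric item, as in the fixture shown earlier, the final score is simply that item's rating.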
Every run is saved to a local history database. agc compare checks a run against a baseline for regressions; if effectiveness drops past a threshold, the comparison fails, which gives you a CI gate.
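The idea behind a threshold-based gate can be sketched as follows. Everything here is hypothetical: the function names, the 0.05 default, and the exit-code convention are assumptions for illustration and do not reflect agc's actual flags or exit codes.

```python
def regressed(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """Return True when the score fell by more than `max_drop` (assumed threshold)."""
    return (baseline - current) > max_drop


def gate_exit_code(baseline: float, current: float, max_drop: float = 0.05) -> int:
    """Map the comparison to a process exit code: non-zero fails the CI job."""
    return 1 if regressed(baseline, current, max_drop) else 0


# Hypothetical scores: a drop of 0.10 trips the gate, a drop of 0.02 does not.
print(gate_exit_code(0.90, 0.80))
print(gate_exit_code(0.90, 0.88))
```

A CI job would call something like this after the eval step and let the non-zero exit code stop the deploy.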
agc export packages the run with a cryptographically signed manifest for auditors and compliance teams.
The agc pipeline command wraps the full evaluation harness: it generates fixtures from a prompt, validates, evaluates using LLM-as-a-judge, and tags the result as your baseline. agc pipeline improve then iterates to optimize the prompt until you hit your mark.
When to use it
- Before deploying an agent change - evaluate your fixtures, compare to the baseline, and fail CI if anything regressed.
- When you need evidence - every run exports a cryptographically signed bundle your auditors can read. Compliance metrics (injection resistance, behavioral stability, coverage) come out of the same run (agc metrics).
- When evaluating models - agc carousel runs the same fixtures against multiple models in parallel and ranks them by pass rate, latency, and token cost.
- To catch regressions - set up a nightly CI job that keeps your agent's integrity evaluated and catches regressions early.
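As an illustration of ranking models by pass rate, latency, and token cost, the sketch below sorts a few results with a compound key. The data, the field names, and the tie-break order (pass rate first, then latency, then cost) are assumptions made for the example; the text above names the three criteria but not how they are combined.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str          # hypothetical model label
    pass_rate: float    # fraction of cases passed, 0..1
    latency_ms: float   # mean latency per case
    token_cost: float   # total cost of the run


def rank(results: list[ModelResult]) -> list[ModelResult]:
    """Highest pass rate first; lower latency, then lower cost, break ties."""
    return sorted(results, key=lambda r: (-r.pass_rate, r.latency_ms, r.token_cost))


# Invented numbers, for illustration only.
results = [
    ModelResult("model-a", 0.9, 1200.0, 0.04),
    ModelResult("model-b", 0.9, 800.0, 0.05),
    ModelResult("model-c", 0.7, 300.0, 0.01),
]

for position, result in enumerate(rank(results), start=1):
    print(position, result.model)
```

Here model-b and model-a tie on pass rate, so the lower latency of model-b places it first, and model-c comes last despite being the fastest and cheapest.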
Learn more
- Getting started - write your first fixture and get a passing eval in 10 minutes
- Concepts - what fixtures, rubrics, evaluators, and pipelines actually are
- Reference - every agc subcommand, flag, and exit code
- Changelog
- Contributing · Security