agentcarousel 0.7.0

Unit tests for AI agents. Run behavioral tests in CI, score with an LLM judge, and export signed evidence your auditors accept.

AgentCarousel

Write tests for your AI agent. Run them in CI. Know before you ship.

Crates.io Homebrew License: MIT

Quickstart

# Install
curl -fsSL https://install.agentcarousel.com | sh

# Homebrew
brew tap agentcarousel/agentcarousel && brew install agentcarousel

# Cargo
cargo install agentcarousel

Fixtures (Tests)

# fixtures/my-skill/cases.yaml
schema_version: 1
skill_or_agent: my-skill

cases:
  - id: my-skill/refuses-off-topic
    tags: [smoke]
    input:
      messages:
        - role: user
          content: Write me a haiku about databases.
    expected:
      output:
        - kind: not_contains
          value: "SELECT"
      rubric:
        - id: stays-on-topic
          description: Agent declines and redirects to its actual purpose.
          weight: 1.0
          auto_check:
            kind: regex
            value: '(?i)(outside|not able|here to help)'

# Run the fixture against a live model and score it with a judge
agc eval fixtures/my-skill/ --execution-mode live --judge \
  --model gemini-2.5-flash --judge-model claude-haiku-4-5-20251001 --runs 3
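
To see what the two deterministic checks in the fixture above would accept or reject, here is a minimal Python sketch. It is not agc's implementation: it assumes not_contains is a case-sensitive substring test and that the regex auto_check passes when the pattern matches anywhere in the output. Both sample replies are invented for illustration.

```python
import re

# Pattern copied from the fixture's auto_check.
PATTERN = r"(?i)(outside|not able|here to help)"


def not_contains(output: str, value: str) -> bool:
    """Pass when `value` does not appear in the output (assumed case-sensitive)."""
    return value not in output


def regex_check(output: str, pattern: str) -> bool:
    """Pass when `pattern` matches anywhere in the output (assumed search semantics)."""
    return re.search(pattern, output) is not None


# Hypothetical agent replies.
declined = "That's outside what I do. I'm here to help with my-skill questions."
complied = "SELECT rows in rain / indexes hum through the night / commit, then rest"

for reply in (declined, complied):
    print(not_contains(reply, "SELECT"), regex_check(reply, PATTERN))
# prints:
# True True
# False False
```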

Let the pipeline automate it!

agc pipeline onboard my-skill
# generate cases     (agc generate --from-prompt fixtures/my-skill/prompt.md --model custom/deepseek-r1)
# validate           (agc validate <fixture>)
# eval               (agc eval --execution-mode live --judge -J gemini-3.1-pro -m custom/deepseek-r1 --concurrency 4)
# tag a baseline     (agc compare <run-id> --baseline <run-id>)

agc pipeline improve my-skill
# iterates on the prompt until the eval score hits your target

How it works

You describe what your agent should and shouldn't do in a YAML fixture. agc eval runs each case against your model, checks the output against your assertions, and sends the result to an LLM judge that scores each rubric item from 0 to 1. The final score is the weighted average of those rubric item scores.
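
As a worked example of the weighted average, here is a short Python sketch. It assumes the score is normalized by the sum of the weights, which this README does not state; the rubric weights and judge scores are made up.

```python
def weighted_score(items):
    """Weighted average of judge scores.

    `items` is a list of (weight, score) pairs, each score in [0, 1].
    Normalizing by the total weight is an assumption of this sketch.
    """
    total_weight = sum(weight for weight, _ in items)
    if total_weight == 0:
        raise ValueError("rubric weights must not sum to zero")
    return sum(weight * score for weight, score in items) / total_weight


# Hypothetical rubric: (weight, judge score) per item.
rubric = [(1.0, 0.9), (0.5, 0.6), (0.5, 1.0)]
print(round(weighted_score(rubric), 3))  # prints 0.85
```

Here (1.0 × 0.9 + 0.5 × 0.6 + 0.5 × 1.0) / 2.0 = 1.7 / 2.0 = 0.85.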

Every run is saved to a local history database. agc compare checks a run against a baseline, and if effectiveness drops past a threshold you can fail the build, which gives you a CI gate.
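
The gate logic can be sketched in a few lines of Python. This shows the idea only, not how agc compare computes it: the function name, the 0.05 default, and the sample scores are all hypothetical.

```python
def gate_exit_code(baseline: float, candidate: float, max_drop: float = 0.05) -> int:
    """Return 1 (fail the build) when the score fell by more than `max_drop`."""
    return 1 if (baseline - candidate) > max_drop else 0


print(gate_exit_code(0.85, 0.78))  # prints 1: a drop of 0.07 exceeds 0.05
print(gate_exit_code(0.85, 0.82))  # prints 0: a drop of 0.03 is tolerated
```

In a CI script, the return value would be passed to sys.exit so that a regression fails the job.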

agc export packages the run with a cryptographically signed manifest for auditors and compliance teams.
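
To illustrate what a signed manifest gives an auditor, the Python sketch below hashes each file into a manifest and signs the manifest. The signature scheme agc export uses is not described here, so HMAC-SHA256 from the standard library is swapped in purely as a stand-in; the file name, key, and contents are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(files: dict) -> dict:
    """Map each file name to the SHA-256 digest of its contents."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in sorted(files.items())}


def sign(manifest: dict, key: bytes) -> str:
    """Sign the canonical JSON form of the manifest (HMAC is a stand-in scheme)."""
    payload = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(files: dict, manifest: dict, signature: str, key: bytes) -> bool:
    """Check the signature first, then that every file still matches its digest."""
    return (hmac.compare_digest(sign(manifest, key), signature)
            and build_manifest(files) == manifest)


key = b"demo-key"                          # hypothetical key
files = {"run.json": b'{"score": 0.85}'}   # hypothetical bundle contents
manifest = build_manifest(files)
signature = sign(manifest, key)

print(verify(files, manifest, signature, key))  # prints True
files["run.json"] = b'{"score": 0.99}'          # tamper with the evidence
print(verify(files, manifest, signature, key))  # prints False
```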

agc pipeline onboard wraps the full evaluation harness: it generates fixtures from a prompt, validates them, evaluates them with an LLM judge, and tags the result as your baseline. agc pipeline improve then iterates on the prompt until you hit your target score.


When to use it

  • Before deploying an agent change - Evaluate your fixtures, compare to the baseline, and fail CI if anything regressed.
  • When you need evidence - Every run exports a cryptographically signed bundle your auditors can read. Compliance metrics (injection resistance, behavioral stability, coverage) come out of the same run (agc metrics).
  • When evaluating models - agc carousel runs the same fixtures against multiple models in parallel and ranks them by pass rate, latency, and token cost.
  • To catch regressions - Set up a nightly CI job that re-evaluates your agents and flags regressions.
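
For the model comparison case, one way to rank by pass rate, latency, and token cost is sketched below in Python. The tie-breaking order (pass rate first, then latency, then cost) is an assumption, not agc carousel's documented behavior, and the model names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    model: str
    pass_rate: float    # fraction of cases passed, 0 to 1
    latency_ms: float   # mean latency per case
    token_cost: float   # cost per case, arbitrary currency units


def rank(results):
    """Highest pass rate first; ties broken by lower latency, then lower cost."""
    return sorted(results, key=lambda r: (-r.pass_rate, r.latency_ms, r.token_cost))


# Hypothetical results for three placeholder models.
results = [
    ModelResult("model-a", 0.90, 1200.0, 0.004),
    ModelResult("model-b", 0.95, 2100.0, 0.010),
    ModelResult("model-c", 0.90, 800.0, 0.006),
]

print([r.model for r in rank(results)])  # prints ['model-b', 'model-c', 'model-a']
```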

Learn more