shadow-diff 3.0.6

Behavior contracts for AI agents — tested in your PR, enforced at runtime. Core engine: parser, writer, content-addressed store, replay, and nine-axis behavioral differ.

Shadow

Behavior contracts for AI agents.

Tested at PR-time. Enforced at runtime. Same YAML.

Why this exists

Your teammate opens a PR that tweaks the system prompt, swaps GPT-4o for a cheaper model, or adjusts a tool schema. Code review looks fine. Unit tests pass. You merge.

A week later a customer reports that the refund agent started issuing refunds without confirming the amount. The prompt edit dropped the "ask before refunding" step. The PR that caused it was merged days ago. Nobody saw it coming because the code looked harmless.

That's the bug class Shadow exists to catch. Agent behavior silently changed. Tests still pass. Code review can't see it. Shadow turns "how should this agent behave?" into a YAML contract — your CI tests every PR against it; your runtime enforces it against every tool call. Same rule, both places.

What Shadow does, in one screen

Given a baseline .agentlog and a candidate change, Shadow answers three questions on the PR:

  1. What behavior changed? A nine-axis diff scores response meaning, tool calls, refusals, length, latency, cost, output format, and more — with a plain-English summary on top.
  2. Why did it change? If the PR touched multiple things at once, regression attribution names the specific change that most likely explains each regression.
  3. Is it safe to merge? A YAML policy declares rules the agent must follow (tool ordering, output shape, forbidden outputs). The same policy enforces at runtime.

The report lands in the PR comment. No dashboard, no separate login, no trace upload. Traces stay on your disk.

Install

Step 1. Check you have Python 3.11 or newer:

python3 --version    # 3.11.x or higher

Step 2. Install Shadow from PyPI:

pip install shadow-diff

That's it. shadow --help should now work, and shadow quickstart runs the demo. No clone, no Rust toolchain, no separate setup step — pre-built wheels ship for Linux x86_64, macOS arm64 (Apple Silicon), and Windows x86_64.

On other platforms (Intel Mac, ARM Linux, older glibc, Alpine, FreeBSD), pip falls back to the source distribution and builds the Rust core locally. Install Rust first:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
pip install shadow-diff

Optional extras

Shadow's core install is lean. Most users want one of:

pip install 'shadow-diff[anthropic]'   # if your agent uses Claude
pip install 'shadow-diff[openai]'      # if your agent uses GPT
pip install 'shadow-diff[embeddings]'  # better paraphrase-robust diff
| Extra | Pulls in | Use case |
|---|---|---|
| shadow-diff[anthropic] | anthropic | Live Anthropic client wrapper for shadow record |
| shadow-diff[openai] | openai | Live OpenAI client wrapper for shadow record |
| shadow-diff[embeddings] | sentence-transformers | Paraphrase-robust semantic-similarity axis. The default lexical TF-IDF path stays fast and dependency-free; the Embedder trait accepts any backend (this extra, ONNX runtime, HF Inference API, OpenAI embeddings, …). |
| shadow-diff[otel] | opentelemetry-sdk | Export traces to any OTel-compatible backend |
| shadow-diff[serve] | fastapi, uvicorn, websockets | shadow serve HTTP dashboard |
| shadow-diff[mcp] | mcp | shadow mcp-serve Model Context Protocol server |
| shadow-diff[multimodal] | Pillow, imagehash | Image / multimodal diff |
| shadow-diff[sign] | sigstore | Sigstore keyless signing for ABOM certificates |
| shadow-diff[langgraph] | langgraph | LangGraph agent adapter |
| shadow-diff[crewai] | crewai | CrewAI agent adapter |
| shadow-diff[ag2] | ag2 | AG2 / Autogen agent adapter |
| shadow-diff[all] | everything above | One-shot install for trying the full feature surface |
| shadow-diff[dev] | test/lint/type-check tooling | Contributing to Shadow itself |

Combine extras with comma-separated form:

pip install 'shadow-diff[anthropic,openai,embeddings,otel]'

Telemetry: off by default, opt-in only

Shadow ships an opt-in usage-telemetry hook. When enabled, each event includes: CLI command / event name, SDK version, Python version, OS, CPU architecture, and an anonymous install ID (random UUID4 generated once, stored at ~/.shadow/install_id, never tied to identity). No traces, no prompts, no user data. Telemetry is off by default — the first-run prompt asks before enabling. CI environments are detected and skipped automatically. Hard-disable with SHADOW_TELEMETRY=off in your shell. Source: python/src/shadow/_telemetry.py.
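To pin it off explicitly (for example in a dotfile or CI config), use the documented kill switch:

export SHADOW_TELEMETRY=off   # hard-disables telemetry before any shadow command runs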

Shadow never uploads .agentlog content, prompt text, response text, or tool arguments. The only fields collected when telemetry is enabled are listed in the module docstring.

Try it in sixty seconds

shadow demo

That runs a real diff on bundled .agentlog fixtures. No API key, no agent code, nothing written to your working directory. The output looks like this (abbreviated):

signal             baseline  candidate    change     severity
─────────────────────────────────────────────────────────────
response meaning      1.000      0.435    -0.565     severe
tool calls            0.000      0.000    +0.000     none
refusals              0.000      0.333    +0.333     severe
response length      26.000     52.000   +26.000     minor
response time        98.000    412.000  +314.000     severe
output format         1.000      0.000    -1.000     severe

top divergences:
  #1  turn 0 — tool set changed: removed `search_files(query)`,
                                  added `search_files(limit,query)`
  #2  turn 2 — stop_reason changed: `end_turn` → `content_filter`

recommendations:
  error   Refusal rate is up severely. Check for stricter system instructions.
  error   Review tool-schema change at turn 0: call shape diverged.
  warning Review response text at turn 1: semantic content shifted.

Three things to read top-to-bottom: the severity column tells you which signals moved, the top-divergences list names the specific changes, and the recommendations tell you what to check first. A reviewer doesn't need to know any Shadow vocabulary — the recommendation lines speak plain English.

Use shadow quickstart when you want a writable copy of the demo files (agent.py, configs, fixtures) to edit and re-run:

shadow quickstart
shadow diff shadow-quickstart/fixtures/baseline.agentlog \
            shadow-quickstart/fixtures/candidate.agentlog

Get a Shadow comment on every PR (≈ 10 minutes)

Here's the end-to-end setup, from a fresh repo to your first Shadow comment landing on a real PR. Skip any step you've already done.

1. Install Shadow (one line):

pip install shadow-diff

2. Record a baseline. Wrap the place where your agent runs in a Session block. Shadow auto-instruments OpenAI / Anthropic SDK calls and writes a content-addressed .agentlog file:

# scripts/record_baseline.py
import shadow

with shadow.Session(output="baseline.agentlog"):
    run_my_agent()   # whatever your existing entry-point is

python scripts/record_baseline.py
git add baseline.agentlog && git commit -m "chore: pin agent baseline trace"

The trace stays on your disk and inside your repo. Nothing is uploaded.

3. Drop in the GitHub Action. One command scaffolds a workflow you can commit:

shadow init --github-action
git add .github/workflows/shadow-diff.yml && git commit -m "ci: shadow PR diff"

The generated workflow runs your agent against the same inputs on every PR, diffs the new run against baseline.agentlog, posts a comment, and (optionally) blocks the merge if a severe regression is detected.
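If you'd rather hand-roll the workflow, a minimal equivalent built only from commands documented in this README looks roughly like this. It's a sketch, not the generated file — the scaffolded workflow is canonical and also posts the PR comment via the bundled composite action; your_agent.py stands in for your real entry-point:

# Hand-rolled sketch — the generated .github/workflows/shadow-diff.yml is canonical
name: shadow-diff
on: pull_request

jobs:
  diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install shadow-diff
      # Re-run the agent against the PR's code; your_agent.py is illustrative
      - run: shadow record -o candidate.agentlog -- python your_agent.py
      # Diff against the committed baseline; fail the job on severe regressions
      - run: shadow diff baseline.agentlog candidate.agentlog --fail-on severe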

4. Open a PR. The Shadow comment lands automatically. It's a markdown comment with a verdict line ("Shadow recommends: hold this PR for review"), one-bullet plain-English recommendations, and a collapsible details section for reviewers who want the numbers. See docs/sample-pr-comment.md for what it looks like.

That's it. After this point everything below is optional depth.

Writing behavior rules

The diff tells you what changed. A policy tells you what is not allowed to change. Write one YAML file that declares the agent's behavioral contract:

# shadow-policy.yaml
rules:
  - id: confirm-before-refund
    kind: must_call_before
    params: { first: confirm_refund_amount, then: issue_refund }
    severity: error

  - id: never-leak-ssn
    kind: forbidden_text
    params: { text: "SSN:" }
    severity: error

  - id: finish-cleanly
    kind: required_stop_reason
    params: { allowed: [end_turn, tool_use] }
    severity: error

  - id: cost-ceiling
    kind: max_total_tokens
    params: { limit: 100000 }

Run:

shadow diff baseline.agentlog candidate.agentlog --policy shadow-policy.yaml

The candidate trace is checked against every rule. Violations that are new in the candidate are flagged as regressions. Violations that existed in the baseline and are now cleared are flagged as fixes. Twelve rule kinds ship today: must_call_before, must_call_once, no_call, max_turns, required_stop_reason, max_total_tokens, must_include_text, forbidden_text, must_match_json_schema, must_remain_consistent, must_followup, must_be_grounded (cheap lexical grounding gate, not NLI-backed faithfulness — see docs/features/policy.md for what it catches and what it doesn't).
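For flavour, two more kinds in the same style — a sketch only: the params spellings below (limit:, text:) are inferred by analogy with the rules above, so check docs/features/policy.md for the exact schema:

rules:
  - id: keep-runs-short
    kind: max_turns
    params: { limit: 12 }            # key name inferred from max_total_tokens

  - id: always-state-the-amount
    kind: must_include_text
    params: { text: "refund of $" }  # params shape mirrors forbidden_text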

must_match_json_schema is the structured-output assertion: every chat response is parsed as JSON and validated against a JSON Schema. Mismatches name the offending dotted path so reviewers see exactly which field broke.

rules:
  - id: structured-output
    kind: must_match_json_schema
    params:
      schema_path: schemas/refund_decision.schema.json
    severity: error

Supply either an inline schema: dict or a schema_path: pointing to a JSON Schema file. NaN / Infinity literals are rejected because they aren't valid JSON per RFC 8259, even though Python's parser accepts them.
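The inline form embeds the schema in the rule itself — a minimal sketch (the schema body is illustrative):

rules:
  - id: structured-output-inline
    kind: must_match_json_schema
    params:
      schema:
        type: object
        required: [decision, amount]
        properties:
          decision: { type: string, enum: [approve, deny] }
          amount: { type: number }
    severity: error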

Each rule can carry a when: clause that gates it on field-path conditions, so a rule fires only on the matching subset of pairs:

rules:
  - id: confirm-large-refunds
    kind: forbidden_text
    params: { text: "refund issued" }
    when:
      - { path: "request.params.amount", op: ">", value: 500 }
      - { path: "request.model", op: "==", value: "gpt-4.1" }

Supported operators: ==, !=, >, >=, <, <=, in, not_in, contains, not_contains. Multiple conditions AND together. Missing paths quietly don't match (rule is skipped on that pair) instead of crashing the whole check.
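For example, to AND a membership test with a substring test (the response.text path is illustrative — use whatever field paths your records actually expose):

when:
  - { path: "request.model", op: "in", value: ["gpt-4.1", "gpt-4o"] }
  - { path: "response.text", op: "contains", value: "refund" }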

This is the part that makes Shadow feel like CI for agents instead of monitoring. See docs/features/policy.md for the full rule reference, conditional gating semantics, and severity → --fail-on mapping.

Block bad behavior at runtime

The same policy file can run inside the SDK to block or replace a violating model response at record time, not just after the fact:

from shadow.policy_runtime import EnforcedSession, PolicyEnforcer

enforcer = PolicyEnforcer.from_policy_file("shadow-policy.yaml")
with EnforcedSession(enforcer=enforcer, output_path="run.agentlog") as s:
    s.record_chat(request=..., response=...)

When a recorded turn introduces a new violation, the session swaps the response for a refusal payload by default (stop_reason: "policy_blocked") so downstream code keeps running. Set on_violation="raise" for hard failure, or "warn" for log-only. The enforcer is incremental — whole-trace rules fire once when their threshold is crossed, not again on every subsequent record.
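A sketch of the hard-failure mode — assuming on_violation is passed where the session is constructed (check docs/features/runtime-enforcement.md for the exact signature); the exception type is whatever the SDK raises, so it's caught broadly here:

from shadow.policy_runtime import EnforcedSession, PolicyEnforcer

enforcer = PolicyEnforcer.from_policy_file("shadow-policy.yaml")
try:
    with EnforcedSession(enforcer=enforcer, output_path="run.agentlog",
                         on_violation="raise") as s:
        s.record_chat(request=..., response=...)
except Exception as exc:
    # A new violation now aborts the run instead of being swapped for a refusal
    print(f"policy violation, aborting run: {exc}")
    raise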

For dangerous tools (issue_refund, send_email, execute_sql, delete_user), wrap the tool registry to enforce BEFORE the function runs:

guarded = s.wrap_tools({
    "issue_refund": issue_refund,
    "delete_user": delete_user,
})
result = guarded["delete_user"](user_id="u-42")
# → blocked by no_call rule, real delete_user never called

The wrapper probes the enforcer with a synthesised candidate tool_call record. Tool-sequence rules (no_call, must_call_before, must_call_once) all work pre-dispatch. Response-text rules stay on record_chat. See docs/features/runtime-enforcement.md for the full surface, including standalone wrap_tools(..., records_provider=...) for framework-adapter integrations.
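For reference, the kind of no_call rule that blocks that dispatch might read like this — a sketch; the tool: key name is assumed, not taken from the rule reference:

rules:
  - id: never-delete-users
    kind: no_call
    params: { tool: delete_user }   # key name assumed — see docs/features/policy.md
    severity: error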

Beyond the basics

Everything above is the load-bearing pitch — install, the 10-minute walkthrough, writing a YAML rule, and runtime enforcement. That's the whole product for most users. Below are the deeper features, each backed by its own doc page; skip whatever isn't relevant to you.

  • Recording real agent traces — shadow record -- python your_agent.py auto-instruments Anthropic and OpenAI SDKs, redacts secrets by default, writes content-addressed .agentlog files. No code changes.
  • Framework adapters — first-class hooks for LangGraph, CrewAI, and AG2. The chat client patches still cover everything; the adapter just pulls in the framework's structural metadata (graph nodes, crew kickoffs, agent boundaries).
  • Sandboxed deterministic replay — replay a candidate trace under a different config without touching production. Real tool functions run with network/subprocess/FS-write blocked; the output is an ordinary .agentlog.
  • OpenTelemetry import — shadow import --format otel <export> converts existing OTel GenAI semantic-convention spans to .agentlog. See docs/reference/cli.md for the full flag set.
  • Agent Behavior Certificates — shadow certify produces a content-addressed JSON release artifact (model + system prompt hash + tool-schema hash + policy hash + optional regression-suite rollup), signed via sigstore keyless. shadow verify-cert validates content-addressing and signature against a specific signer identity.
  • MCP server — shadow mcp-serve exposes Shadow's diff / certify / verify / policy-check capabilities to any MCP-aware client over stdio. Lets agentic CLIs (Claude Desktop, Cline, etc.) treat Shadow as a tool.
  • Production trace mining — shadow mine <traces> clusters turn-pairs by tool sequence + stop reason and selects representative cases. Compresses a production trace dump into a regression suite. See docs/reference/cli.md.
  • Why regressions happened, not just what — shadow bisect (LASSO-based, stable CLI) attributes each regressing axis to the specific config delta most likely responsible. The opt-in shadow.causal module adds intervention-based ATE with optional bootstrap CIs and back-door adjustment for confounders.
  • The nine behavioral dimensions — response meaning, tool calls, refusals, length, response time, cost, reasoning depth, LLM-judge score, output format. Each measured independently with bootstrap 95% confidence intervals; severity bands tested empirically.
  • Statistical, formal, and causal primitives — Hotelling T² with shrinkage, SPRT and mixture-SPRT, conformal coverage with adaptive drift, LTLf model checking with bottom-up DP. These compose with the nine-axis diff to make certificates evidence-backed instead of just claim-backed. Each has its own theory page; the validation suite empirically verifies Type-I rate, power, and coverage.
  • Worked examples — 9 runnable scenarios: refund bot regression after a prompt edit, devops agent with a tool-ordering bug, ER triage with safety rules, harmful-content domain judge, public-incident reproductions (Air Canada / Avianca / NEDA / McDonald's / Replit), production-trace mining, statistical safety audit. Every example runs offline from committed fixtures with no API key required.

Where Shadow fits

Shadow is a CI/repo-native tool. It does not replace your LLM observability platform — it complements one. Most teams will end up using one of each.

| Use Langfuse / Helicone / Braintrust for | Use Shadow for |
|---|---|
| Production trace logging + dashboards | Repo-native PR comments and merge-gating |
| Cross-team trace search and visualization | Behavior contracts in YAML, enforced at PR-time and runtime |
| Long-term observability storage | Content-addressed release certificates and supply-chain signing |
| Custom evals you build in their UI | Pre-built nine-axis diff + statistical primitives |

If you want a hosted dashboard for your traces, use whichever platform you already have. If you want behavior changes blocked in your PR before they merge — and the same rule enforced at runtime so a runtime override can't ship something CI rejected — that's what Shadow ships as a single command.

shadow record -o baseline.agentlog -- python your_agent.py

# change a prompt, swap a model, re-record
shadow record -o candidate.agentlog -- python your_agent.py

shadow diff baseline.agentlog candidate.agentlog

If you want more control (custom tags, a non-default redactor, nested sessions), use the Session context manager:

from shadow.sdk import Session

with Session(output_path="trace.agentlog", tags={"env": "prod"}):
    client.messages.create(model="claude-sonnet-4-6", messages=[...])

Secrets (API keys, emails, credit cards) are redacted by default.

The TypeScript SDK covers the recording side of this same workflow plus a CI-gating decision surface. Numerical analyses that depend on the Rust core (replay, diff, bisect, certify, MCP server) stay on the Python/CLI side:

| Feature | Python | TypeScript |
|---|---|---|
| .agentlog write / parse / canonicalisation | ✅ | ✅ |
| Session context manager | ✅ | ✅ |
| Redaction | ✅ | ✅ |
| Distributed-trace (W3C) propagation | ✅ | ✅ |
| OpenAI Chat Completions + Anthropic Messages auto-instrument | ✅ | ✅ |
| OpenAI Responses API auto-instrument | ✅ | ✅ |
| Streaming aggregation in auto-instrument | ✅ | ✅ |
| LTLf evaluator (bottom-up DP, all 10 operators) | ✅ | ✅ |
| Policy gating (no_call, must_call_before, must_call_once, forbidden_text, must_include_text) | ✅ | ✅ |
| gate(records, { rules, ltlFormulas }) CI decision | ✅ (via shadow.policy_runtime) | ✅ |
| Runtime policy enforcement (EnforcedSession, pre-dispatch tool guards) | ✅ | ❌ |
| shadow certify / --sign / verify-cert | ✅ (CLI) | ❌ |
| shadow diff / bisect / replay / mine | ✅ (CLI) | ❌ |
| MCP server (shadow mcp-serve) | ✅ (CLI) | ❌ |

The Python and TypeScript SDKs ship in lockstep at the same version. The .agentlog format itself is the contract — TS-recorded traces feed into Python's shadow diff, shadow certify, and the MCP server without translation. TS gate decisions are byte-identical to their Python equivalents on the same fixtures (cross-validated by python/tests/test_typescript_parity.py). For deeper analyses (multi-axis diff, bisect, certify), run those from the Python CLI against the TS-recorded trace.

If your agent is built on LangGraph, CrewAI, or AG2, prefer the matching adapter (next section) over auto-instrumentation. Auto-instrument patches .create on the underlying provider SDK, which is a moving target across SDK majors. The framework adapters hook each framework's documented extension surface, which is the more stable contract.

CLI reference

| Command | What it does |
|---|---|
| shadow demo | Run a nine-axis diff against bundled fixtures. One command, no API key, no files written. |
| shadow quickstart | Drop a writable working demo scenario (agent.py, configs, fixtures) to edit and re-run. No API key needed. |
| shadow init | Scaffold a .shadow/ folder. --github-action drops a CI workflow. |
| shadow record -- <cmd> | Run <cmd>, auto-capture its LLM calls. Zero code changes. |
| shadow replay <cfg> --baseline <trace> | Replay the baseline through a new config. --partial --branch-at N locks a prefix, replays only the suffix. |
| shadow diff <baseline> <candidate> | Nine-axis behavior diff. --policy <f> to enforce rules. --fail-on {minor,moderate,severe} to gate the merge. --token-diff for per-turn token distribution. --suggest-fixes for LLM-assisted fix proposals. |
| shadow call <baseline> <candidate> | One-line ship-readiness call: ship / hold / probe / stop, with the dominant driver, worst axes (with bootstrap CIs), and suggested next commands. --strict makes hold/probe block; --log records to the ledger. |
| shadow autopr <baseline> <candidate> | Synthesise a Shadow policy from a regression. Purely deterministic — emits rules in the existing 12-kind language; --verify (default on) confirms each rule fires on the candidate and stays silent on the baseline. |
| shadow bisect <cfg-a> <cfg-b> --traces <set> | Attribute each axis regression to specific config deltas. |
| shadow ledger | Compact panel of recent artifacts: pass rate with 95% Wilson CI, most-concerning entry, suggested next commands. Reads .shadow/ledger/. Opt-in via --log on call/diff or via shadow log. |
| shadow trail <trace-id> | Walk back through the (anchor → candidate) edges from a trace id. Vertical chain showing each step's tier and driver, plus inline commands to re-verify or pin. |
| shadow brief | Tight summary in three formats: terminal (default), markdown (PR comments), slack (Block Kit). --slack-webhook URL posts directly via stdlib. |
| shadow listen <dir> --anchor <path> | Polling-based file-save trigger. Streams a one-line call as each new .agentlog candidate lands in the watched directory. |
| shadow holdout add/remove/list/reset | Manage held-out trace ids (acknowledged but not blocking) with owner tags, reasons, and TTL-based staleness tracking. |
| shadow log <report.json> | Append a diff or call report to the ledger. By default shadow diff writes nothing; this is the explicit way to land an entry from a CI artifact. |
| shadow schema-watch <cfg-a> <cfg-b> | Classify tool-schema changes before replaying. |
| shadow import <src> --format <fmt> | Import foreign traces (langfuse, braintrust, langsmith, openai-evals, otel, mcp, a2a, vercel-ai, pydantic-ai). |
| shadow mine <traces...> | Cluster production traces and pick representative cases as a regression suite. |
| shadow mcp-serve | Run Shadow as a Model Context Protocol server so agentic CLIs can invoke it as a tool. |
| shadow report <report.json> | Re-render a diff as terminal, markdown, or PR-comment output. |
| shadow certify <trace> | Generate an Agent Behavior Certificate (ABOM) for a release. --baseline folds in a regression-suite rollup; --policy records its hash. --sign adds a sigstore keyless signature (requires the [sign] extra). |
| shadow verify-cert <cert> | Verify that a certificate's content-addressed cert_id matches the body. Exits 1 on tamper. --verify-signature --cert-identity <id> also verifies the sigstore signature against the canonical body and a specific signer identity. |

Project layout

Shadow/
├── crates/shadow-core/         Rust core: parser, differ, replay, bisect
├── python/                     Python SDK + CLI (maturin-built, ships as shadow-diff on PyPI)
│   ├── src/shadow/
│   └── tests/
├── typescript/                 TypeScript SDK
├── docs/                       mkdocs site (published at manav8498.github.io/Shadow)
├── examples/                   Runnable scenarios (demo, customer-support, devops-agent, er-triage, etc.)
├── benchmarks/                 Scale and correctness benchmarks
├── scripts/                    One-off build and release helpers
├── .github/
│   ├── actions/shadow-action/  Reusable composite action for PR comments
│   ├── workflows/              ci.yml, docs.yml, release.yml
│   └── ISSUE_TEMPLATE/
├── SPEC.md                     The .agentlog format specification (Apache-2.0)
├── CHANGELOG.md                Release notes
├── SECURITY.md                 Security policy and vulnerability reporting
├── CONTRIBUTING.md             How to contribute
├── RELEASE.md                  Maintainer guide: cutting a release, troubleshooting publish failures
├── GOVERNANCE.md               Project governance
├── Cargo.toml                  Rust workspace manifest
├── justfile                    Common dev tasks (just setup, just test, just demo)
├── mkdocs.yml                  Docs site config
└── pricing.json                Per-model token pricing for cost attribution

License

Community

Citing

If you use Shadow in academic work, see CITATION.cff or click "Cite this repository" on the GitHub page.