Shadow
Behavior testing for LLM agents, in the pull request.
Shadow catches behavior regressions in AI agents before they merge. You change a prompt, swap a model, or rename a tool argument. Your agent still runs, tests still pass, but the behavior quietly shifts. Shadow replays your change against recorded agent traces and posts a behavior diff on the PR so a reviewer can see what broke and why.
The problem
You have a working agent in production. A teammate opens a PR that tweaks the system prompt, swaps GPT-4o for a cheaper model, or adjusts a tool schema. Code review looks fine. Unit tests pass. You merge.
A week later a customer reports that the refund bot started issuing refunds without confirming the amount. It turns out the prompt edit dropped the "ask before refunding" step. The PR that caused it was merged days ago. Nobody saw it coming because the code looked harmless.
This is a common class of bug with LLM agents. The agent runs, responses look plausible, tests pass. The behavior just silently changed.
What Shadow does
Shadow treats agent behavior as a thing you can test in CI, the same way you test code. Given a recorded set of real agent interactions (a baseline), and a candidate change (new prompt, new model, renamed tool), Shadow answers three questions on the PR:
- What behavior changed? A nine-axis diff scores the candidate against the baseline on things like meaning, tool use, refusals, latency, and output structure.
- Why did it change? If the PR touched multiple things at once, regression attribution estimates which specific change most likely explains each regression, then points you at the replay / counterfactual primitives to confirm it before merging. The stable CLI uses LASSO-based `shadow bisect`; the newer intervention-based `shadow.causal` module is opt-in.
- Is it safe to merge? A policy file lets you declare rules the agent must follow (tool ordering, output shape, token budgets, forbidden outputs). Shadow reports regressions against those rules.
The report lands in the PR comment. No dashboard, no separate login, no trace upload. Traces stay on your disk.
Install
Step 1. Check you have Python 3.11 or newer:
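```bash
python3 --version   # needs Python 3.11 or newer
```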
Step 2. Install Shadow from PyPI:
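```bash
pip install shadow-diff
```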
That's it. shadow --help should now work, and shadow quickstart runs the demo. No clone, no Rust toolchain, no separate setup step — pre-built wheels ship for Linux x86_64, macOS arm64 (Apple Silicon), and Windows x86_64.
On other platforms (Intel Mac, ARM Linux, older glibc, Alpine, FreeBSD), pip falls back to the source distribution and builds the Rust core locally. Install Rust first:
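```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh   # official rustup installer
```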
Optional extras
Shadow's core install is lean. Most users want one of:
| Extra | Pulls in | Use case |
|---|---|---|
| `shadow-diff[anthropic]` | `anthropic` | Live Anthropic client wrapper for `shadow record` |
| `shadow-diff[openai]` | `openai` | Live OpenAI client wrapper for `shadow record` |
| `shadow-diff[embeddings]` | `sentence-transformers` | Paraphrase-robust semantic-similarity axis. The default lexical TF-IDF path stays fast and dependency-free; the Embedder trait accepts any backend (this extra, ONNX runtime, HF Inference API, OpenAI embeddings, …). |
| `shadow-diff[otel]` | `opentelemetry-sdk` | Export traces to any OTel-compatible backend |
| `shadow-diff[serve]` | `fastapi`, `uvicorn`, `websockets` | `shadow serve` HTTP dashboard |
| `shadow-diff[mcp]` | `mcp` | `shadow mcp-serve` Model Context Protocol server |
| `shadow-diff[multimodal]` | `Pillow`, `imagehash` | Image / multimodal diff |
| `shadow-diff[sign]` | `sigstore` | Sigstore keyless signing for ABOM certificates |
| `shadow-diff[langgraph]` | `langgraph` | LangGraph agent adapter |
| `shadow-diff[crewai]` | `crewai` | CrewAI agent adapter |
| `shadow-diff[ag2]` | `ag2` | AG2 / Autogen agent adapter |
| `shadow-diff[all]` | everything above | One-shot install for trying the full feature surface |
| `shadow-diff[dev]` | test/lint/type-check tooling | Contributing to Shadow itself |
Combine extras with comma-separated form:
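```bash
pip install 'shadow-diff[openai,otel]'   # any combination of the extras above
```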
Telemetry: off by default, opt-in only
Shadow ships an opt-in usage-telemetry hook. When enabled, each
event includes: CLI command / event name, SDK version, Python
version, OS, CPU architecture, and an anonymous install ID
(random UUID4 generated once, stored at ~/.shadow/install_id,
never tied to identity). No traces, no prompts, no user data.
Telemetry is off by default — the first-run prompt asks before
enabling. CI environments are detected and skipped automatically.
Hard-disable with SHADOW_TELEMETRY=off in your shell. Source:
python/src/shadow/_telemetry.py.
Shadow never uploads .agentlog content, prompt text, response
text, or tool arguments. The only fields collected when telemetry
is enabled are listed in the module docstring.
Try it in sixty seconds
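```bash
shadow demo
```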
That runs a real diff on bundled .agentlog fixtures. No API key, no agent code, nothing written to your working directory. The output looks like this (abbreviated):
```text
signal             baseline   candidate     change   severity
─────────────────────────────────────────────────────────────
response meaning      1.000       0.435     -0.565   severe
tool calls            0.000       0.000     +0.000   none
refusals              0.000       0.333     +0.333   severe
response length      26.000      52.000    +26.000   minor
response time        98.000     412.000   +314.000   severe
output format         1.000       0.000     -1.000   severe

top divergences:
  #1  turn 0 — tool set changed: removed `search_files(query)`,
      added `search_files(limit,query)`
  #2  turn 2 — stop_reason changed: `end_turn` → `content_filter`

recommendations:
  error    Refusal rate is up severely. Check for stricter system instructions.
  error    Review tool-schema change at turn 0: call shape diverged.
  warning  Review response text at turn 1: semantic content shifted.
```
Three things to read top-to-bottom: the severity column tells you which signals moved, the top-divergences list names the specific changes, and the recommendations tell you what to check first. A reviewer doesn't need to know any Shadow vocabulary — the recommendation lines speak plain English.
Use shadow quickstart when you want a writable copy of the demo files (agent.py, configs, fixtures) to edit and re-run:
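```bash
shadow quickstart
```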
Get a Shadow comment on every PR (≈ 10 minutes)
The end-to-end setup, from a fresh repo to seeing your first Shadow comment land on a real PR. Skip any step you've already done.
1. Install Shadow (one line):
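```bash
pip install shadow-diff
```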
2. Record a baseline. Wrap the place where your agent runs in a Session block. Shadow auto-instruments OpenAI / Anthropic SDK calls and writes a content-addressed .agentlog file:
```python
# scripts/record_baseline.py
import shadow

with shadow.Session():   # Session arguments omitted here; see the SDK reference
    run_agent()          # whatever your existing entry-point is
```
The trace stays on your disk and inside your repo. Nothing is uploaded.
3. Drop in the GitHub Action. One command scaffolds a workflow you can commit:
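```bash
shadow init --github-action
```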
The generated workflow runs your agent against the same inputs on every PR, diffs the new run against baseline.agentlog, posts a comment, and (optionally) blocks the merge if a severe regression is detected.
4. Open a PR. The Shadow comment lands automatically. It's a markdown comment with a verdict line ("Shadow recommends: hold this PR for review"), one-bullet plain-English recommendations, and a collapsible details section for reviewers who want the numbers. See docs/sample-pr-comment.md for what it looks like.
That's it. After this point everything below is optional depth.
Writing behavior rules
The diff tells you what changed. A policy tells you what is not allowed to change. Write one YAML file that declares the agent's behavioral contract:
```yaml
# shadow-policy.yaml
rules:
  - id: confirm-before-refund
    kind: must_call_before
    params:
    severity: error
  - id: never-leak-ssn
    kind: forbidden_text
    params:
    severity: error
  - id: finish-cleanly
    kind: required_stop_reason
    params:
    severity: error
  - id: cost-ceiling
    kind: max_total_tokens
    params:
```
Run:
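```bash
shadow diff baseline.agentlog candidate.agentlog --policy shadow-policy.yaml   # trace paths are whatever you recorded
```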
The candidate trace is checked against every rule. Violations that are new in the candidate are flagged as regressions. Violations that existed in the baseline and are now cleared are flagged as fixes. Twelve rule kinds ship today: must_call_before, must_call_once, no_call, max_turns, required_stop_reason, max_total_tokens, must_include_text, forbidden_text, must_match_json_schema, must_remain_consistent, must_followup, must_be_grounded (cheap lexical grounding gate, not NLI-backed faithfulness — see docs/features/policy.md for what it catches and what it doesn't).
must_match_json_schema is the structured-output assertion: every chat response is parsed as JSON and validated against a JSON Schema. Mismatches name the offending dotted path so reviewers see exactly which field broke.
```yaml
rules:
  - id: structured-output
    kind: must_match_json_schema
    params:
      schema_path: schemas/refund_decision.schema.json
    severity: error
```
Supply either an inline `schema:` dict or a `schema_path:` pointing to a JSON Schema file. NaN / Infinity literals are rejected because they aren't valid JSON per RFC 8259, even though Python's parser accepts them.
Each rule can carry a when: clause that gates it on field-path conditions, so a rule fires only on the matching subset of pairs:
```yaml
rules:
  - id: confirm-large-refunds
    kind: forbidden_text
    params:
    when:
      -
      -
```
Supported operators: ==, !=, >, >=, <, <=, in, not_in, contains, not_contains. Multiple conditions AND together. Missing paths quietly don't match (rule is skipped on that pair) instead of crashing the whole check.
This is the part that makes Shadow feel like CI for agents instead of monitoring. See docs/features/policy.md for the full rule reference, conditional gating semantics, and severity → --fail-on mapping.
Block bad behavior at runtime
The same policy file can run inside the SDK to block or replace a violating model response at record time, not just after the fact:
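A minimal sketch (`EnforcedSession` is the class named in the SDK feature table below; the import path and constructor arguments shown are illustrative, see docs/features/runtime-enforcement.md for the exact surface):

```python
from shadow import EnforcedSession  # import path assumed

# Records like a normal Session, but checks each turn against the policy as it lands.
with EnforcedSession(policy="shadow-policy.yaml") as session:  # pass on_violation="raise" or "warn" to change the default
    run_agent()
```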
When a recorded turn introduces a new violation, the session swaps the response for a refusal payload by default (stop_reason: "policy_blocked") so downstream code keeps running. Set on_violation="raise" for hard failure, "warn" for log-only. The enforcer is incremental — whole-trace rules fire once when crossed, not once per recorded record.
For dangerous tools (issue_refund, send_email, execute_sql, delete_user), wrap the tool registry to enforce BEFORE the function runs:
```python
guarded = session.wrap_tools(tools)      # wrapper call shown is illustrative; see docs/features/runtime-enforcement.md
guarded["delete_user"](user_id="u-42")   # arguments are illustrative
# → blocked by no_call rule, real delete_user never called
```
The wrapper probes the enforcer with a synthesised candidate tool_call record. Tool-sequence rules (no_call, must_call_before, must_call_once) all work pre-dispatch. Response-text rules stay on record_chat. See docs/features/runtime-enforcement.md for the full surface, including standalone wrap_tools(..., records_provider=...) for framework-adapter integrations.
Recording real agent traces
Shadow's SDK auto-instruments the Anthropic and OpenAI SDKs. No code changes to the agent itself:
```bash
shadow record -- python agent.py   # your agent's entry-point; the path is illustrative
# change a prompt, swap a model, re-record
```
If you want more control (custom tags, a non-default redactor, nested sessions), use the Session context manager:
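For example (the keyword names shown are illustrative; check the SDK reference for the exact signature):

```python
import shadow

# tags and redactor are illustrative keyword arguments
with shadow.Session(tags={"suite": "refunds"}, redactor=my_redactor):
    run_agent()
```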
Secrets (API keys, emails, credit cards) are redacted by default.
The TypeScript SDK covers the recording side of this same workflow plus a CI-gating decision surface. Numerical analyses that depend on the Rust core (replay, diff, bisect, certify, MCP server) stay on the Python/CLI side:
| Feature | Python | TypeScript |
|---|---|---|
| `.agentlog` write / parse / canonicalisation | ✅ | ✅ |
| `Session` context manager | ✅ | ✅ |
| Redaction | ✅ | ✅ |
| Distributed-trace (W3C) propagation | ✅ | ✅ |
| OpenAI Chat Completions + Anthropic Messages auto-instrument | ✅ | ✅ |
| OpenAI Responses API auto-instrument | ✅ | ✅ |
| Streaming aggregation in auto-instrument | ✅ | ✅ |
| LTLf evaluator (bottom-up DP, all 10 operators) | ✅ | ✅ |
| Policy gating (`no_call`, `must_call_before`, `must_call_once`, `forbidden_text`, `must_include_text`) | ✅ | ✅ |
| `gate(records, { rules, ltlFormulas })` CI decision | ✅ (via `shadow.policy_runtime`) | ✅ |
| Runtime policy enforcement (`EnforcedSession`, pre-dispatch tool guards) | ✅ | ❌ |
| `shadow certify` / `--sign` / `verify-cert` | ✅ (CLI) | ❌ |
| `shadow diff` / `bisect` / `replay` / `mine` | ✅ (CLI) | ❌ |
| MCP server (`shadow mcp-serve`) | ✅ (CLI) | ❌ |
The Python SDK and TypeScript SDK ship in lockstep at the same version. The .agentlog format itself is the contract — TS-recorded traces feed into Python's shadow diff, shadow certify, and the MCP server without translation. The TS gate decisions are byte-identical to the Python equivalents on the same fixtures (cross-validated by python/tests/test_typescript_parity.py). For deeper analyses (multi-axis diff, bisect, certify), run those from the Python CLI against the TS-recorded trace.
If your agent is built on LangGraph, CrewAI, or AG2, prefer the matching adapter (next section) over auto-instrumentation. Auto-instrument patches .create on the underlying provider SDK, which is a moving target across SDK majors. The framework adapters hook each framework's documented extension surface, which is the more stable contract.
Record from agent frameworks
If your agent runs on a framework, Shadow has a direct hook for each of the three most common ones. Install the matching extra and drop the handler in; no monkey-patch, nothing to rewrite in the agent.
LangGraph / LangChain
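A sketch of the shape of the hook; the handler class name and import path below are hypothetical, so check the adapter docs for the real ones:

```python
from shadow.integrations.langgraph import ShadowTracer  # hypothetical import path

handler = ShadowTracer()  # hypothetical handler name
result = graph.invoke(    # `graph` is your compiled LangGraph graph
    {"messages": [("user", "refund order 123")]},
    config={"callbacks": [handler], "configurable": {"thread_id": "t-1"}},
)
```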
pip install 'shadow-diff[langgraph]'. Works under invoke and ainvoke. The thread_id from the config carries through as the session boundary, so one invoke is one session even across tool loops and fan-outs.
CrewAI
pip install 'shadow-diff[crewai]'. One Crew.kickoff() is one session, even when it triggers many LLM calls; the adapter marks the boundary on CrewKickoffStartedEvent.
AG2 (formerly AutoGen)
pip install 'shadow-diff[ag2]'. Captures the message bodies that autogen.opentelemetry redacts by default, so semantic diffs have something to compare against.
Replay the candidate, end-to-end, without touching production
For a candidate change to a prompt or model, shadow diff shows what's different between two recorded traces. Sandboxed replay drives the candidate's agent loop forward against a baseline and produces a candidate trace without making any real LLM calls or running any real tool side effects:
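```bash
shadow replay candidate.yaml --baseline baseline.agentlog \
  --tool-backend replay --novel-tool-policy stub   # config and trace paths are illustrative
```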
--tool-backend replay resolves every tool call against the baseline's recorded results. --novel-tool-policy decides what happens when the candidate calls a tool the baseline didn't (strict aborts, stub returns a placeholder, fuzzy matches the nearest same-tool call by arg shape). For real tool functions with side effects you'd otherwise hit, the programmatic API exposes SandboxedToolBackend which patches socket.connect, subprocess.run, and write-mode open() calls during execution. Counterfactual primitives (branch_at_turn, replace_tool_result, replace_tool_args) let you isolate one variable at a time. See docs/features/sandboxed-replay.md.
Import traces from any OpenTelemetry backend
If you already export OTLP to Datadog, Honeycomb, or any OTel collector, pipe that same export into Shadow:
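```bash
shadow import otel-export.json --format otel   # source filename is illustrative
```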
Reads the full GenAI semantic convention v1.40 surface: structured gen_ai.input.messages / gen_ai.output.messages, gen_ai.provider.name, cache tokens, tool definitions, agent spans, evaluation events. Also accepts the older v1.28-v1.36 flat indexed attributes, so traces from OpenLLMetry and similar implementers that haven't tracked the v1.37 restructure still round-trip cleanly.
Wire it into every pull request
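```bash
shadow init --github-action
```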
Drops a ready-to-commit workflow at .github/workflows/shadow-diff.yml. Point the BASELINE and CANDIDATE paths at fixtures you commit, and every PR gets a behavior-diff comment.
To gate the merge, add --fail-on severe (or moderate / minor) to the shadow diff step. The PR comment posts first; the gate runs as a separate step so a blocked PR still has the explanation.
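```bash
shadow diff baseline.agentlog candidate.agentlog --fail-on severe   # trace paths are illustrative
```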
Exits 1 when the worst axis severity or policy regression hits the threshold; 0 otherwise.
Sign every release with an Agent Behavior Certificate
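```bash
shadow certify candidate.agentlog --baseline baseline.agentlog --policy shadow-policy.yaml   # paths are illustrative
```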
Produces a content-addressed JSON release artifact (Agent Behavior Bill of Materials) that captures the trace's content-id, all distinct models observed, content-ids of system prompts and tool schemas, the policy file hash, and an optional baseline-vs-candidate nine-axis regression-suite rollup. The certificate is self-verifying: verify-cert recomputes the body hash and exits 1 on tamper, so it works as a release gate.
Add --sign to layer cosign / sigstore keyless signing on top:
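```bash
shadow certify candidate.agentlog --sign
```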
The signed payload is the canonical certificate body, so tampering breaks both cert_id and the signature. The signature is bound to a specific signer identity (a workflow URL or email) — a leaked Bundle signed by another identity won't verify even if the crypto is otherwise valid. See docs/features/certificate.md for the full format, signing details, and MCP integration.
Use Shadow from an agentic CLI (MCP server)
Shadow speaks the Model Context Protocol. Any MCP-aware client (Claude Desktop, Claude Code, Cursor, Zed, Windsurf, and others) can invoke Shadow as a tool:
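For example, a typical client registration (the config file location and server name vary by client; the entry just launches `shadow mcp-serve`):

```json
{
  "mcpServers": {
    "shadow": {
      "command": "shadow",
      "args": ["mcp-serve"]
    }
  }
}
```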
Tools exposed: shadow_diff, shadow_check_policy, shadow_token_diff, shadow_schema_watch, shadow_summarise, shadow_certify, shadow_verify_cert. Install the extra first: pip install 'shadow-diff[mcp]'. See docs/features/mcp-server.md for the per-tool reference.
Mine production traces into a regression suite
Most teams never write eval sets because it's tedious. Let Shadow do it from your production traces:
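```bash
shadow mine traces/*.agentlog   # point it at your recorded production traces
```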
Clusters every turn-pair by tool sequence, stop reason, and verbosity, picks the most interesting example from each cluster (errors, refusals, high cost, heavy reasoning, very long or empty responses), and writes a new .agentlog you can commit as your CI baseline.
Why regressions happened, not just that they happened
When a PR changes three things at once (prompt + model + tool schema), a diff alone cannot tell you which one broke the agent. shadow bisect fits a sparse linear model (LASSO over corners with Meinshausen-Bühlmann stability selection) that attributes each behavioral axis's regression to specific config deltas:
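```bash
shadow bisect config-a.yaml config-b.yaml --traces regression-suite.agentlog   # paths are illustrative
```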
Output:
```text
attribution:
  trajectory ← search_files.arguments.limit added     (weight 0.72)
  semantic   ← system_prompt line 42 changed          (weight 0.19)
  latency    ← model: claude-haiku → gpt-4o-mini      (weight 0.61)
```
The review comment tells you: "72% of the trajectory regression is explained by the tool-schema change. Revert that line and the agent should behave."
The nine behavioral dimensions
Each dimension is measured independently with a bootstrap 95% confidence interval. Severity is one of none, minor, moderate, severe:
| # | Dimension | What it measures |
|---|---|---|
| 1 | `semantic` | How different are the outputs' meanings? |
| 2 | `trajectory` | Did the agent use a different sequence of tools? |
| 3 | `safety` | Did refusal rates change? |
| 4 | `verbosity` | Are outputs longer or shorter? |
| 5 | `latency` | Is it slower or faster? |
| 6 | `cost` | Are token costs up or down? |
| 7 | `reasoning` | Is the agent thinking less or more? |
| 8 | `judge` | Your own LLM-judge rubric (optional). |
| 9 | `conformance` | Does the output match the expected structure? |
Per-axis math, severity bands, and bootstrap details: docs/features/nine-axis.md. The on-disk trace format is in SPEC.md.
Where Shadow fits among existing tools
| | Langfuse | Braintrust | LangSmith | Shadow |
|---|---|---|---|---|
| Trace logging | ✅ | ✅ | ✅ | ✅ |
| Dashboard UI | ✅ | ✅ | ✅ | no |
| Local-first / repo-native | partial (self-host) | partial (self-host) | no | ✅ |
| PR comment from CI | partial | partial | partial | ✅ |
| Declarative YAML behavior policy | partial via evals | partial via evals | partial via evals | ✅ |
| Merge-blocking PR check | partial via webhooks | partial via webhooks | partial via webhooks | ✅ |
| Content-addressed release certificate | no | no | no | ✅ |
| Cosign / sigstore signing on certificate | no | no | no | ✅ |
| Regression attribution (LASSO bisect, stable CLI) | no | no | no | ✅ |
| Intervention-based causal attribution (foundation, opt-in) | no | no | no | ✅ |
| Nine pre-built behavior axes | partial | partial | partial | ✅ |
| Open content-addressed trace format | no | no | no | ✅ |
The "partial" cells reflect that all three platforms support evals + webhooks + custom CI integrations that a determined team can build into a PR-comment / gate workflow. Shadow's claim isn't that those tools can't be wired up — it's that Shadow ships the workflow as a single command, and ships an open trace format, declarative policy language, and signed release certificate as primitives. Pair Shadow with any of these tools for the dashboard side.
Examples
Every example runs offline from committed fixtures. No API key required:
| Example | What it shows |
|---|---|
| `examples/demo/` | The fastest working example. `just demo`. |
| `examples/customer-support/` | Refund bot that regresses after a well-meaning prompt edit |
| `examples/devops-agent/` | Database agent with a tool-ordering bug that unit tests would miss |
| `examples/er-triage/` | High-stakes clinical scenario with safety rules |
| `examples/edge-cases/` | 20 adversarial probes used as a regression guard |
| `examples/refund-agent-audit/` | Statistical safety audit on a model upgrade (Hotelling T² + SPRT + LTL + conformal) |
| `examples/canary-monitor/` | Production canary with always-valid mSPRT and Bonferroni-corrected family-wise error |
| `examples/harmful-content-judge/` | Domain-aware harm detection where the safety axis isn't enough |
| `examples/production-incident-suite/` | Five public-incident patterns (Air Canada, Avianca, NEDA, McDonald's, Replit) caught by the v2.5+ pipeline |
| `examples/integrations/` | Push traces to Datadog, Splunk, or any OTel collector |
Statistical, formal, and causal primitives (v2.5+)
Shadow ships a layer most LLM-eval tools don't have — empirically-validated statistical and formal-methods-inspired primitives that make certificates more evidence-backed. The mix is heterogeneous: rigorous statistical tests (Hotelling T², SPRT, conformal coverage), genuine formal verification on traces (LTLf model checking with bottom-up DP), causal-inference-inspired attribution (intervention-based ATE, foundation), and signed certificates that compose all of the above. Not every certificate carries a formal proof; each component is documented for what it is. These compose with the nine-axis diff above.
| Module | What it does | Reference |
|---|---|---|
| `shadow.statistical` | Behavioral fingerprinting, Hotelling T² (with OAS shrinkage and permutation p-values), Wald + mixture SPRT, variance-adaptive `MSPRTtDetector` | docs/theory/sprt.md, docs/theory/hotelling.md |
| `shadow.ltl` | Finite-trace LTLf model checking with bottom-up DP (O(\|π\|×\|φ\|)); WeakUntil for "must-call-before" rules; YAML compiler | docs/theory/ltl.md |
| `shadow.conformal` | Distribution-free split conformal (`conformal_calibrate`); Adaptive Conformal Inference (`ACIDetector`, Gibbs & Candès 2021) for distribution shift | docs/theory/conformal.md |
| `shadow.causal` | Intervention-based causal attribution foundation, inspired by Pearl-style causal inference: per-delta ATE, optional percentile-bootstrap CIs (Efron 1979), optional back-door adjustment for named confounders when users can supply or justify stratum weights (the default uniform weights are unbiased only under uniform P(C=c)), optional VanderWeele-Ding (2017) E-value sensitivity. Not yet the default `shadow bisect` engine — that remains LASSO-based. | docs/theory/causal.md |
| `shadow.diff_py` | Scenario-aware multi-case diff: partition by `meta.scenario_id`, run per-scenario diffs without spurious "dropped turns" | |
| `shadow.policy_suggest` | Mine baseline traces for `must_call_before` patterns; suggest policies the operator approves before adding | |
| `shadow.storage` | Pluggable `Storage` interface (`FileStore` + `InMemoryStore` in OSS; cloud backends plug in) | |
The validation suite at python/tests/test_statistical_validation.py (run with pytest -m slow) empirically verifies Type-I rate, power across an effect-size × n grid, the always-valid bound under continuous peeking, and conformal coverage on heavy-tailed held-out data.
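```bash
pytest -m slow python/tests/test_statistical_validation.py
```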
CLI reference
| Command | Does |
|---|---|
| `shadow demo` | Run a nine-axis diff against bundled fixtures. One command, no API key, no files written. |
| `shadow quickstart` | Drop a writable working demo scenario (agent.py, configs, fixtures) to edit and re-run. No API key needed. |
| `shadow init` | Scaffold a `.shadow/` folder. `--github-action` drops a CI workflow. |
| `shadow record -- <cmd>` | Run `<cmd>`, auto-capture its LLM calls. Zero code changes. |
| `shadow replay <cfg> --baseline <trace>` | Replay baseline through a new config. `--partial --branch-at N` locks a prefix, replays only the suffix. |
| `shadow diff <baseline> <candidate>` | Nine-axis behavior diff. `--policy <f>` to enforce rules. `--fail-on {minor,moderate,severe}` to gate the merge. `--token-diff` for per-turn token distribution. `--suggest-fixes` for LLM-assisted fix proposals. |
| `shadow call <baseline> <candidate>` | One-line ship-readiness call: ship / hold / probe / stop, with the dominant driver, worst axes (with bootstrap CIs), and suggested next commands. `--strict` makes hold/probe block; `--log` records to the ledger. |
| `shadow autopr <baseline> <candidate>` | Synthesise a Shadow policy from a regression. Pure deterministic — emits rules in the existing 12-kind language; `--verify` (default on) confirms each rule fires on the candidate and stays silent on the baseline. |
| `shadow bisect <cfg-a> <cfg-b> --traces <set>` | Attribute each axis regression to specific config deltas. |
| `shadow ledger` | Compact panel of recent artifacts: pass rate with 95% Wilson CI, most-concerning entry, suggested next commands. Reads `.shadow/ledger/`. Opt-in via `--log` on call/diff or via `shadow log`. |
| `shadow trail <trace-id>` | Walk back through the (anchor → candidate) edges from a trace id. Vertical chain showing each step's tier and driver, plus inline commands to re-verify or pin. |
| `shadow brief` | Tight summary in three formats: terminal (default), markdown (PR comments), slack (Block Kit). `--slack-webhook URL` posts directly via stdlib. |
| `shadow listen <dir> --anchor <path>` | Polling-based file-save trigger. Streams a one-line call as each new `.agentlog` candidate lands in the watched directory. |
| `shadow holdout add/remove/list/reset` | Manage held-out trace ids (acknowledged-but-not-blocking) with owner tags, reasons, and TTL-based staleness tracking. |
| `shadow log <report.json>` | Append a diff or call report to the ledger. Default `shadow diff` writes nothing; this is the explicit way to land an entry from a CI artifact. |
| `shadow schema-watch <cfg-a> <cfg-b>` | Classify tool-schema changes before replaying. |
| `shadow import <src> --format <fmt>` | Import foreign traces (langfuse, braintrust, langsmith, openai-evals, otel, mcp, a2a, vercel-ai, pydantic-ai). |
| `shadow mine <traces...>` | Cluster production traces and pick representative cases as a regression suite. |
| `shadow mcp-serve` | Run Shadow as a Model Context Protocol server so agentic CLIs can invoke it as a tool. |
| `shadow report <report.json>` | Re-render a diff as terminal, markdown, or PR-comment. |
| `shadow certify <trace>` | Generate an Agent Behavior Certificate (ABOM) for a release. `--baseline` folds in a regression-suite rollup; `--policy` records its hash. `--sign` adds a sigstore keyless signature (requires the `[sign]` extra). |
| `shadow verify-cert <cert>` | Verify a certificate's content-addressed cert_id matches the body. Exits 1 on tamper. `--verify-signature --cert-identity <id>` also verifies the sigstore signature against the canonical body and a specific signer identity. |
Project layout
```text
Shadow/
├── crates/shadow-core/   Rust core: parser, differ, replay, bisect
├── python/               Python SDK + CLI (maturin-built, ships as shadow-diff on PyPI)
│   ├── src/shadow/
│   └── tests/
├── typescript/           TypeScript SDK
├── docs/                 mkdocs site (published at manav8498.github.io/Shadow)
├── examples/             Runnable scenarios (demo, customer-support, devops-agent, er-triage, etc.)
├── benchmarks/           Scale and correctness benchmarks
├── scripts/              One-off build and release helpers
├── .github/
│   ├── actions/shadow-action/   Reusable composite action for PR comments
│   ├── workflows/               ci.yml, docs.yml, release.yml
│   └── ISSUE_TEMPLATE/
├── SPEC.md               The .agentlog format specification (Apache-2.0)
├── CHANGELOG.md          Release notes
├── SECURITY.md           Security policy and vulnerability reporting
├── CONTRIBUTING.md       How to contribute
├── RELEASE.md            Maintainer guide: cutting a release, troubleshooting publish failures
├── GOVERNANCE.md         Project governance
├── Cargo.toml            Rust workspace manifest
├── justfile              Common dev tasks (just setup, just test, just demo)
├── mkdocs.yml            Docs site config
└── pricing.json          Per-model token pricing for cost attribution
```
License
- Code and spec: Apache License 2.0.
- Name "Shadow" and logo: see TRADEMARK.md.
- Contributions: every commit must carry a Developer Certificate of Origin sign-off (`git commit -s`). See CONTRIBUTING.md.
Community
- GitHub Discussions for questions and help
- GitHub Issues for bugs and feature requests
- SECURITY.md to report vulnerabilities privately
- CONTRIBUTING.md to contribute
- Contributor Covenant v2.1
Citing
If you use Shadow in academic work, see CITATION.cff or click "Cite this repository" on the GitHub page.