assay-runner-spike 3.12.0

Use Assay if you already have machine-readable AI outcomes or agent tool-call tests and want a small reviewable artifact boundary in CI.

Start with the path that matches what you already have:

You have	Use this when	What you get	Next click
Promptfoo JSONL from CI evals	You want smaller PR evidence than a full eval export	Eval outcome receipts, verified bundle, Trust Basis diff	Promptfoo JSONL
OpenFeature boolean `EvaluationDetails`	You want CI evidence for a runtime flag decision boundary	Decision receipt, verified bundle, Trust Basis diff	OpenFeature EvaluationDetails
CycloneDX ML-BOM model component	You want CI evidence for the model inventory/provenance boundary that existed	Inventory receipt, verified bundle, Trust Basis diff	CycloneDX ML-BOM
MCP tool calls	You are ready to put a policy file around tool execution	Allow/deny audit trail and evidence for observed tool behavior	MCP Quick Start
A GitHub PR gate	You want CI to block regressions from checked artifacts	Trust Basis diff, gate status, SARIF/JUnit-ready output	CI Guide

The core workflow is intentionally small: import or record a bounded outcome, bundle and verify it, compile trust-basis.json, then gate the Trust Basis diff. Assay does not make the upstream tool the source of truth; it makes the evidence boundary inspectable.

Trust Basis Gate
Status: OK
Bundles verified: 1
Regressed claims: 0

Assay is not a trust-score engine, a generic eval dashboard, or a hosted observability product. See What Assay is and is not for the boundary.

Is This For Me?

Yes, if you:

already have eval output, runtime decisions, inventory artifacts, or MCP tool-call tests
want a CI review artifact instead of a dashboard-only result
need bounded auditability, not a scalar trust badge

Not yet, if you:

need Assay to judge model correctness or policy quality for you
want a hosted dashboard as the primary product
want a compliance claim instead of a bounded evidence boundary

Install

cargo install assay-cli

CI: GitHub Action. Python SDK: pip install assay-it.

No hosted backend. No API keys for core flows. Deterministic: same input, same decision.

Trust claims use explicit epistemology, not a single “safety score”:

Level	Meaning
`verified`	Backed by direct evidence or offline verification in the bundle/path
`self_reported`	Emitted by the system without stronger independent corroboration
`inferred`	Derived from bounded, documented rules
`absent`	No trustworthy evidence supports the claim

Assay does not ship a primary aggregate trust score or a safe/unsafe badge as the main output. See ADR-033.
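A consumer that wants to flag a claim moving to a weaker level could compare levels as in the sketch below. The ranking is an assumption made for illustration: the table above names the four levels but does not define an order between them, and `LEVEL_RANK` and `is_regression` are hypothetical names, not part of Assay.

```python
# Hypothetical ranking: the level table names these four values but does not
# define an order between them. Adjust this to your own review policy.
LEVEL_RANK = {"absent": 0, "inferred": 1, "self_reported": 2, "verified": 3}


def is_regression(before: str, after: str) -> bool:
    """Return True when a claim moved to a lower-ranked level."""
    return LEVEL_RANK[after] < LEVEL_RANK[before]


print(is_regression("verified", "self_reported"))  # True
print(is_regression("inferred", "verified"))       # False
```

The comparison stays per claim; nothing here folds the levels into a single score.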

What ships today

Output	Role
Policy gate	MCP `wrap` — deterministic allow/deny before tools run (see CLI note below the diagram).
Evidence bundle	Offline-verifiable, tamper-evident archive for audit and replay.
External receipts	Selected eval outcomes, runtime decision details, and inventory/provenance surfaces as bounded evidence receipts with JSON Schema contracts.
Trust Basis	Canonical `trust-basis.json` — bounded claim classification from verified bundles.
Trust Card	`trustcard.json` / `trustcard.md` / `trustcard.html` — same claims, review-friendly artifacts.
SARIF / CI	GitHub Action, Security tab integration, policy gates on PRs.

Repository truth: release notes and CHANGELOG.md remain the authority for what is actually public. main may carry release-prep commits before a tag is cut; crates.io publication is separate from repository merge state.

  Agent ──► Assay ──► MCP Server
              │
              ├─ ✅ ALLOW / ❌ DENY  (policy)
              ├─► 📋 Evidence bundle (verifiable)
              └─► 📊 Trust Basis → Trust Card → SARIF / CI

CLI: The mcp command group is hidden from top-level assay --help while the surface stabilizes; it is supported. Use assay mcp --help, assay mcp wrap …, or follow the MCP Quickstart.

Wedge, not category. “MCP firewall” describes the control plane; trust compilation describes the outcome: reviewable claims backed by evidence. See ADR-033 and RFC-005.

See It Work

cargo install assay-cli

mkdir -p /tmp/assay-demo && echo "safe content" > /tmp/assay-demo/safe.txt

assay mcp wrap --policy examples/mcp-quickstart/policy.yaml \
  -- npx @modelcontextprotocol/server-filesystem /tmp/assay-demo

✅ ALLOW  read_file  path=/tmp/assay-demo/safe.txt  reason=policy_allow
✅ ALLOW  list_dir   path=/tmp/assay-demo/           reason=policy_allow
❌ DENY   read_file  path=/tmp/outside-demo.txt      reason=path_constraint_violation
❌ DENY   exec       cmd=ls                          reason=tool_denied

Inspect the audit artifact:

assay evidence show demo/fixtures/bundle.tar.gz

[Screenshot: Evidence Bundle Inspector]

The bundle is tamper-evident and cryptographically verifiable. Signed mandate events can include an Ed25519-backed authorization trail for high-risk actions.

Trust artifacts from a verified bundle

After a bundle verifies, compile the claim artifact:

# Machine-readable claim basis (deterministic, claim-first)
assay trust-basis generate demo/fixtures/bundle.tar.gz > trust-basis.json

trust-basis.json is the canonical output for CI and review. Claim id values are stable across runs; consumers should key by id, not row count or order. It is not a scalar trust score.
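Keying by id rather than by position can be sketched as follows. The document shape used here (a top-level `claims` list whose entries carry `id` and `level`) is an assumption for illustration, as are the placeholder claim ids and the helper names; check the real trust-basis.json contract before relying on any field name.

```python
import json


def index_claims(doc: dict) -> dict:
    # Assumed shape: {"claims": [{"id": "...", "level": "..."}, ...]}
    return {claim["id"]: claim["level"] for claim in doc["claims"]}


def diff_claims(baseline: dict, candidate: dict) -> dict:
    """Map each changed claim id to its (old level, new level) pair."""
    old, new = index_claims(baseline), index_claims(candidate)
    return {
        cid: (old.get(cid), new.get(cid))
        for cid in sorted(old.keys() | new.keys())
        if old.get(cid) != new.get(cid)
    }


# Placeholder ids; the second document lists the same claims in a different order.
baseline = json.loads(
    '{"claims": [{"id": "example-claim-a", "level": "verified"},'
    ' {"id": "example-claim-b", "level": "inferred"}]}'
)
candidate = json.loads(
    '{"claims": [{"id": "example-claim-b", "level": "inferred"},'
    ' {"id": "example-claim-a", "level": "absent"}]}'
)
print(diff_claims(baseline, candidate))
# {'example-claim-a': ('verified', 'absent')}
```

Because the comparison goes through a dictionary, reordering rows or adding new claims does not produce spurious differences.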

The current claim-visible receipt families are Promptfoo assertion-component results, OpenFeature boolean EvaluationDetails, and CycloneDX ML-BOM model components. See the receipt-family matrix, the three-family note, and Evidence Receipts in Action.

assay trustcard generate demo/fixtures/bundle.tar.gz --out-dir ./trust-out
# -> trust-out/trustcard.json , trust-out/trustcard.md , trust-out/trustcard.html

The Trust Card is a deterministic render of the same claim rows plus frozen non-goals; trustcard.json is canonical, while Markdown and static HTML are reviewer projections. Contract versions, pack floors, and release checklist: MIGRATION — Trust Compiler 3.2, receipt-family matrix. Release history belongs in CHANGELOG.md.

Add to Cursor in 30 Seconds

Assay ships a helper that finds your local Cursor MCP config path and prints a ready-to-paste entry:

assay mcp config-path cursor

It generates JSON like:

{
  "filesystem-secure": {
    "command": "assay",
    "args": [
      "mcp",
      "wrap",
      "--policy",
      "/path/to/policy.yaml",
      "--",
      "npx",
      "-y",
      "@modelcontextprotocol/server-filesystem",
      "/Users/you"
    ]
  }
}

The same wrapped command works in other MCP clients — see MCP Quick Start.
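If you need the same entry for several servers, the JSON shape shown above can be produced with a few lines of Python. This is only a sketch: `wrapped_entry` is a hypothetical helper, not something Assay ships, and where the entry is pasted depends on the MCP client.

```python
import json


def wrapped_entry(name: str, policy_path: str, server_cmd: list) -> dict:
    """Build one MCP client entry that launches a server behind `assay mcp wrap`."""
    return {
        name: {
            "command": "assay",
            # Everything after "--" is the wrapped server command line.
            "args": ["mcp", "wrap", "--policy", policy_path, "--", *server_cmd],
        }
    }


entry = wrapped_entry(
    "filesystem-secure",
    "/path/to/policy.yaml",
    ["npx", "-y", "@modelcontextprotocol/server-filesystem", "/Users/you"],
)
print(json.dumps(entry, indent=2))
```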

Policy Is Simple

version: "2.0"
name: "my-policy"

tools:
  allow: ["read_file", "list_dir"]
  deny: ["exec", "shell", "write_file"]

schemas:
  read_file:
    type: object
    additionalProperties: false
    properties:
      path:
        type: string
        pattern: "^/app/.*"
        minLength: 1
    required: ["path"]

Policies that use the legacy constraints: form still work. Use assay policy migrate to convert them to the v2 JSON Schema form, or assay init --from-trace trace.jsonl to generate a policy from observed behavior.

See Policy Files.
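To make the example policy concrete, the sketch below applies it to a few tool calls using only the Python standard library. It is an illustration, not Assay's engine: the precedence rules (deny wins, tools on neither list are denied) are assumptions, only the string keywords used in the example are handled, and `decide` is a hypothetical name.

```python
import re

# The example policy written as a plain dict, so no YAML parser is needed.
POLICY = {
    "tools": {"allow": ["read_file", "list_dir"], "deny": ["exec", "shell", "write_file"]},
    "schemas": {
        "read_file": {
            "type": "object",
            "additionalProperties": False,
            "properties": {"path": {"type": "string", "pattern": "^/app/.*", "minLength": 1}},
            "required": ["path"],
        }
    },
}


def decide(tool: str, args: dict, policy: dict = POLICY) -> str:
    tools = policy["tools"]
    # Assumption: deny wins, and tools on neither list are denied.
    if tool in tools["deny"] or tool not in tools["allow"]:
        return "DENY"
    schema = policy["schemas"].get(tool)
    if schema is None:
        return "ALLOW"  # allowed tool with no argument schema
    props = schema.get("properties", {})
    if any(key not in args for key in schema.get("required", [])):
        return "DENY"
    if schema.get("additionalProperties") is False and set(args) - set(props):
        return "DENY"
    for key, rule in props.items():
        if key in args and rule.get("type") == "string":
            value = args[key]
            if not isinstance(value, str) or len(value) < rule.get("minLength", 0):
                return "DENY"
            # JSON Schema patterns are unanchored, so re.search is the closest match.
            if "pattern" in rule and re.search(rule["pattern"], value) is None:
                return "DENY"
    return "ALLOW"


print(decide("read_file", {"path": "/app/data.txt"}))  # ALLOW
print(decide("read_file", {"path": "/etc/passwd"}))    # DENY
print(decide("exec", {"cmd": "ls"}))                   # DENY
```

Each decision depends only on the policy and the call arguments, so the same call always yields the same result.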

OpenTelemetry in, canonical evidence out

Assay ingests OpenTelemetry JSONL, builds replayable traces, and exports canonical evidence — OTel is a bridge, not the sole semantic authority.

assay trace ingest-otel \
  --input otel-export.jsonl \
  --db .eval/eval.db \
  --out-trace traces/otel.v2.jsonl

See OpenTelemetry & Langfuse.
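Before handing an export to the ingest command, it can help to confirm that the file really is JSONL, that is, one JSON value per line. The pre-flight check below is a generic sketch and not part of Assay; `invalid_jsonl_lines` is a hypothetical helper, and skipping blank lines is an assumption about what the ingester tolerates.

```python
import json


def invalid_jsonl_lines(lines) -> list:
    """Return the 1-based numbers of lines that do not parse as JSON."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


sample = ['{"name": "span-a"}', "", '{"name": ', '{"name": "span-b"}']
print(invalid_jsonl_lines(sample))  # [3]
```

For a real file, pass the open file object, for example `invalid_jsonl_lines(open("otel-export.jsonl"))`.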

Protocol adapters

Assay ships adapters that map protocol events into canonical evidence:

Protocol	Adapter	What it maps
ACP (OpenAI/Stripe)	`assay-adapter-acp`	Checkout events, payment intents, tool calls
A2A (Google)	`assay-adapter-a2a`	Agent capabilities, task delegation, artifacts
UCP (Google/Shopify)	`assay-adapter-ucp`	Discover/buy/post-purchase state transitions

Adapter crates are workspace / binary-driven, not published as separate crates.io packages.

Add to CI

# .github/workflows/assay.yml
name: Assay Gate
on: [push, pull_request]
permissions:
  contents: read
  security-events: write
jobs:
  assay:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Rul1an/assay-action@v2

PRs that violate policy get blocked; SARIF can surface in the Security tab.

Why Assay


Canonical evidence	Assay’s evidence model is the stable contract; OTel and adapters map into it.
Deterministic	Same input, same decision — not probabilistic.
Portable artifacts	Bundles, Trust Basis, Trust Card, SARIF — for CI, review, audit.
Bounded claims	Explicit about what is verified vs visible vs absent — no score-first UX.
MCP-native wedge	`assay mcp wrap` is the fast path (the `mcp` group is hidden from `assay --help`; use `assay mcp --help`). Adapters extend the same engine.
Offline-first	No backend required for core enforcement and bundle verification.

Tool-decision latency on the protected path, measured with the fragmented-IPI harness on an M1 Pro running macOS:

Main protection run: 0.771ms p50 / 1.913ms p95
Fast-path scenario: 0.345ms p50 / 1.145ms p95

These are tool-decision timings, not end-to-end model latency. (See Research & experiments for methodology context.)
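For readers less familiar with the notation, p50 and p95 are the values below which 50% and 95% of the measured decisions fall. The sketch uses the nearest-rank method on made-up samples; the harness's actual percentile method is not stated here, so treat this only as a reading aid.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, for illustration only.
samples_ms = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0]
print(percentile(samples_ms, 50))  # 0.6
print(percentile(samples_ms, 95))  # 2.0
```

A single slow sample moves p95 far more than p50, which is why both figures are reported.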

Learn More

Promptfoo JSONL to Evidence Receipts — smallest adoption path for existing eval artifacts
OpenFeature EvaluationDetails to CI Review Artifact — runtime decision receipt path
CycloneDX ML-BOM Model to Inventory Receipt — model inventory/provenance receipt path
MCP Quickstart — filesystem server walkthrough
Policy Files — YAML schema for assay mcp wrap
OpenTelemetry & Langfuse — traces → replay and evidence
CI Guide — GitHub Action
Evidence Store — S3, B2, MinIO
ADR-033: Trust compiler positioning
RFC-005: Trust compiler MVP & Trust Card

Internal: Assay-Runner

Assay-Runner is an internal measured-run subsystem used by Assay's delegated Linux/eBPF acceptance path. It is not a standalone product. As of Phase 2D, the runner candidate is split into extraction-ready Rust crates (assay-runner-schema, assay-runner-core, assay-runner-linux) — all publish = false — plus the runner-fixtures/ package tree (Node fixture marked "private": true; Python fixture has no distribution surface). Everything stays inside this repository.

Assay-Runner reference index — internal contracts, boundary map, slice history
Measured-run proof-bundle walkthrough — read-only walkthrough for maintainers evaluating standalone use cases
Phase 2D consolidation audit — current burn-in criteria; the extraction question is closed until the criteria are observed and at least one concrete external use case appears

No release commitment. No timeline. No external demand has been measured.

Research, mappings & experiments

Bounded context: numbers below support mapping and experiments, not a product “security score.”

OWASP MCP Top 10 Mapping — how Assay relates to each risk category (coverage is not a scalar guarantee).
Third-party survey: popular MCP servers often show weak defaults — Assay adds policy + evidence; see discussion in the mapping doc.
Security experiments — attack vectors and harness notes (methodology matters more than headline counts).

Contributing

cargo test --workspace
cargo clippy --workspace --all-targets -- -D warnings

See CONTRIBUTING.md. Discussions: GitHub Discussions — seed topics for pinned threads live in docs/community/DISCUSSIONS.md.

License

MIT