swink-agent-eval

Evaluation framework for swink-agent: trajectory tracing, golden-path matching, and gate checks, plus the spec 043 advanced eval surface (judge-backed scoring, simulation, generation, trace replay, telemetry, and reporting).

Features

Always-on

EvalRunner — drives cases end-to-end against a user-provided AgentFactory
TrajectoryMatcher — match expected tool-call sequences with exact, subset, or ordered modes (MatchMode)
ResponseMatcher — assertions on the assistant's final text (substring, regex, semantic)
BudgetEvaluator / EfficiencyEvaluator — per-case cost, token, and latency governance
GateConfig — pass/fail gates on aggregate suite results (CI-friendly)
FsEvalStore — persist trajectories and scores under a versioned directory layout
ConsoleReporter / JsonReporter / MarkdownReporter — deterministic plain-text / JSON / PR-comment Markdown output
Audit log (AuditedInvocation) — full request/response capture for replay and debugging
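
The three trajectory match modes named above can be illustrated with a small standalone sketch. It assumes that "exact" means identical sequences, "subset" means every expected call appears somewhere, and "ordered" means the expected calls appear as a subsequence. The `SketchMatchMode` enum and `matches` function are hypothetical names for illustration, not the crate's `MatchMode` or `TrajectoryMatcher` API, and the crate's actual semantics may differ.

```rust
// Hypothetical stand-in for the README's MatchMode; NOT the crate's definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchMatchMode {
    /// Same tool calls, same order, nothing extra.
    Exact,
    /// Every expected call appears somewhere; order and extra calls are ignored.
    Subset,
    /// Expected calls appear in order, with other calls allowed in between.
    Ordered,
}

/// Compare an expected tool-call sequence against the calls actually observed.
pub fn matches(mode: SketchMatchMode, expected: &[&str], actual: &[&str]) -> bool {
    match mode {
        SketchMatchMode::Exact => expected == actual,
        // Assumption: multiplicity is ignored, so one actual call can satisfy
        // several identical expected entries.
        SketchMatchMode::Subset => expected.iter().all(|e| actual.contains(e)),
        SketchMatchMode::Ordered => {
            // `any` consumes the iterator up to and including the match, so
            // each expected call must be found after the previous one.
            let mut remaining = actual.iter();
            expected.iter().all(|e| remaining.any(|a| a == e))
        }
    }
}
```

With expected calls `["search", "read"]` and actual calls `["search", "log", "read"]`, this sketch rejects the trajectory in `Exact` mode and accepts it in `Ordered` and `Subset` modes.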

Feature-gated (spec 043)

Feature	Surface
`judge-core`	Prompt-template registry, judge cache/registry, dispatch helpers.
`evaluator-quality`	10 quality-family evaluators (correctness, helpfulness, faithfulness…).
`evaluator-safety`	7 safety-family evaluators (toxicity, PII, prompt-injection…).
`evaluator-rag`	3 RAG evaluators + `Embedder` trait.
`evaluator-agent`	9 agent-behaviour evaluators (trajectory accuracy, tone…).
`evaluator-simple`	Deterministic `ExactMatch` + `LevenshteinDistance`.
`evaluator-structured`	Deterministic `JsonMatch` + `JsonSchema`.
`evaluator-code`	Code-quality + harness-based evaluators.
`evaluator-sandbox`	Sandboxed execution evaluator (Unix rlimit FFI).
`multimodal`	`ImageSafetyEvaluator` with attachment materialization.
`all-evaluators`	Umbrella feature enabling all of the above.
`simulation`	`ActorSimulator` + `ToolSimulator` multi-turn scenarios.
`generation`	`ExperimentGenerator` + `TopicPlanner` case synthesis.
`trace-ingest`	`OtelInMemoryTraceProvider`, session mappers, extractors.
`trace-otlp`	`OtlpHttpTraceProvider` (OTel collector push/pull).
`trace-langfuse`	`LangfuseTraceProvider` (REST).
`trace-opensearch`	`OpenSearchTraceProvider` (`_search` API).
`trace-cloudwatch`	`CloudWatchTraceProvider` (caller-supplied `CloudWatchLogsFetcher`).
`telemetry`	`EvalsTelemetry` span bridge for `cargo otel` pipelines.
`html-report`	`HtmlReporter` (self-contained artifact, `askama` templates).
`langsmith`	`LangSmithExporter` — push runs + feedback to LangSmith.
`cli`	Builds the `swink-eval` binary (`run`/`report`/`gate` subcommands).
`yaml`	`load_eval_set_yaml` plus YAML-aware `swink-eval` parsing.
`live-judges` (external)	Enabled on `swink-agent-eval-judges` to reach real provider APIs.
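
To give a feel for what a deterministic evaluator such as the `LevenshteinDistance` entry in the table computes, here is a minimal standalone sketch of the classic edit-distance algorithm. The function names and the normalisation into a 0.0 to 1.0 score are assumptions made for illustration; the crate's evaluator may count, normalise, or threshold differently.

```rust
/// Edit distance between two strings, counted in Unicode scalar values.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        // Turning the first i + 1 chars of `a` into "" costs i + 1 deletions.
        let mut curr = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Hypothetical normalisation: 1.0 for identical strings, 0.0 for fully different.
pub fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

A score like `similarity` is convenient when an evaluator needs a pass threshold rather than a raw distance, since the raw distance grows with string length.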

Quick recipes

# Core eval only (default): no new transitive deps beyond 023.
cargo add swink-agent-eval

# Judge-backed evaluators + CLI + HTML + LangSmith:
cargo add swink-agent-eval --features "all-evaluators,html-report,langsmith,cli"
cargo add swink-agent-eval-judges --features anthropic

# Trace replay against OpenSearch / CloudWatch:
cargo add swink-agent-eval --features "trace-ingest,trace-opensearch,trace-cloudwatch"

Quick Start

[dependencies]
swink-agent = "0.9"
swink-agent-eval = { version = "0.9", features = ["yaml"] }
tokio = { version = "1", features = ["full"] }

use swink_agent_eval::{EvalRunner, EvalSet, AgentFactory};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let set = EvalSet {
        id: "demo".into(),
        name: "Demo".into(),
        description: None,
        cases: vec![/* EvalCase entries */],
    };

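    // `my_factory` is a placeholder for your own `AgentFactory` implementation;
    // its construction is application-specific and is not shown here.
    let runner = EvalRunner::with_defaults();
    let result = runner.run_set(&set, &my_factory).await?;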

    println!("Passed: {}/{}", result.summary.passed, result.summary.total_cases);
    Ok(())
}
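
The passed and total counts printed in the Quick Start are the kind of aggregate a CI gate consumes. The sketch below shows one plausible shape for such a check: a minimum pass rate mapped to a process exit code. `SketchSummary`, `SketchGate`, and `min_pass_rate` are hypothetical names chosen for illustration; the crate's `GateConfig` fields and API are not shown in this README and may differ.

```rust
use std::process::ExitCode;

// Hypothetical stand-ins; NOT the crate's summary or GateConfig types.
pub struct SketchSummary {
    pub passed: usize,
    pub total_cases: usize,
}

pub struct SketchGate {
    /// Minimum fraction of cases that must pass, in 0.0..=1.0.
    pub min_pass_rate: f64,
}

impl SketchGate {
    /// Returns Ok(()) when the suite clears the gate, or a message explaining why not.
    pub fn check(&self, summary: &SketchSummary) -> Result<(), String> {
        if summary.total_cases == 0 {
            // Assumption: an empty suite is treated as a failure, not a pass.
            return Err("no cases were run".to_string());
        }
        let rate = summary.passed as f64 / summary.total_cases as f64;
        if rate >= self.min_pass_rate {
            Ok(())
        } else {
            Err(format!(
                "pass rate {rate:.2} is below the required {:.2}",
                self.min_pass_rate
            ))
        }
    }
}

/// Map the gate outcome to an exit code so a CI job fails when the gate fails.
pub fn exit_code(gate: &SketchGate, summary: &SketchSummary) -> ExitCode {
    match gate.check(summary) {
        Ok(()) => ExitCode::SUCCESS,
        Err(msg) => {
            eprintln!("gate failed: {msg}");
            ExitCode::FAILURE
        }
    }
}
```

Returning `ExitCode` from `main` lets the CI runner treat a failed gate like any other failing step without extra scripting.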

Advanced recipes

Production judge scoring

[dependencies]
swink-agent = "0.9"
swink-agent-eval = { version = "0.9", features = ["judge-core", "evaluator-quality"] }
swink-agent-eval-judges = { version = "0.9", features = ["anthropic"] }

use std::sync::Arc;
use swink_agent_eval::{
    evaluators::{CorrectnessEvaluator, JudgeEvaluatorConfig},
    judge::JudgeRegistry,
    EvaluatorRegistry,
};
use swink_agent_eval_judges::AnthropicJudgeClient;

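// The `?` operators below assume this snippet runs inside a function that
// returns a `Result`, such as the `main` shown in Quick Start.
let model = "claude-sonnet-4-6"; // shared by the client and the registry
let judge_client = Arc::new(AnthropicJudgeClient::new(
    "https://api.anthropic.com",
    std::env::var("ANTHROPIC_API_KEY")?,
    model,
));
let judge_registry = Arc::new(JudgeRegistry::builder(judge_client, model).build()?);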

let mut evaluators = EvaluatorRegistry::new();
evaluators.add(CorrectnessEvaluator::new(
    JudgeEvaluatorConfig::default_with(judge_registry),
))?;

Prompt-only reruns with caching

use std::sync::Arc;
use swink_agent_eval::{cache::LocalFileTaskResultStore, EvalRunner};

let cache = Arc::new(LocalFileTaskResultStore::new("./.swink-eval-cache".into()));
let runner = EvalRunner::with_defaults()
    .with_parallelism(8)
    .with_num_runs(3)
    .with_cache(cache);
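
The idea behind a task-result cache can be sketched independently of the crate: results are stored under a key derived from the case and its input, so a rerun with unchanged input skips the expensive execution. Everything below (`SketchCache`, `get_or_run`, the key derivation) is a hypothetical in-memory illustration, not how `LocalFileTaskResultStore` is implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical in-memory cache; NOT the crate's LocalFileTaskResultStore.
#[derive(Default)]
pub struct SketchCache {
    entries: HashMap<u64, String>,
    /// Number of times the expensive closure actually ran.
    pub executions: usize,
}

impl SketchCache {
    /// Derive a key from the case id and its input. DefaultHasher output is
    /// not guaranteed stable across Rust releases, so an on-disk cache would
    /// need a stable hash instead.
    fn key(case_id: &str, input: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        case_id.hash(&mut hasher);
        input.hash(&mut hasher);
        hasher.finish()
    }

    /// Return the cached output for this case and input, running `run` only on a miss.
    pub fn get_or_run(
        &mut self,
        case_id: &str,
        input: &str,
        run: impl FnOnce() -> String,
    ) -> String {
        let key = Self::key(case_id, input);
        if let Some(hit) = self.entries.get(&key) {
            return hit.clone();
        }
        self.executions += 1;
        let output = run();
        self.entries.insert(key, output.clone());
        output
    }
}
```

In this sketch a second call with the same case id and input returns the stored output without invoking the closure, while any change to the input produces a new key and a fresh execution.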

Reporters and CLI

# Re-render a saved result without re-executing the agent
swink-eval report result.json --format html > report.html

use swink_agent_eval::report::{
    ConsoleReporter, HtmlReporter, JsonReporter, MarkdownReporter, Reporter, ReporterOutput,
};

// `result` is the value returned by `EvalRunner::run_set` (see Quick Start).
print!("{}", ConsoleReporter.render(&result)?);
let markdown = MarkdownReporter.render(&result)?;

if let ReporterOutput::Stdout(json) = JsonReporter::new().render(&result)? {
    std::fs::write("result.json", json)?;
}

if let ReporterOutput::Artifact { bytes, .. } = HtmlReporter::new().render(&result)? {
    std::fs::write("report.html", bytes)?;
}

Architecture

A run passes through three stages: a TrajectoryCollector captures every AgentEvent emitted by the loop, Evaluator implementations score the trajectory against an EvalCase's expectations, and an EvalStore persists the result. Budget enforcement is attached when the agent is constructed, by converting EvalCase.budget into BudgetPolicy / MaxTurnsPolicy via BudgetConstraints::to_policies(). Matchers are independent building blocks: you can run trajectory, response, and budget checks on their own or compose them via EvaluatorRegistry.
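
The collect, score, persist flow described above can be sketched with plain traits. All type and method names here (`SketchEvent`, `SketchCollector`, `SketchEvaluator`, `SketchStore`, `evaluate`) are hypothetical stand-ins used only to show how the stages connect; the crate's real TrajectoryCollector, Evaluator, and EvalStore signatures are not reproduced here.

```rust
// Hypothetical stand-ins for the three stages; NOT the crate's real types.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchEvent {
    ToolCall(String),
    FinalText(String),
}

/// Stage 1: capture every event the agent loop emits.
#[derive(Default)]
pub struct SketchCollector {
    pub events: Vec<SketchEvent>,
}

impl SketchCollector {
    pub fn record(&mut self, event: SketchEvent) {
        self.events.push(event);
    }
}

/// Stage 2: score a captured trajectory.
pub trait SketchEvaluator {
    fn name(&self) -> &str;
    fn score(&self, trajectory: &[SketchEvent]) -> f64;
}

/// Example evaluator: 1.0 if the named tool was called at least once, else 0.0.
pub struct UsedTool(pub String);

impl SketchEvaluator for UsedTool {
    fn name(&self) -> &str {
        "used_tool"
    }
    fn score(&self, trajectory: &[SketchEvent]) -> f64 {
        let hit = trajectory
            .iter()
            .any(|e| matches!(e, SketchEvent::ToolCall(n) if n == &self.0));
        if hit { 1.0 } else { 0.0 }
    }
}

/// Stage 3: persist the scores for a case.
pub trait SketchStore {
    fn save(&mut self, case_id: &str, scores: Vec<(String, f64)>);
}

#[derive(Default)]
pub struct MemoryStore {
    pub saved: Vec<(String, Vec<(String, f64)>)>,
}

impl SketchStore for MemoryStore {
    fn save(&mut self, case_id: &str, scores: Vec<(String, f64)>) {
        self.saved.push((case_id.to_string(), scores));
    }
}

/// Run every evaluator over the collected trajectory and persist the scores.
pub fn evaluate(
    case_id: &str,
    collector: &SketchCollector,
    evaluators: &[Box<dyn SketchEvaluator>],
    store: &mut dyn SketchStore,
) {
    let scores: Vec<(String, f64)> = evaluators
        .iter()
        .map(|e| (e.name().to_string(), e.score(&collector.events)))
        .collect();
    store.save(case_id, scores);
}
```

Keeping the stages behind separate traits is what allows each piece to be swapped independently, for example replacing the in-memory store with a filesystem-backed one without touching the evaluators.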

No unsafe code (#![forbid(unsafe_code)]). Eval runs never mutate shared state outside the provided EvalStore.

Part of the swink-agent workspace — see the main README for workspace overview and setup, and swink-agent-eval-judges for the provider feature matrix and credential requirements.

swink-agent-eval 0.9.0