swink-agent-eval
Evaluation framework for swink-agent: trajectory tracing, golden-path matching, gate checks, and the spec 043 advanced eval surface for judge-backed scoring, simulation, generation, trace replay, telemetry, and reporting.
Features
Always-on
- EvalRunner: drives cases end-to-end against a user-provided AgentFactory
- TrajectoryMatcher: matches expected tool-call sequences in exact, subset, or ordered mode (MatchMode)
- ResponseMatcher: assertions on the assistant's final text (substring, regex, semantic)
- BudgetEvaluator / EfficiencyEvaluator: per-case cost, token, and latency governance
- GateConfig: pass/fail gates on aggregate suite results (CI-friendly)
- FsEvalStore: persists trajectories and scores under a versioned directory layout
- ConsoleReporter / JsonReporter / MarkdownReporter: deterministic plain-text / JSON / PR-comment Markdown output
- Audit log (AuditedInvocation): full request/response capture for replay and debugging
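To make the three trajectory match modes concrete, here is a minimal, self-contained sketch. It does not use the crate's types: `Mode` and `matches` are hypothetical stand-ins, and the semantics shown (exact = identical sequence, subset = every expected tool appears somewhere, ordered = expected tools appear in order with gaps allowed) are an assumption about what MatchMode means, so check the crate docs before relying on them.

```rust
/// Hypothetical stand-in for the crate's `MatchMode`; the real variants
/// and their exact semantics may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    /// Actual tool calls must equal the expected sequence.
    Exact,
    /// Every expected tool must appear somewhere, in any order.
    Subset,
    /// Expected tools must appear in order; extra calls may be interleaved.
    Ordered,
}

/// Compare an expected tool-call sequence against what the agent actually did.
fn matches(expected: &[&str], actual: &[&str], mode: Mode) -> bool {
    match mode {
        Mode::Exact => expected == actual,
        Mode::Subset => expected.iter().all(|name| actual.contains(name)),
        Mode::Ordered => {
            // Walk `actual` once; each expected name must be found after
            // the position where the previous one was found.
            let mut remaining = actual.iter();
            expected
                .iter()
                .all(|name| remaining.any(|seen| seen == name))
        }
    }
}

fn main() {
    let actual = ["search", "fetch", "summarize"];

    assert!(matches(&["search", "fetch", "summarize"], &actual, Mode::Exact));
    assert!(!matches(&["search", "summarize"], &actual, Mode::Exact));

    assert!(matches(&["summarize", "search"], &actual, Mode::Subset));
    assert!(!matches(&["search", "delete"], &actual, Mode::Subset));

    assert!(matches(&["search", "summarize"], &actual, Mode::Ordered));
    assert!(!matches(&["summarize", "search"], &actual, Mode::Ordered));

    println!("all match-mode checks passed");
}
```

Under these assumed semantics, ordered matching is the usual middle ground: it tolerates extra tool calls but still fails when the agent performs the expected steps in the wrong order.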
Feature-gated (spec 043)
| Feature | Surface |
|---|---|
| judge-core | Prompt-template registry, judge cache/registry, dispatch helpers. |
| evaluator-quality | 10 quality-family evaluators (correctness, helpfulness, faithfulness…). |
| evaluator-safety | 7 safety-family evaluators (toxicity, PII, prompt-injection…). |
| evaluator-rag | 3 RAG evaluators + Embedder trait. |
| evaluator-agent | 9 agent-behaviour evaluators (trajectory accuracy, tone…). |
| evaluator-simple | Deterministic ExactMatch + LevenshteinDistance. |
| evaluator-structured | Deterministic JsonMatch + JsonSchema. |
| evaluator-code | Code-quality + harness-based evaluators. |
| evaluator-sandbox | Sandboxed execution evaluator (Unix rlimit FFI). |
| multimodal | ImageSafetyEvaluator with attachment materialization. |
| all-evaluators | Umbrella feature enabling all of the above. |
| simulation | ActorSimulator + ToolSimulator multi-turn scenarios. |
| generation | ExperimentGenerator + TopicPlanner case synthesis. |
| trace-ingest | OtelInMemoryTraceProvider, session mappers, extractors. |
| trace-otlp | OtlpHttpTraceProvider (OTel collector push/pull). |
| trace-langfuse | LangfuseTraceProvider (REST). |
| trace-opensearch | OpenSearchTraceProvider (_search API). |
| trace-cloudwatch | CloudWatchTraceProvider (caller-supplied CloudWatchLogsFetcher). |
| telemetry | EvalsTelemetry span bridge for cargo otel pipelines. |
| html-report | HtmlReporter (self-contained artifact, askama templates). |
| langsmith | LangSmithExporter: push runs + feedback to LangSmith. |
| cli | Builds the swink-eval binary (run/report/gate subcommands). |
| yaml | load_eval_set_yaml plus YAML-aware swink-eval parsing. |
| live-judges (external) | Enabled on swink-agent-eval-judges to reach real provider APIs. |
Quick recipes
# Core eval only (default): no new transitive deps beyond 023.
# Judge-backed evaluators + CLI + HTML + LangSmith:
# Trace replay against OpenSearch / CloudWatch:
Quick Start
[]
= "0.9"
= { = "0.9", = ["yaml"] }
= { = "1", = ["full"] }
use ;
async
Advanced recipes
Production judge scoring
[]
= "0.9"
= { = "0.9", = ["judge-core", "evaluator-quality"] }
= { = "0.9", = ["anthropic"] }
use Arc;
use ;
use AnthropicJudgeClient;
let judge_client = new;
let judge_registry = new;
let mut evaluators = new;
evaluators.add?;
Prompt-only reruns with caching
use Arc;
use ;
let cache = new;
let runner = with_defaults
.with_parallelism
.with_num_runs
.with_cache;
Reporters and CLI
# Re-render a saved result without re-executing the agent
use ;
print!;
let markdown = MarkdownReporter.render?;
if let Stdout = new.render?
if let Artifact = new.render?
Architecture
A run has three stages: a TrajectoryCollector captures every AgentEvent emitted by the loop, Evaluator implementations score the trajectory against an EvalCase's expectations, and an EvalStore persists the result. Budget enforcement is attached at agent construction time by converting EvalCase.budget into BudgetPolicy / MaxTurnsPolicy via BudgetConstraints::to_policies(). Matchers are independent building blocks: you can run trajectory, response, and budget checks alone or compose them via EvaluatorRegistry.
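The collect, score, persist flow described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the crate's API: `Event`, `Collector`, `Case`, `Scorer`, `MemoryStore`, and `run_case` are all hypothetical names standing in for AgentEvent, TrajectoryCollector, EvalCase, Evaluator, EvalStore, and the runner, and budget policies are left out entirely.

```rust
use std::collections::HashMap;

// Every type here is a hypothetical stand-in, not the crate's real API.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    ToolCall(String),
    Response(String),
}

/// Stage 1: capture every event the agent loop emits.
#[derive(Default)]
struct Collector {
    events: Vec<Event>,
}

impl Collector {
    fn record(&mut self, event: Event) {
        self.events.push(event);
    }
}

struct Case {
    id: String,
    expected_tools: Vec<String>,
    expected_text: String,
}

/// Stage 2: each scorer judges the trajectory independently.
trait Scorer {
    fn name(&self) -> &'static str;
    fn score(&self, case: &Case, events: &[Event]) -> f64;
}

struct ToolOrder;
impl Scorer for ToolOrder {
    fn name(&self) -> &'static str {
        "tool_order"
    }
    fn score(&self, case: &Case, events: &[Event]) -> f64 {
        let calls: Vec<&str> = events
            .iter()
            .filter_map(|e| match e {
                Event::ToolCall(name) => Some(name.as_str()),
                Event::Response(_) => None,
            })
            .collect();
        let expected: Vec<&str> = case.expected_tools.iter().map(String::as_str).collect();
        if calls == expected { 1.0 } else { 0.0 }
    }
}

struct ResponseContains;
impl Scorer for ResponseContains {
    fn name(&self) -> &'static str {
        "response_contains"
    }
    fn score(&self, case: &Case, events: &[Event]) -> f64 {
        // Only the final assistant response is checked.
        let last = events.iter().rev().find_map(|e| match e {
            Event::Response(text) => Some(text.as_str()),
            Event::ToolCall(_) => None,
        });
        match last {
            Some(text) if text.contains(case.expected_text.as_str()) => 1.0,
            _ => 0.0,
        }
    }
}

/// Stage 3: persist per-case scores (in memory for this sketch).
#[derive(Default)]
struct MemoryStore {
    results: HashMap<String, Vec<(&'static str, f64)>>,
}

fn run_case(
    case: &Case,
    emitted: Vec<Event>,
    scorers: &[Box<dyn Scorer>],
    store: &mut MemoryStore,
) -> bool {
    let mut collector = Collector::default();
    for event in emitted {
        collector.record(event);
    }
    let scores: Vec<(&'static str, f64)> = scorers
        .iter()
        .map(|s| (s.name(), s.score(case, &collector.events)))
        .collect();
    let passed = scores.iter().all(|(_, value)| *value >= 1.0);
    store.results.insert(case.id.clone(), scores);
    passed
}

fn main() {
    let case = Case {
        id: "capital-lookup".to_string(),
        expected_tools: vec!["search".to_string()],
        expected_text: "Paris".to_string(),
    };
    let emitted = vec![
        Event::ToolCall("search".to_string()),
        Event::Response("The capital of France is Paris.".to_string()),
    ];
    let scorers: Vec<Box<dyn Scorer>> = vec![Box::new(ToolOrder), Box::new(ResponseContains)];
    let mut store = MemoryStore::default();

    let passed = run_case(&case, emitted, &scorers, &mut store);
    assert!(passed);
    println!("{} -> {:?}", case.id, store.results[&case.id]);
}
```

Keeping the scorers behind one trait is what lets a single check run alone or alongside others: the runner only sees a slice of boxed scorers, which is roughly the role the text assigns to EvaluatorRegistry.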
No unsafe code (#![forbid(unsafe_code)]). Eval runs never mutate shared state outside the provided EvalStore.
Part of the swink-agent workspace. See the main README for workspace overview and setup, and swink-agent-eval-judges for the provider feature matrix and credential requirements.