swink-agent-eval 0.8.1

Evaluation framework for swink-agent: trajectory tracing, golden path verification, and cost governance

swink-agent-eval

License: MIT

Evaluation framework for swink-agent — trajectory tracing, golden-path matching, and cost/latency budget enforcement in one harness.

Features

  • EvalRunner — drives cases end-to-end against a user-provided AgentFactory
  • TrajectoryMatcher — match expected tool-call sequences with exact, subset, or ordered modes (MatchMode)
  • ResponseMatcher — assertions on the assistant's final text (substring, regex, semantic)
  • BudgetEvaluator / EfficiencyEvaluator — per-case cost, token, and latency governance
  • GateConfig — pass/fail gates on aggregate suite results (CI-friendly)
  • FsEvalStore — persist trajectories and scores under a versioned directory layout
  • yaml feature — load EvalSets from YAML with load_eval_set_yaml
  • Audit log (AuditedInvocation) — full request/response capture for replay and debugging
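To make the three trajectory match modes concrete, here is a minimal standalone sketch of one plausible reading of exact, subset, and ordered matching over tool-call names. The `Mode` enum and `matches` function are hypothetical stand-ins written for illustration; they are not the crate's `MatchMode` or `TrajectoryMatcher` API, whose exact semantics may differ.

```rust
/// Hypothetical stand-in for the crate's `MatchMode`; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    /// Same tool calls, in the same order, with nothing extra.
    Exact,
    /// Every expected call appears somewhere; order, extras, and
    /// repetition counts are ignored (an assumption of this sketch).
    Subset,
    /// Expected calls appear in order, possibly with other calls in between.
    Ordered,
}

/// Returns true when `actual` satisfies `expected` under `mode`.
pub fn matches(mode: Mode, expected: &[&str], actual: &[&str]) -> bool {
    match mode {
        Mode::Exact => expected == actual,
        Mode::Subset => expected.iter().all(|e| actual.contains(e)),
        Mode::Ordered => {
            // Subsequence check: the iterator over `actual` only moves
            // forward, so each expected call must follow the previous one.
            let mut remaining = actual.iter();
            expected.iter().all(|e| remaining.any(|a| a == e))
        }
    }
}
```

Under this reading, an expected sequence of `search` then `read` matches an actual run of `search`, `log`, `read` in ordered and subset modes, but not in exact mode.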

Quick Start

[dependencies]
swink-agent = "0.8"
swink-agent-eval = { version = "0.8", features = ["yaml"] }
tokio = { version = "1", features = ["full"] }

use swink_agent_eval::{EvalRunner, EvalSet, AgentFactory};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let set = EvalSet {
        id: "demo".into(),
        name: "Demo".into(),
        description: None,
        cases: vec![/* EvalCase entries */],
    };

    let runner = EvalRunner::with_defaults();
    // `my_factory` is your own `AgentFactory` implementation (not shown here).
    let result = runner.run_set(&set, &my_factory).await?;

    println!("Passed: {}/{}", result.summary.passed, result.summary.total_cases);
    Ok(())
}
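In CI you would typically turn the printed summary into a pass/fail decision. The crate offers GateConfig for this; the sketch below does not use it and only illustrates the idea with hypothetical names. `Summary` mirrors just the two fields used above (`passed`, `total_cases`), and `check_gate` with its `min_pass_rate` parameter is an assumption made for this example.

```rust
/// Hypothetical mirror of the two summary fields used in the Quick Start.
#[derive(Debug, Clone, Copy)]
pub struct Summary {
    pub passed: usize,
    pub total_cases: usize,
}

/// Fraction of cases that passed; an empty suite counts as 0.0.
pub fn pass_rate(summary: &Summary) -> f64 {
    if summary.total_cases == 0 {
        0.0
    } else {
        summary.passed as f64 / summary.total_cases as f64
    }
}

/// Returns Err with a readable message when the suite falls below
/// `min_pass_rate` (a value in 0.0..=1.0). Treating an empty suite as a
/// failure is a design choice of this sketch, not documented crate behavior.
pub fn check_gate(summary: &Summary, min_pass_rate: f64) -> Result<(), String> {
    let rate = pass_rate(summary);
    if summary.total_cases > 0 && rate >= min_pass_rate {
        Ok(())
    } else {
        Err(format!(
            "gate failed: {}/{} passed ({:.0}%), required {:.0}%",
            summary.passed,
            summary.total_cases,
            rate * 100.0,
            min_pass_rate * 100.0
        ))
    }
}
```

A caller could map the `Err` case to a non-zero process exit code so that the CI job fails when the suite regresses.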

Architecture

A run passes through three stages: a TrajectoryCollector captures every AgentEvent emitted by the loop, Evaluator implementations score the resulting trajectory against an EvalCase's expectations, and an EvalStore persists the result. Budget enforcement is attached at agent construction time: BudgetConstraints::to_policies() converts EvalCase.budget into BudgetPolicy / MaxTurnsPolicy. Matchers are independent building blocks, so you can run trajectory, response, and budget checks alone or compose them via EvaluatorRegistry.
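The collect, score, persist flow can be sketched in a few dozen lines. Everything below is a simplified, hypothetical model written for illustration: `Event`, `Collector`, `Evaluator`, `Store`, and `evaluate_case` are stand-ins and do not reproduce the signatures of the crate's AgentEvent, TrajectoryCollector, Evaluator, or EvalStore.

```rust
use std::collections::HashMap;

/// Hypothetical, heavily simplified event type.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    ToolCall(String),
    FinalText(String),
}

/// Stage 1: record every event emitted during a run.
#[derive(Debug, Default)]
pub struct Collector {
    events: Vec<Event>,
}

impl Collector {
    pub fn record(&mut self, event: Event) {
        self.events.push(event);
    }
    pub fn finish(self) -> Vec<Event> {
        self.events
    }
}

/// Stage 2: anything that can judge a finished trajectory.
pub trait Evaluator {
    fn name(&self) -> &str;
    fn passes(&self, trajectory: &[Event]) -> bool;
}

/// Passes when the named tool was called at least once.
pub struct CallsTool(pub &'static str);

impl Evaluator for CallsTool {
    fn name(&self) -> &str {
        "calls_tool"
    }
    fn passes(&self, trajectory: &[Event]) -> bool {
        trajectory
            .iter()
            .any(|e| matches!(e, Event::ToolCall(t) if t.as_str() == self.0))
    }
}

/// Passes when the final text contains the given substring.
pub struct ResponseContains(pub &'static str);

impl Evaluator for ResponseContains {
    fn name(&self) -> &str {
        "response_contains"
    }
    fn passes(&self, trajectory: &[Event]) -> bool {
        trajectory
            .iter()
            .any(|e| matches!(e, Event::FinalText(s) if s.contains(self.0)))
    }
}

/// Stage 3: persist per-evaluator verdicts for a case.
pub trait Store {
    fn save(&mut self, case_id: &str, verdicts: Vec<(String, bool)>);
}

/// In-memory store, standing in for a filesystem-backed one.
#[derive(Debug, Default)]
pub struct MemoryStore {
    pub saved: HashMap<String, Vec<(String, bool)>>,
}

impl Store for MemoryStore {
    fn save(&mut self, case_id: &str, verdicts: Vec<(String, bool)>) {
        self.saved.insert(case_id.to_string(), verdicts);
    }
}

/// Runs every evaluator, saves the verdicts, and reports whether all passed.
pub fn evaluate_case(
    case_id: &str,
    trajectory: &[Event],
    evaluators: &[Box<dyn Evaluator>],
    store: &mut dyn Store,
) -> bool {
    let verdicts: Vec<(String, bool)> = evaluators
        .iter()
        .map(|ev| (ev.name().to_string(), ev.passes(trajectory)))
        .collect();
    let all_passed = verdicts.iter().all(|(_, ok)| *ok);
    store.save(case_id, verdicts);
    all_passed
}
```

Because each evaluator only sees an immutable trajectory slice, evaluators stay independent of one another, and the store is the only place where state is written.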

No unsafe code (#![forbid(unsafe_code)]). Eval runs never mutate shared state outside the provided EvalStore.


Part of the swink-agent workspace — see the main README for workspace overview and setup.