Crate swink_agent_eval

Evaluation framework for swink-agent.

Provides trajectory tracing, golden path verification, response matching, and cost/latency governance for LLM-powered agent loops.

§Quick Start

use swink_agent_eval::{EvalRunner, EvalSet, EvalCase};

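// An empty set for illustration; a real set lists its `EvalCase`s in `cases`.
let set = EvalSet { id: "demo".into(), name: "Demo".into(), description: None, cases: vec![] };
let runner = EvalRunner::with_defaults();
// `my_factory` is not defined here; it is presumably a value implementing `AgentFactory`.
// `.await?` requires an enclosing async function that returns a `Result`.
let result = runner.run_set(&set, &my_factory).await?;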
println!("Passed: {}/{}", result.summary.passed, result.summary.total_cases);

Re-exports§

pub use aggregator::Aggregator;
pub use aggregator::AllPass;
pub use aggregator::AnyPass;
pub use aggregator::Average;
pub use aggregator::Weighted;
pub use cache::CacheKey as TaskResultCacheKey;
pub use cache::EvaluationDataStore;
pub use cache::FingerprintContext;
pub use cache::LocalFileTaskResultStore;
pub use cache::StoreError;
pub use cache::canonicalize_fingerprint;
pub use cache::tool_set_hash;
pub use evaluators::agent::AgentToneEvaluator;
pub use evaluators::agent::InteractionsEvaluator;
pub use evaluators::agent::KnowledgeRetentionEvaluator;
pub use evaluators::agent::LanguageDetectionEvaluator;
pub use evaluators::agent::PerceivedErrorEvaluator;
pub use evaluators::agent::TaskCompletionEvaluator;
pub use evaluators::agent::TrajectoryAccuracyEvaluator;
pub use evaluators::agent::TrajectoryAccuracyWithRefEvaluator;
pub use evaluators::agent::UserSatisfactionEvaluator;
pub use evaluators::code::llm_judge::CodeLlmJudgeEvaluator;
pub use evaluators::code::CargoCheckEvaluator;
pub use evaluators::code::ClippyEvaluator;
pub use evaluators::code::CodeExtractor;
pub use evaluators::code::CodeExtractorStrategy;
pub use evaluators::code::SandboxLimits;
pub use evaluators::code::SandboxOutcome;
pub use evaluators::code::SandboxRunner;
pub use evaluators::code::SandboxedExecutionEvaluator;
pub use evaluators::code::ShellRunner;
pub use evaluators::code::run_sandboxed;
pub use evaluators::multimodal::ImageSafetyEvaluator;
pub use evaluators::quality::CoherenceEvaluator;
pub use evaluators::quality::ConcisenessEvaluator;
pub use evaluators::quality::CorrectnessEvaluator;
pub use evaluators::quality::FaithfulnessEvaluator;
pub use evaluators::quality::GoalSuccessRateEvaluator;
pub use evaluators::quality::HallucinationEvaluator;
pub use evaluators::quality::HelpfulnessEvaluator;
pub use evaluators::quality::LazinessEvaluator;
pub use evaluators::quality::PlanAdherenceEvaluator;
pub use evaluators::quality::ResponseRelevanceEvaluator;
pub use evaluators::quality::assertion_implies_goal_completion;
pub use evaluators::rag::DEFAULT_EMBEDDING_SIMILARITY_THRESHOLD;
pub use evaluators::rag::Embedder;
pub use evaluators::rag::EmbedderError;
pub use evaluators::rag::EmbeddingSimilarityEvaluator;
pub use evaluators::rag::RAGGroundednessEvaluator;
pub use evaluators::rag::RAGHelpfulnessEvaluator;
pub use evaluators::rag::RAGRetrievalRelevanceEvaluator;
pub use evaluators::safety::CodeInjectionEvaluator;
pub use evaluators::safety::FairnessEvaluator;
pub use evaluators::safety::HarmfulnessEvaluator;
pub use evaluators::safety::PIIClass;
pub use evaluators::safety::PIILeakageEvaluator;
pub use evaluators::safety::PromptInjectionEvaluator;
pub use evaluators::safety::ToxicityEvaluator;
pub use evaluators::simple::ExactMatchEvaluator;
pub use evaluators::simple::LevenshteinDistanceEvaluator;
pub use evaluators::structured::JsonMatchEvaluator;
pub use evaluators::structured::JsonSchemaEvaluator;
pub use evaluators::structured::KeyStrategy;
pub use evaluators::Detail;
pub use evaluators::DetailBuffer;
pub use evaluators::DispatchError;
pub use evaluators::DispatchOutcome;
pub use evaluators::EvaluatorError;
pub use evaluators::JudgeEvaluatorBuilder;
pub use evaluators::JudgeEvaluatorConfig;
pub use evaluators::dispatch_judge;
pub use evaluators::drive_judge_call;
pub use evaluators::evaluate_with_builtin;
pub use evaluators::finish_metric_result;
pub use evaluators::materialize_case_attachments;
pub use judge::CacheKey;
pub use judge::DEFAULT_JUDGE_CACHE_CAPACITY;
pub use judge::JudgeCache;
pub use judge::JudgeClient;
pub use judge::JudgeError;
pub use judge::JudgeFuture;
pub use judge::JudgeRegistry;
pub use judge::JudgeRegistryBuilder;
pub use judge::JudgeRegistryError;
pub use judge::JudgeVerdict;
pub use judge::RetryPolicy;
pub use prompt::BUILTIN_TEMPLATE_VERSIONS;
pub use prompt::JudgePromptTemplate;
pub use prompt::MinijinjaTemplate;
pub use prompt::PromptContext;
pub use prompt::PromptError;
pub use prompt::PromptFamily;
pub use prompt::PromptTemplateRegistry;
pub use report::HtmlReporter;
pub use report::ConsoleReporter;
pub use report::JsonReporter;
pub use report::MarkdownReporter;
pub use report::Reporter;
pub use report::ReporterError;
pub use report::ReporterOutput;
pub use report::LangSmithExportError;
pub use report::LangSmithExporter;
pub use telemetry::EvalsTelemetry;
pub use telemetry::EvalsTelemetryBuilder;
pub use testing::MockJudge;
pub use testing::PanickingMockJudge;
pub use testing::SlowMockJudge;
pub use trace::LangfuseTraceProvider;
pub use trace::OtlpHttpTraceProvider;
pub use training::ChatMlExporter;
pub use training::DpoExporter;
pub use training::ExportError;
pub use training::ExportOptions;
pub use training::ScoredTrace;
pub use training::ShareGptExporter;
pub use training::TrainingExporter;
pub use training::TrainingFormat;
pub use training::TrainingReporter;
pub use training::export_traces;
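
The re-exported LevenshteinDistanceEvaluator suggests an edit-distance comparison between the actual and expected response. How the crate normalizes that distance into a score is not stated on this page, so the following is only a minimal sketch of the general technique. The function names `levenshtein` and `similarity`, and the normalization by the longer string's length, are assumptions made for illustration.

```rust
/// Classic dynamic-programming Levenshtein distance over Unicode scalar values.
/// This is a generic textbook version, not the crate's implementation.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] is the distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for i in 1..=a.len() {
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Hypothetical normalization (an assumption): 1.0 for identical strings,
/// approaching 0.0 as the number of edits approaches the longer length.
pub fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are treated as identical
    }
    1.0 - levenshtein(a, b) as f64 / longest as f64
}
```

A normalized value like this fits the [0.0, 1.0] range that Score documents, which is why the sketch divides by the longer length rather than returning the raw distance.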

Modules§

aggregator: Aggregation strategies for combining evaluator outputs.
cache: Eval runner cache abstractions.
ci: Bundled CI templates for swink-eval consumers (spec 043 T156-T160).
evaluators: Extended evaluator families for advanced eval features.
generation: Experiment generation primitives.
judge: LLM-as-judge client trait and registry for semantic evaluators.
prompt: Prompt templates and rendering infrastructure for judge-backed evaluators.
report: Reporters and export surfaces for eval results.
simulation: Multi-turn simulation support for eval case generation and replay.
telemetry: OpenTelemetry integration for eval runs (spec 043 US7, FR-035).
testing: Test doubles and helpers for the evaluation framework.
trace: Trace ingestion providers, mappers, and extractors (spec 043 US6).
training: RL-compatible training-format trace export (feature: training-export).
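
The aggregator module re-exports four strategies: AllPass, AnyPass, Average, and Weighted. Their exact signatures are not shown on this page, so the sketch below only illustrates what such strategies typically compute. The `Outcome` struct and the free functions are hypothetical stand-ins, not the crate's Aggregator trait.

```rust
/// Hypothetical per-evaluator outcome (an assumption, not a crate type):
/// a score in [0.0, 1.0] plus whether that evaluator passed.
#[derive(Debug, Clone, Copy)]
pub struct Outcome {
    pub score: f64,
    pub passed: bool,
}

/// Passes only if every evaluator passed (vacuously true for an empty slice).
pub fn all_pass(outcomes: &[Outcome]) -> bool {
    outcomes.iter().all(|o| o.passed)
}

/// Passes if at least one evaluator passed.
pub fn any_pass(outcomes: &[Outcome]) -> bool {
    outcomes.iter().any(|o| o.passed)
}

/// Arithmetic mean of the scores; `None` when there is nothing to average.
pub fn average(outcomes: &[Outcome]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    Some(outcomes.iter().map(|o| o.score).sum::<f64>() / outcomes.len() as f64)
}

/// Weighted mean; `None` if the lengths differ or the weights do not sum to a positive value.
pub fn weighted(outcomes: &[Outcome], weights: &[f64]) -> Option<f64> {
    if outcomes.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return None;
    }
    let sum: f64 = outcomes.iter().zip(weights).map(|(o, w)| o.score * w).sum();
    Some(sum / total)
}
```

The boolean strategies suit hard gates, while the two means suit cases where one weak evaluator should lower, but not veto, the combined result.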

Macros§

impl_judge_evaluator_builder: Convenience macro that implements JudgeEvaluatorBuilder for a struct holding a config: JudgeEvaluatorConfig field.

Structs§

Assertion: Judge-evaluated assertion expected to hold after an agent invocation.
AuditedInvocation: An Invocation wrapped with a hash chain for tamper detection.
BudgetConstraints: Budget constraints for cost and latency governance.
BudgetEvaluator: Evaluator that checks cost, token, and turn budgets.
CaseFingerprint: Canonical serializable projection of an EvalCase used for deterministic session IDs and future cache keys.
DefaultUrlFilter: Default SSRF-oriented filter for remote eval assets.
EfficiencyEvaluator: Evaluator that scores trajectory efficiency based on duplicate tool calls and step count relative to an ideal.
EnvironmentState: Named snapshot of an environment state produced by a StateCapture.
EnvironmentStateEvaluator: Deterministic evaluator for environment-side effects.
EvalCase: A single evaluation scenario.
EvalCaseResult: Result of evaluating a single case.
EvalMetricResult: Per-evaluator result for a single case.
EvalRunner: Orchestrates evaluation: runs agents, captures trajectories, and scores results. Default: sequential, num_runs=1, no cache, no cancellation.
EvalSet: A named collection of evaluation cases.
EvalSetResult: Result of evaluating an entire eval set.
EvalSummary: Aggregated statistics for an eval set run.
EvaluatorRegistry: Registry of named evaluators, stored as Arc<dyn Evaluator>.
ExpectedToolCall: A single expected tool invocation in a golden path.
FewShotExample: Example shown to a judge prompt before the case being evaluated.
FsEvalStore: Filesystem-backed eval store using JSON files.
GateConfig: Configuration for CI/CD gate checks against evaluation results.
GateResult: Result of a CI/CD gate check.
InteractionExpectation: Expected interaction between agents, tools, or hand-off participants.
Invocation: Complete trace of an agent run, built by TrajectoryCollector.
MaterializedAttachment: Bytes ready for judge-client payload construction.
RecordedToolCall: A tool call as captured from the agent event stream.
ResponseMatcher: Evaluator that scores the final response text against expected criteria.
RunnerMetricSample: Aggregated per-(case, evaluator) sample surfaced by EvalRunner::with_num_runs. std_dev over the samples quantifies judge non-determinism (research §R-013).
Score: A numeric score in [0.0, 1.0] with a configurable pass threshold.
SemanticToolParameterEvaluator: Semantic tool-parameter evaluator backed by a JudgeClient.
SemanticToolSelectionEvaluator: Semantic tool-selection evaluator backed by a JudgeClient.
ToolIntent: Expected semantic tool intent used by the tool-parameter semantic evaluator.
TrajectoryCollector: Builds an Invocation from a stream of AgentEvents.
TrajectoryMatcher: Evaluator that compares actual tool call trajectories against expected golden paths.
TurnRecord: A single recorded turn from an agent run.
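
AuditedInvocation is described as wrapping an Invocation with a hash chain for tamper detection. The general idea of a hash chain can be sketched as follows. Everything here is an assumption for illustration: `ChainedRecord`, `build_chain`, and `verify_chain` are hypothetical names, and the standard library's `DefaultHasher` is used only to keep the example dependency-free. It is not a cryptographic hash, so a real audit trail would need something like SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical chained record: a payload plus a link that depends on every earlier record.
#[derive(Debug, Clone)]
pub struct ChainedRecord {
    pub payload: String,
    pub link: u64,
}

/// Hashes the previous link together with the current payload.
/// NOTE: DefaultHasher is NOT cryptographically secure; it stands in for a real digest.
fn link_hash(prev: u64, payload: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev.hash(&mut hasher);
    payload.hash(&mut hasher);
    hasher.finish()
}

/// Builds a chain in which each link commits to all records before it.
pub fn build_chain(payloads: &[&str]) -> Vec<ChainedRecord> {
    let mut prev = 0u64; // fixed starting value for the first link
    payloads
        .iter()
        .map(|p| {
            prev = link_hash(prev, p);
            ChainedRecord { payload: (*p).to_string(), link: prev }
        })
        .collect()
}

/// Recomputes every link; any edited payload or link breaks verification.
pub fn verify_chain(chain: &[ChainedRecord]) -> bool {
    let mut prev = 0u64;
    for record in chain {
        let expected = link_hash(prev, &record.payload);
        if expected != record.link {
            return false;
        }
        prev = expected;
    }
    true
}
```

Because each link folds in the previous one, changing an early record invalidates every later link, which is what makes after-the-fact edits detectable.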

Enums§

AssertionKind: Assertion categories used by judge-backed evaluators.
Attachment: Multimodal attachment reference attached to an evaluation case.
AttachmentError: Structured attachment materialization errors.
EvalError: The top-level error type for eval operations.
MatchMode: How to compare actual tool calls against expected.
ResponseCriteria: Criteria for matching the final response text.
Verdict: Binary pass/fail outcome derived from a Score.
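
Score is documented as a numeric value in [0.0, 1.0] with a configurable pass threshold, and Verdict as the binary outcome derived from it. A minimal sketch of that relationship follows. `SketchScore` and `SketchVerdict` are hypothetical stand-ins, and both the clamping and the inclusive `>=` comparison are assumptions, since the page does not say how the boundary case is treated.

```rust
/// Hypothetical stand-in for a binary pass/fail outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchVerdict {
    Pass,
    Fail,
}

/// Hypothetical stand-in for a thresholded score.
#[derive(Debug, Clone, Copy)]
pub struct SketchScore {
    value: f64,
    threshold: f64,
}

impl SketchScore {
    /// Clamps both inputs into [0.0, 1.0] (an assumption about out-of-range handling).
    pub fn new(value: f64, threshold: f64) -> Self {
        SketchScore {
            value: value.clamp(0.0, 1.0),
            threshold: threshold.clamp(0.0, 1.0),
        }
    }

    pub fn value(&self) -> f64 {
        self.value
    }

    /// Assumes a score exactly at the threshold passes.
    pub fn verdict(&self) -> SketchVerdict {
        if self.value >= self.threshold {
            SketchVerdict::Pass
        } else {
            SketchVerdict::Fail
        }
    }
}
```

Keeping the threshold on the score itself lets each evaluator choose its own bar while downstream code only inspects the verdict.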

Constants§

CASE_NAMESPACE: Stable namespace for deterministic case-derived session IDs.

Traits§

AgentFactory: Factory that creates a configured Agent for each eval case.
EvalStore: Persistence interface for eval sets and results.
Evaluator: Pluggable evaluator that scores an invocation against an eval case.
UrlFilter: Policy for deciding whether a remote URL is safe to fetch.

Functions§

check_gate: Check evaluation results against gate configuration.
load_eval_set_yaml: Load an EvalSet from a YAML file.
validate_eval_case: Validate a single EvalCase against the case-load rules.
validate_eval_set: Validate an entire EvalSet, short-circuiting on the first invalid case.
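
validate_eval_set is described as short-circuiting on the first invalid case. The pattern can be sketched with `Iterator::try_for_each`, which stops at the first `Err`. The types and rules below (`SketchCase`, `SketchValidationError`, non-empty id and prompt) are hypothetical; the crate's actual case-load rules are not listed on this page.

```rust
/// Hypothetical case shape used only for this sketch.
#[derive(Debug)]
pub struct SketchCase {
    pub id: String,
    pub prompt: String,
}

/// Hypothetical validation errors; the real rules are not documented here.
#[derive(Debug, PartialEq, Eq)]
pub enum SketchValidationError {
    EmptyId,
    EmptyPrompt { id: String },
}

/// Validates one case against two illustrative rules.
pub fn validate_case(case: &SketchCase) -> Result<(), SketchValidationError> {
    if case.id.trim().is_empty() {
        return Err(SketchValidationError::EmptyId);
    }
    if case.prompt.trim().is_empty() {
        return Err(SketchValidationError::EmptyPrompt { id: case.id.clone() });
    }
    Ok(())
}

/// Validates cases in order and returns the first error, skipping the rest.
pub fn validate_set(cases: &[SketchCase]) -> Result<(), SketchValidationError> {
    cases.iter().try_for_each(validate_case)
}
```

Short-circuiting reports only the first problem, so a caller who wants every error at once would collect the results of `validate_case` instead.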

Type Aliases§

StateCapture: Callback that captures the environment state after an agent run completes.
