Crate swink_agent_eval

Evaluation framework for swink-agent.

Provides trajectory tracing, golden path verification, response matching, and cost/latency governance for LLM-powered agent loops.

§Quick Start

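use swink_agent_eval::{EvalRunner, EvalSet, EvalCase};

// An empty set for illustration; real sets list `EvalCase` values in `cases`.
let set = EvalSet { id: "demo".into(), name: "Demo".into(), description: None, cases: vec![] };
let runner = EvalRunner::with_defaults();
// `my_factory` is assumed to be a caller-supplied `AgentFactory` (not shown).
// The call must sit inside an async function whose error type accepts `?`.
let result = runner.run_set(&set, &my_factory).await?;
println!("Passed: {}/{}", result.summary.passed, result.summary.total_cases);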

Re-exports§

pub use aggregator::Aggregator;
pub use aggregator::AllPass;
pub use aggregator::AnyPass;
pub use aggregator::Average;
pub use aggregator::Weighted;
pub use cache::CacheKey as TaskResultCacheKey;
pub use cache::EvaluationDataStore;
pub use cache::FingerprintContext;
pub use cache::LocalFileTaskResultStore;
pub use cache::StoreError;
pub use cache::canonicalize_fingerprint;
pub use cache::tool_set_hash;
pub use evaluators::agent::AgentToneEvaluator;
pub use evaluators::agent::InteractionsEvaluator;
pub use evaluators::agent::KnowledgeRetentionEvaluator;
pub use evaluators::agent::LanguageDetectionEvaluator;
pub use evaluators::agent::PerceivedErrorEvaluator;
pub use evaluators::agent::TaskCompletionEvaluator;
pub use evaluators::agent::TrajectoryAccuracyEvaluator;
pub use evaluators::agent::TrajectoryAccuracyWithRefEvaluator;
pub use evaluators::agent::UserSatisfactionEvaluator;
pub use evaluators::code::llm_judge::CodeLlmJudgeEvaluator;
pub use evaluators::code::CargoCheckEvaluator;
pub use evaluators::code::ClippyEvaluator;
pub use evaluators::code::CodeExtractor;
pub use evaluators::code::CodeExtractorStrategy;
pub use evaluators::code::SandboxLimits;
pub use evaluators::code::SandboxOutcome;
pub use evaluators::code::SandboxRunner;
pub use evaluators::code::SandboxedExecutionEvaluator;
pub use evaluators::code::ShellRunner;
pub use evaluators::code::run_sandboxed;
pub use evaluators::multimodal::ImageSafetyEvaluator;
pub use evaluators::quality::CoherenceEvaluator;
pub use evaluators::quality::ConcisenessEvaluator;
pub use evaluators::quality::CorrectnessEvaluator;
pub use evaluators::quality::FaithfulnessEvaluator;
pub use evaluators::quality::GoalSuccessRateEvaluator;
pub use evaluators::quality::HallucinationEvaluator;
pub use evaluators::quality::HelpfulnessEvaluator;
pub use evaluators::quality::LazinessEvaluator;
pub use evaluators::quality::PlanAdherenceEvaluator;
pub use evaluators::quality::ResponseRelevanceEvaluator;
pub use evaluators::quality::assertion_implies_goal_completion;
pub use evaluators::rag::DEFAULT_EMBEDDING_SIMILARITY_THRESHOLD;
pub use evaluators::rag::Embedder;
pub use evaluators::rag::EmbedderError;
pub use evaluators::rag::EmbeddingSimilarityEvaluator;
pub use evaluators::rag::RAGGroundednessEvaluator;
pub use evaluators::rag::RAGHelpfulnessEvaluator;
pub use evaluators::rag::RAGRetrievalRelevanceEvaluator;
pub use evaluators::safety::CodeInjectionEvaluator;
pub use evaluators::safety::FairnessEvaluator;
pub use evaluators::safety::HarmfulnessEvaluator;
pub use evaluators::safety::PIIClass;
pub use evaluators::safety::PIILeakageEvaluator;
pub use evaluators::safety::PromptInjectionEvaluator;
pub use evaluators::safety::ToxicityEvaluator;
pub use evaluators::simple::ExactMatchEvaluator;
pub use evaluators::simple::LevenshteinDistanceEvaluator;
pub use evaluators::structured::JsonMatchEvaluator;
pub use evaluators::structured::JsonSchemaEvaluator;
pub use evaluators::structured::KeyStrategy;
pub use evaluators::Detail;
pub use evaluators::DetailBuffer;
pub use evaluators::DispatchError;
pub use evaluators::DispatchOutcome;
pub use evaluators::EvaluatorError;
pub use evaluators::JudgeEvaluatorBuilder;
pub use evaluators::JudgeEvaluatorConfig;
pub use evaluators::dispatch_judge;
pub use evaluators::drive_judge_call;
pub use evaluators::evaluate_with_builtin;
pub use evaluators::finish_metric_result;
pub use evaluators::materialize_case_attachments;
pub use judge::CacheKey;
pub use judge::DEFAULT_JUDGE_CACHE_CAPACITY;
pub use judge::JudgeCache;
pub use judge::JudgeClient;
pub use judge::JudgeError;
pub use judge::JudgeFuture;
pub use judge::JudgeRegistry;
pub use judge::JudgeRegistryBuilder;
pub use judge::JudgeRegistryError;
pub use judge::JudgeVerdict;
pub use judge::RetryPolicy;
pub use prompt::BUILTIN_TEMPLATE_VERSIONS;
pub use prompt::JudgePromptTemplate;
pub use prompt::MinijinjaTemplate;
pub use prompt::PromptContext;
pub use prompt::PromptError;
pub use prompt::PromptFamily;
pub use prompt::PromptTemplateRegistry;
pub use report::HtmlReporter;
pub use report::ConsoleReporter;
pub use report::JsonReporter;
pub use report::MarkdownReporter;
pub use report::Reporter;
pub use report::ReporterError;
pub use report::ReporterOutput;
pub use report::LangSmithExportError;
pub use report::LangSmithExporter;
pub use telemetry::EvalsTelemetry;
pub use telemetry::EvalsTelemetryBuilder;
pub use testing::MockJudge;
pub use testing::PanickingMockJudge;
pub use testing::SlowMockJudge;
pub use trace::LangfuseTraceProvider;
pub use trace::OtlpHttpTraceProvider;
pub use training::ChatMlExporter;
pub use training::DpoExporter;
pub use training::ExportError;
pub use training::ExportOptions;
pub use training::ScoredTrace;
pub use training::ShareGptExporter;
pub use training::TrainingExporter;
pub use training::TrainingFormat;
pub use training::TrainingReporter;
pub use training::export_traces;

Modules§

aggregator
Aggregation strategies for combining evaluator outputs.
cache
Eval runner cache abstractions.
ci
Bundled CI templates for swink-eval consumers (spec 043 T156-T160).
evaluators
Extended evaluator families for advanced eval features.
generation
Experiment generation primitives.
judge
LLM-as-judge client trait and registry for semantic evaluators.
prompt
Prompt templates and rendering infrastructure for judge-backed evaluators.
report
Reporters and export surfaces for eval results.
simulation
Multi-turn simulation support for eval case generation and replay.
telemetry
OpenTelemetry integration for eval runs (spec 043 US7, FR-035).
testing
Test doubles and helpers for the evaluation framework.
trace
Trace ingestion providers, mappers, and extractors (spec 043 US6).
training
RL-compatible training-format trace export (feature: training-export).
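
The aggregator module is described above only as "aggregation strategies for combining evaluator outputs". As a rough illustration of what strategies named AllPass, AnyPass, Average, and Weighted could compute, here is a minimal standalone sketch. The `Sample` type and all four functions are hypothetical and do not use the crate's `Aggregator` trait, whose actual signature is not shown on this page.

```rust
/// Hypothetical stand-in for one evaluator output: a score in [0.0, 1.0]
/// plus whether it met its threshold. Not the crate's actual type.
#[derive(Debug, Clone, Copy)]
struct Sample {
    score: f64,
    passed: bool,
}

/// Passes only if every sample passed (cf. `AllPass`).
/// Note: vacuously true for an empty slice.
fn all_pass(samples: &[Sample]) -> bool {
    samples.iter().all(|s| s.passed)
}

/// Passes if at least one sample passed (cf. `AnyPass`).
fn any_pass(samples: &[Sample]) -> bool {
    samples.iter().any(|s| s.passed)
}

/// Arithmetic mean of the scores (cf. `Average`); `None` for empty input.
fn average(samples: &[Sample]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let sum: f64 = samples.iter().map(|s| s.score).sum();
    Some(sum / samples.len() as f64)
}

/// Weighted mean of the scores (cf. `Weighted`).
/// Returns `None` if the lengths differ or the total weight is not positive.
fn weighted(samples: &[Sample], weights: &[f64]) -> Option<f64> {
    if samples.len() != weights.len() {
        return None;
    }
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return None;
    }
    let sum: f64 = samples
        .iter()
        .zip(weights)
        .map(|(s, w)| s.score * w)
        .sum();
    Some(sum / total)
}
```

With two samples scoring 1.0 (passed) and 0.5 (failed), `all_pass` is false, `any_pass` is true, `average` yields 0.75, and `weighted` with weights `[3.0, 1.0]` yields 0.875.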

Macros§

impl_judge_evaluator_builder
Convenience macro that implements JudgeEvaluatorBuilder for a struct holding a config: JudgeEvaluatorConfig field.

Structs§

Assertion
Judge-evaluated assertion expected to hold after an agent invocation.
AuditedInvocation
An Invocation wrapped with a hash chain for tamper detection.
BudgetConstraints
Budget constraints for cost and latency governance.
BudgetEvaluator
Evaluator that checks cost, token, and turn budgets.
CaseFingerprint
Canonical serializable projection of an EvalCase used for deterministic session IDs and future cache keys.
DefaultUrlFilter
Default SSRF-oriented filter for remote eval assets.
EfficiencyEvaluator
Evaluator that scores trajectory efficiency based on duplicate tool calls and step count relative to an ideal.
EnvironmentState
Named snapshot of an environment state produced by a StateCapture.
EnvironmentStateEvaluator
Deterministic evaluator for environment-side effects.
EvalCase
A single evaluation scenario.
EvalCaseResult
Result of evaluating a single case.
EvalMetricResult
Per-evaluator result for a single case.
EvalRunner
Orchestrates evaluation: runs agents, captures trajectories, and scores results. Default: sequential, num_runs=1, no cache, no cancellation.
EvalSet
A named collection of evaluation cases.
EvalSetResult
Result of evaluating an entire eval set.
EvalSummary
Aggregated statistics for an eval set run.
EvaluatorRegistry
Registry of named evaluators, stored as Arc<dyn Evaluator>.
ExpectedToolCall
A single expected tool invocation in a golden path.
FewShotExample
Example shown to a judge prompt before the case being evaluated.
FsEvalStore
Filesystem-backed eval store using JSON files.
GateConfig
Configuration for CI/CD gate checks against evaluation results.
GateResult
Result of a CI/CD gate check.
InteractionExpectation
Expected interaction between agents, tools, or hand-off participants.
Invocation
Complete trace of an agent run, built by TrajectoryCollector.
MaterializedAttachment
Bytes ready for judge-client payload construction.
RecordedToolCall
A tool call as captured from the agent event stream.
ResponseMatcher
Evaluator that scores the final response text against expected criteria.
RunnerMetricSample
Aggregated per-(case, evaluator) sample surfaced by EvalRunner::with_num_runs. std_dev over the samples quantifies judge non-determinism (research §R-013).
Score
A numeric score in [0.0, 1.0] with a configurable pass threshold.
SemanticToolParameterEvaluator
Semantic tool-parameter evaluator backed by a JudgeClient.
SemanticToolSelectionEvaluator
Semantic tool-selection evaluator backed by a JudgeClient.
ToolIntent
Expected semantic tool intent used by the tool-parameter semantic evaluator.
TrajectoryCollector
Builds an Invocation from a stream of AgentEvents.
TrajectoryMatcher
Evaluator that compares actual tool call trajectories against expected golden paths.
TurnRecord
A single recorded turn from an agent run.
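
AuditedInvocation is described above as an Invocation "wrapped with a hash chain for tamper detection". The general technique can be sketched as follows, assuming each link hashes its payload together with the previous link's hash. `ChainEntry`, `link_hash`, `build_chain`, and `verify_chain` are hypothetical names, and `DefaultHasher` is a non-cryptographic stand-in chosen only to keep the example dependency-free; the crate's actual hash function and layout are not shown on this page.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One link in the chain: a payload plus the hash that commits to it and
/// to the previous link. Hypothetical type, not `AuditedInvocation`.
#[derive(Debug, Clone)]
struct ChainEntry {
    payload: String,
    prev_hash: u64,
    hash: u64,
}

/// Hash the previous link's hash together with this payload.
/// `DefaultHasher` is NOT cryptographic; a real audit trail would use
/// something like SHA-256.
fn link_hash(prev_hash: u64, payload: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prev_hash.hash(&mut hasher);
    payload.hash(&mut hasher);
    hasher.finish()
}

/// Build a chain over the payloads, starting from a zero genesis hash.
fn build_chain(payloads: &[&str]) -> Vec<ChainEntry> {
    let mut prev_hash = 0u64;
    payloads
        .iter()
        .map(|p| {
            let hash = link_hash(prev_hash, p);
            let entry = ChainEntry {
                payload: p.to_string(),
                prev_hash,
                hash,
            };
            prev_hash = hash;
            entry
        })
        .collect()
}

/// Recompute every link; an edited payload or a reordered entry makes
/// the stored hashes disagree with the recomputed ones.
fn verify_chain(chain: &[ChainEntry]) -> bool {
    let mut expected_prev = 0u64;
    for entry in chain {
        let recomputed = link_hash(entry.prev_hash, &entry.payload);
        if entry.prev_hash != expected_prev || entry.hash != recomputed {
            return false;
        }
        expected_prev = entry.hash;
    }
    true
}
```

Because each hash covers the previous one, changing any earlier entry invalidates that entry and every later link, so a verifier only needs to trust the final hash.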

Enums§

AssertionKind
Assertion categories used by judge-backed evaluators.
Attachment
Multimodal attachment reference attached to an evaluation case.
AttachmentError
Structured attachment materialization errors.
EvalError
The top-level error type for eval operations.
MatchMode
How to compare actual tool calls against expected.
ResponseCriteria
Criteria for matching the final response text.
Verdict
Binary pass/fail outcome derived from a Score.
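
To make the relationship between Score ("a numeric score in [0.0, 1.0] with a configurable pass threshold") and Verdict ("binary pass/fail outcome derived from a Score") concrete, here is a small standalone sketch. `SketchScore` and `SketchVerdict` are hypothetical types; the clamping behaviour and the `>=` comparison are assumptions, not the crate's documented semantics.

```rust
/// Hypothetical pass/fail outcome, mirroring the idea behind `Verdict`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchVerdict {
    Pass,
    Fail,
}

/// Hypothetical score: a value in [0.0, 1.0] plus a pass threshold.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SketchScore {
    value: f64,
    threshold: f64,
}

impl SketchScore {
    /// Clamp both fields into [0.0, 1.0] (an assumption made for this sketch).
    fn new(value: f64, threshold: f64) -> Self {
        Self {
            value: value.clamp(0.0, 1.0),
            threshold: threshold.clamp(0.0, 1.0),
        }
    }

    /// Derive the binary outcome: pass when the value meets the threshold.
    /// A NaN value compares false and therefore fails.
    fn verdict(&self) -> SketchVerdict {
        if self.value >= self.threshold {
            SketchVerdict::Pass
        } else {
            SketchVerdict::Fail
        }
    }
}
```

Under these assumptions, `SketchScore::new(0.8, 0.5).verdict()` is `Pass`, `SketchScore::new(0.4, 0.5).verdict()` is `Fail`, and an out-of-range input such as 1.7 is stored as 1.0.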

Constants§

CASE_NAMESPACE
Stable namespace for deterministic case-derived session IDs.

Traits§

AgentFactory
Factory that creates a configured Agent for each eval case.
EvalStore
Persistence interface for eval sets and results.
Evaluator
Pluggable evaluator that scores an invocation against an eval case.
UrlFilter
Policy for deciding whether a remote URL is safe to fetch.

Functions§

check_gate
Check evaluation results against gate configuration.
load_eval_set_yaml
Load an EvalSet from a YAML file.
validate_eval_case
Validate a single EvalCase against the case-load rules.
validate_eval_set
Validate an entire EvalSet, short-circuiting on the first invalid case.

Type Aliases§

StateCapture
Callback that captures the environment state after an agent run completes.