Evaluation framework for swink-agent.
Provides trajectory tracing, golden path verification, response matching, and cost/latency governance for LLM-powered agent loops.
§Quick Start
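use swink_agent_eval::{EvalRunner, EvalSet, EvalCase};
// An eval set with no cases; populate `cases` with `EvalCase` values in real use.
let set = EvalSet { id: "demo".into(), name: "Demo".into(), description: None, cases: vec![] };
let runner = EvalRunner::with_defaults();
// `my_factory` is assumed to be a caller-supplied `AgentFactory` implementation.
// Because of `.await?`, this snippet must run inside an async function returning a `Result`.
let result = runner.run_set(&set, &my_factory).await?;
println!("Passed: {}/{}", result.summary.passed, result.summary.total_cases);
Re-exports§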
pub use aggregator::Aggregator;
pub use aggregator::AllPass;
pub use aggregator::AnyPass;
pub use aggregator::Average;
pub use aggregator::Weighted;
pub use cache::CacheKey as TaskResultCacheKey;
pub use cache::EvaluationDataStore;
pub use cache::FingerprintContext;
pub use cache::LocalFileTaskResultStore;
pub use cache::StoreError;
pub use cache::canonicalize_fingerprint;
pub use cache::tool_set_hash;
pub use evaluators::agent::AgentToneEvaluator;
pub use evaluators::agent::InteractionsEvaluator;
pub use evaluators::agent::KnowledgeRetentionEvaluator;
pub use evaluators::agent::LanguageDetectionEvaluator;
pub use evaluators::agent::PerceivedErrorEvaluator;
pub use evaluators::agent::TaskCompletionEvaluator;
pub use evaluators::agent::TrajectoryAccuracyEvaluator;
pub use evaluators::agent::TrajectoryAccuracyWithRefEvaluator;
pub use evaluators::agent::UserSatisfactionEvaluator;
pub use evaluators::code::llm_judge::CodeLlmJudgeEvaluator;
pub use evaluators::code::CargoCheckEvaluator;
pub use evaluators::code::ClippyEvaluator;
pub use evaluators::code::CodeExtractor;
pub use evaluators::code::CodeExtractorStrategy;
pub use evaluators::code::SandboxLimits;
pub use evaluators::code::SandboxOutcome;
pub use evaluators::code::SandboxRunner;
pub use evaluators::code::SandboxedExecutionEvaluator;
pub use evaluators::code::ShellRunner;
pub use evaluators::code::run_sandboxed;
pub use evaluators::multimodal::ImageSafetyEvaluator;
pub use evaluators::quality::CoherenceEvaluator;
pub use evaluators::quality::ConcisenessEvaluator;
pub use evaluators::quality::CorrectnessEvaluator;
pub use evaluators::quality::FaithfulnessEvaluator;
pub use evaluators::quality::GoalSuccessRateEvaluator;
pub use evaluators::quality::HallucinationEvaluator;
pub use evaluators::quality::HelpfulnessEvaluator;
pub use evaluators::quality::LazinessEvaluator;
pub use evaluators::quality::PlanAdherenceEvaluator;
pub use evaluators::quality::ResponseRelevanceEvaluator;
pub use evaluators::quality::assertion_implies_goal_completion;
pub use evaluators::rag::DEFAULT_EMBEDDING_SIMILARITY_THRESHOLD;
pub use evaluators::rag::Embedder;
pub use evaluators::rag::EmbedderError;
pub use evaluators::rag::EmbeddingSimilarityEvaluator;
pub use evaluators::rag::RAGGroundednessEvaluator;
pub use evaluators::rag::RAGHelpfulnessEvaluator;
pub use evaluators::rag::RAGRetrievalRelevanceEvaluator;
pub use evaluators::safety::CodeInjectionEvaluator;
pub use evaluators::safety::FairnessEvaluator;
pub use evaluators::safety::HarmfulnessEvaluator;
pub use evaluators::safety::PIIClass;
pub use evaluators::safety::PIILeakageEvaluator;
pub use evaluators::safety::PromptInjectionEvaluator;
pub use evaluators::safety::ToxicityEvaluator;
pub use evaluators::simple::ExactMatchEvaluator;
pub use evaluators::simple::LevenshteinDistanceEvaluator;
pub use evaluators::structured::JsonMatchEvaluator;
pub use evaluators::structured::JsonSchemaEvaluator;
pub use evaluators::structured::KeyStrategy;
pub use evaluators::Detail;
pub use evaluators::DetailBuffer;
pub use evaluators::DispatchError;
pub use evaluators::DispatchOutcome;
pub use evaluators::EvaluatorError;
pub use evaluators::JudgeEvaluatorBuilder;
pub use evaluators::JudgeEvaluatorConfig;
pub use evaluators::dispatch_judge;
pub use evaluators::drive_judge_call;
pub use evaluators::evaluate_with_builtin;
pub use evaluators::finish_metric_result;
pub use evaluators::materialize_case_attachments;
pub use judge::CacheKey;
pub use judge::DEFAULT_JUDGE_CACHE_CAPACITY;
pub use judge::JudgeCache;
pub use judge::JudgeClient;
pub use judge::JudgeError;
pub use judge::JudgeFuture;
pub use judge::JudgeRegistry;
pub use judge::JudgeRegistryBuilder;
pub use judge::JudgeRegistryError;
pub use judge::JudgeVerdict;
pub use judge::RetryPolicy;
pub use prompt::BUILTIN_TEMPLATE_VERSIONS;
pub use prompt::JudgePromptTemplate;
pub use prompt::MinijinjaTemplate;
pub use prompt::PromptContext;
pub use prompt::PromptError;
pub use prompt::PromptFamily;
pub use prompt::PromptTemplateRegistry;
pub use report::HtmlReporter;
pub use report::ConsoleReporter;
pub use report::JsonReporter;
pub use report::MarkdownReporter;
pub use report::Reporter;
pub use report::ReporterError;
pub use report::ReporterOutput;
pub use report::LangSmithExportError;
pub use report::LangSmithExporter;
pub use telemetry::EvalsTelemetry;
pub use telemetry::EvalsTelemetryBuilder;
pub use testing::MockJudge;
pub use testing::PanickingMockJudge;
pub use testing::SlowMockJudge;
pub use trace::LangfuseTraceProvider;
pub use trace::OtlpHttpTraceProvider;
pub use training::ChatMlExporter;
pub use training::DpoExporter;
pub use training::ExportError;
pub use training::ExportOptions;
pub use training::ScoredTrace;
pub use training::TrainingExporter;
pub use training::TrainingFormat;
pub use training::TrainingReporter;
pub use training::export_traces;
Modules§
- aggregator
- Aggregation strategies for combining evaluator outputs.
- cache
- Eval runner cache abstractions.
- ci
- Bundled CI templates for swink-eval consumers (spec 043 T156-T160).
- evaluators
- Extended evaluator families for advanced eval features.
- generation
- Experiment generation primitives.
- judge
- LLM-as-judge client trait and registry for semantic evaluators.
- prompt
- Prompt templates and rendering infrastructure for judge-backed evaluators.
- report
- Reporters and export surfaces for eval results.
- simulation
- Multi-turn simulation support for eval case generation and replay.
- telemetry
- OpenTelemetry integration for eval runs (spec 043 US7, FR-035).
- testing
- Test doubles and helpers for the evaluation framework.
- trace
- Trace ingestion providers, mappers, and extractors (spec 043 US6).
- training
- RL-compatible training-format trace export (feature: training-export).
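The aggregator module listed above re-exports AllPass, AnyPass, Average, and Weighted. As a rough illustration of what such strategies usually compute, here is a standalone sketch. Everything in it (the Strategy enum, the aggregate function, the threshold parameter, and the choice to report pass/fail as 1.0/0.0) is hypothetical and is not taken from the crate's actual API.

```rust
// Hypothetical stand-in types; NOT the definitions from `swink_agent_eval::aggregator`.

/// Strategy for folding several evaluator scores (each in [0.0, 1.0]) into one.
#[derive(Debug, Clone)]
enum Strategy {
    /// 1.0 only if every score meets the threshold, else 0.0.
    AllPass,
    /// 1.0 if at least one score meets the threshold, else 0.0.
    AnyPass,
    /// Arithmetic mean of the scores.
    Average,
    /// Weighted mean; expects one non-negative weight per score.
    Weighted(Vec<f64>),
}

/// Returns the combined score, or `None` for empty input or unusable weights.
fn aggregate(strategy: &Strategy, scores: &[f64], threshold: f64) -> Option<f64> {
    if scores.is_empty() {
        return None;
    }
    let as_score = |pass: bool| if pass { 1.0 } else { 0.0 };
    match strategy {
        Strategy::AllPass => Some(as_score(scores.iter().all(|s| *s >= threshold))),
        Strategy::AnyPass => Some(as_score(scores.iter().any(|s| *s >= threshold))),
        Strategy::Average => Some(scores.iter().sum::<f64>() / scores.len() as f64),
        Strategy::Weighted(weights) => {
            let total: f64 = weights.iter().sum();
            // Reject mismatched lengths and weight sets that sum to zero or less.
            if weights.len() != scores.len() || total <= 0.0 {
                return None;
            }
            let weighted: f64 = scores.iter().zip(weights).map(|(s, w)| s * w).sum();
            Some(weighted / total)
        }
    }
}
```

In this sketch, AllPass and AnyPass collapse to a binary outcome, while Average and Weighted keep a graded score. Which of these behaviours the crate's own aggregators follow should be checked against the aggregator module documentation.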
Macros§
- impl_judge_evaluator_builder
- Convenience macro that implements JudgeEvaluatorBuilder for a struct holding a config: JudgeEvaluatorConfig field.
Structs§
- Assertion
- Judge-evaluated assertion expected to hold after an agent invocation.
- AuditedInvocation
- An Invocation wrapped with a hash chain for tamper detection.
- BudgetConstraints
- Budget constraints for cost and latency governance.
- BudgetEvaluator
- Evaluator that checks cost, token, and turn budgets.
- CaseFingerprint
- Canonical serializable projection of an EvalCase used for deterministic session IDs and future cache keys.
- DefaultUrlFilter
- Default SSRF-oriented filter for remote eval assets.
- EfficiencyEvaluator
- Evaluator that scores trajectory efficiency based on duplicate tool calls and step count relative to an ideal.
- EnvironmentState
- Named snapshot of an environment state produced by a StateCapture.
- EnvironmentStateEvaluator
- Deterministic evaluator for environment-side effects.
- EvalCase
- A single evaluation scenario.
- EvalCaseResult
- Result of evaluating a single case.
- EvalMetricResult
- Per-evaluator result for a single case.
- EvalRunner
- Orchestrates evaluation: runs agents, captures trajectories, and scores results. Default: sequential, num_runs=1, no cache, no cancellation.
- EvalSet
- A named collection of evaluation cases.
- EvalSetResult
- Result of evaluating an entire eval set.
- EvalSummary
- Aggregated statistics for an eval set run.
- EvaluatorRegistry
- Registry of named evaluators, stored as Arc<dyn Evaluator>.
- ExpectedToolCall
- A single expected tool invocation in a golden path.
- FewShotExample
- Example shown to a judge prompt before the case being evaluated.
- FsEvalStore
- Filesystem-backed eval store using JSON files.
- GateConfig
- Configuration for CI/CD gate checks against evaluation results.
- GateResult
- Result of a CI/CD gate check.
- InteractionExpectation
- Expected interaction between agents, tools, or hand-off participants.
- Invocation
- Complete trace of an agent run, built by TrajectoryCollector.
- MaterializedAttachment
- Bytes ready for judge-client payload construction.
- RecordedToolCall
- A tool call as captured from the agent event stream.
- ResponseMatcher
- Evaluator that scores the final response text against expected criteria.
- RunnerMetricSample
- Aggregated per-(case, evaluator) sample surfaced by EvalRunner::with_num_runs. std_dev over the samples quantifies judge non-determinism (research §R-013).
- Score
- A numeric score in [0.0, 1.0] with a configurable pass threshold.
- SemanticToolParameterEvaluator
- Semantic tool-parameter evaluator backed by a JudgeClient.
- SemanticToolSelectionEvaluator
- Semantic tool-selection evaluator backed by a JudgeClient.
- ToolIntent
- Expected semantic tool intent used by the tool-parameter semantic evaluator.
- TrajectoryCollector
- Builds an Invocation from a stream of AgentEvents.
- TrajectoryMatcher
- Evaluator that compares actual tool call trajectories against expected golden paths.
- TurnRecord
- A single recorded turn from an agent run.
Enums§
- AssertionKind
- Assertion categories used by judge-backed evaluators.
- Attachment
- Multimodal attachment reference attached to an evaluation case.
- AttachmentError
- Structured attachment materialization errors.
- EvalError
- The top-level error type for eval operations.
- MatchMode
- How to compare actual tool calls against expected.
- ResponseCriteria
- Criteria for matching the final response text.
- Verdict
- Binary pass/fail outcome derived from a Score.
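Score is described as a value in [0.0, 1.0] with a configurable pass threshold, and Verdict as the binary outcome derived from it. A minimal standalone sketch of that relationship follows. The field names, the clamping behaviour, and the inclusive >= comparison are assumptions made for illustration; the crate's real Score and Verdict types may differ on each point.

```rust
/// Hypothetical stand-in for a score kept within [0.0, 1.0] with a pass threshold.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Score {
    value: f64,
    threshold: f64,
}

/// Hypothetical stand-in for the binary outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Pass,
    Fail,
}

impl Score {
    /// Clamps both inputs into [0.0, 1.0]; NaN is treated as 0.0 (an assumption).
    fn new(value: f64, threshold: f64) -> Self {
        let clamp = |x: f64| if x.is_nan() { 0.0 } else { x.clamp(0.0, 1.0) };
        Score { value: clamp(value), threshold: clamp(threshold) }
    }

    /// Pass when the value meets or exceeds the threshold (inclusive, by assumption).
    fn verdict(&self) -> Verdict {
        if self.value >= self.threshold {
            Verdict::Pass
        } else {
            Verdict::Fail
        }
    }
}
```

Keeping the threshold on the score itself means a verdict can be derived anywhere the score travels, without consulting the evaluator that produced it.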
Constants§
- CASE_NAMESPACE
- Stable namespace for deterministic case-derived session IDs.
Traits§
- AgentFactory
- Factory that creates a configured Agent for each eval case.
- EvalStore
- Persistence interface for eval sets and results.
- Evaluator
- Pluggable evaluator that scores an invocation against an eval case.
- UrlFilter
- Policy for deciding whether a remote URL is safe to fetch.
Functions§
- check_gate
- Check evaluation results against gate configuration.
- load_eval_set_yaml
- Load an EvalSet from a YAML file.
- validate_eval_case
- Validate a single EvalCase against the case-load rules.
- validate_eval_set
- Validate an entire EvalSet, short-circuiting on the first invalid case.
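The short-circuiting behaviour described for validate_eval_set can be illustrated with a standalone sketch. The Case struct, its fields, the two rules, and the String error type are all hypothetical; the crate's actual case-load rules and error type are not listed on this page.

```rust
/// Hypothetical minimal case; the real `EvalCase` is not reproduced here.
struct Case {
    id: String,
    prompt: String,
}

/// Hypothetical rule set: ids and prompts must be non-empty after trimming.
fn validate_case(case: &Case) -> Result<(), String> {
    if case.id.trim().is_empty() {
        return Err("case id is empty".to_string());
    }
    if case.prompt.trim().is_empty() {
        return Err(format!("case `{}` has an empty prompt", case.id));
    }
    Ok(())
}

/// Stops at the first failing case and returns its error; later cases are not checked.
fn validate_set(cases: &[Case]) -> Result<(), String> {
    cases.iter().try_for_each(validate_case)
}
```

Iterator::try_for_each returns as soon as the closure yields an Err, so only the first invalid case is reported. A caller that wants every problem at once would instead collect the errors from all cases.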
Type Aliases§
- StateCapture
- Callback that captures the environment state after an agent run completes.