Module eval

Available on crate feature eval only.

Agent evaluation framework.

Test and validate agent behavior.

Modules

annotation
Human annotation workflow via JSONL export/import.
baseline
Baseline storage for regression detection.
conversation_scorer
Multi-turn conversation metrics.
cost_tracker
Cost and latency tracking for evaluation runs.
criteria
Evaluation criteria definitions
error
Error types for the evaluation framework
evaluator
Core evaluator implementation
llm_judge
LLM-based evaluation scoring
optimizer
Prompt optimization engine.
prelude
Prelude for convenient imports
pricing
Per-model pricing configuration for cost estimation.
report
Evaluation result reporting
schema
Test file schema definitions
scoring
Scoring implementations for evaluation criteria
structured_judge
Structured LLM judge producing typed verdicts.
test_generator
LLM-driven test case generation.
trace_analyzer
Execution trace analysis for detecting inefficiencies.
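
The `baseline` module is described above only as "baseline storage for regression detection". As a rough illustration of that idea, the sketch below compares two metric snapshots and reports every metric that dropped by more than a tolerance. The `MetricRegression` struct, the `detect_regressions` function, and the tolerance parameter are all hypothetical names chosen for this example; they are not taken from the crate, whose `Baseline`, `BaselineStore`, and `Regression` types may work quite differently.

```rust
use std::collections::HashMap;

/// Hypothetical regression record for illustration only;
/// this is NOT the crate's `Regression` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricRegression {
    pub metric: String,
    pub baseline: f64,
    pub current: f64,
}

/// Returns every metric whose current score fell more than `tolerance`
/// below its baseline value. Metrics absent from `current` are skipped.
pub fn detect_regressions(
    baseline: &HashMap<String, f64>,
    current: &HashMap<String, f64>,
    tolerance: f64,
) -> Vec<MetricRegression> {
    let mut found: Vec<MetricRegression> = baseline
        .iter()
        .filter_map(|(name, &base)| {
            // Skip metrics that the current run did not report.
            let &cur = current.get(name)?;
            if base - cur > tolerance {
                Some(MetricRegression {
                    metric: name.clone(),
                    baseline: base,
                    current: cur,
                })
            } else {
                None
            }
        })
        .collect();
    // HashMap iteration order is unspecified, so sort for stable output.
    found.sort_by(|a, b| a.metric.cmp(&b.metric));
    found
}
```

Treating a small drop as noise (the `tolerance` argument) is one possible design choice; a stricter check would flag any decrease at all.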

Structs

AnnotationRecord
A single annotation record for human review.
AnnotationStore
Manages JSONL export and import for human annotation.
Baseline
Baseline file content containing metric snapshots.
BaselineStore
Manages baseline persistence and regression detection.
ConversationMetrics
Multi-turn conversation quality metrics.
ConversationScorer
Scores multi-turn conversations on quality metrics.
ConversationScorerConfig
Configuration for conversation scoring thresholds.
CostMetrics
Cost and latency metrics for a single evaluation turn.
CostTracker
Tracks cost and latency metrics from agent event streams.
EvalCase
A single evaluation case (test case)
EvalCaseMetadata
Metadata for generated eval cases.
EvalSet
An eval set references multiple test files
EvaluationConfig
Configuration for the evaluator
EvaluationCriteria
Collection of evaluation criteria
EvaluationReport
Complete evaluation report for a test file or eval set
EvaluationResult
Result for a single test case
Evaluator
The main evaluator struct
Failure
A single failure in evaluation
GeneratorConfig
Configuration for test case generation.
HumanVerdict
Human-provided verdict for an evaluation case.
IntermediateData
Intermediate data during a turn (tool calls, etc.)
JudgeRubric
Custom rubric for structured judging.
LlmJudge
LLM-based judge for semantic evaluation
LlmJudgeConfig
Configuration for the LLM judge
ModelPricing
Per-model pricing configuration.
OptimizationResult
Result of a prompt optimization run.
OptimizerConfig
Configuration for the prompt optimization loop.
PromptOptimizer
Iteratively improves an agent’s system instructions using an optimizer LLM and an evaluation set.
Regression
A regression detected between baseline and current run.
ResponseMatchConfig
Configuration for response matching
ResponseScorer
Scorer for response text similarity
Rubric
A single rubric for quality assessment
RubricConfig
Configuration for rubric-based evaluation
RubricEvaluationResult
Result of rubric-based evaluation
RubricScore
Score for a single rubric
ScalePoint
A single point on a rubric scoring scale.
SemanticMatchResult
Result of semantic similarity evaluation
SessionInput
Session input configuration
StructuredJudge
Structured LLM judge that produces typed verdicts.
StructuredJudgeConfig
Configuration for the structured judge.
StructuredVerdict
Verdict from the structured judge.
TestFile
A complete test file containing multiple evaluation cases
TestGenerator
Generates evaluation test cases from descriptions or event logs.
ToolCallRecord
A single tool call record for direct analysis without full Events.
ToolTrajectoryConfig
Configuration for tool trajectory matching
ToolTrajectoryScorer
Scorer for tool trajectory matching
ToolUse
A tool use (function call)
TraceAnalysis
Summary of trace analysis results.
TraceAnalyzer
Analyzes agent execution traces for inefficiencies.
TraceDiagnostic
A detected trace inefficiency.
Turn
A single turn in a conversation
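
Several of the structs above (`ToolUse`, `ToolTrajectoryConfig`, `ToolTrajectoryScorer`) concern tool trajectory matching, meaning a comparison between the tool calls an agent made and the calls a test case expected. This page does not show how the crate scores a trajectory, so the following is only a minimal sketch of one plausible strategy, strict positional matching. The `ToolCall` struct, the `call` helper, and `trajectory_score` are hypothetical names invented for this example.

```rust
/// Hypothetical tool call record: a tool name plus its raw arguments.
/// This is NOT the crate's `ToolUse` or `ToolCallRecord` type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
    pub args: String,
}

/// Convenience constructor for the example.
pub fn call(name: &str, args: &str) -> ToolCall {
    ToolCall {
        name: name.to_string(),
        args: args.to_string(),
    }
}

/// Scores a trajectory by strict positional matching: the number of
/// positions where the expected and actual calls are identical, divided
/// by the length of the longer sequence. Two empty trajectories score 1.0.
pub fn trajectory_score(expected: &[ToolCall], actual: &[ToolCall]) -> f64 {
    let longest = expected.len().max(actual.len());
    if longest == 0 {
        return 1.0;
    }
    let matched = expected
        .iter()
        .zip(actual.iter())
        .filter(|(e, a)| e == a)
        .count();
    matched as f64 / longest as f64
}
```

Positional matching penalizes an extra call in the middle of a trajectory, because every later call shifts by one position. An in-order subsequence match or an unordered set comparison would be more lenient alternatives.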

Enums

EvalError
Errors that can occur during evaluation
TracePattern
Types of trace inefficiency patterns.
Verdict
Categorical outcome of a structured judgment.

Type Aliases

Result
Result type alias for evaluation operations
TestCaseResult
Result for a single test case (alias for backward compatibility)
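
A module-level `Result` alias paired with an error enum is a common Rust convention, in which the alias fixes the error type so that function signatures stay short. The sketch below shows that pattern in isolation. It assumes the alias expands to `std::result::Result<T, EvalError>`, which this page does not confirm, and both the `InvalidTestFile` variant and the `parse_threshold` function are hypothetical stand-ins rather than the crate's real definitions.

```rust
use std::fmt;

/// Hypothetical stand-in for the module's `EvalError` enum.
/// The real variants are not listed on this page.
#[derive(Debug)]
pub enum EvalError {
    InvalidTestFile(String),
}

impl fmt::Display for EvalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            EvalError::InvalidTestFile(msg) => write!(f, "invalid test file: {}", msg),
        }
    }
}

impl std::error::Error for EvalError {}

/// Assumed shape of the alias: the error type is fixed to `EvalError`.
pub type Result<T> = std::result::Result<T, EvalError>;

/// Hypothetical helper that returns the aliased `Result`.
pub fn parse_threshold(raw: &str) -> Result<f64> {
    raw.trim()
        .parse::<f64>()
        .map_err(|e| EvalError::InvalidTestFile(format!("bad threshold {:?}: {}", raw, e)))
}
```

Callers can then propagate failures with `?` from any function that also returns this `Result`.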