Available on crate feature eval only.
Agent evaluation framework.
Test and validate agent behavior:
- Evaluator - Run evaluation suites
- EvaluationConfig - Configure evaluation parameters
Available with feature: eval
Modules§
- annotation
- Human annotation workflow via JSONL export/import.
- baseline
- Baseline storage for regression detection.
- conversation_scorer
- Multi-turn conversation metrics.
- cost_tracker
- Cost and latency tracking for evaluation runs.
- criteria
- Evaluation criteria definitions
- error
- Error types for the evaluation framework
- evaluator
- Core evaluator implementation
- llm_judge
- LLM-based evaluation scoring
- optimizer
- Prompt optimization engine.
- prelude
- Prelude for convenient imports
- pricing
- Per-model pricing configuration for cost estimation.
- report
- Evaluation result reporting
- schema
- Test file schema definitions
- scoring
- Scoring implementations for evaluation criteria
- structured_judge
- Structured LLM judge producing typed verdicts.
- test_generator
- LLM-driven test case generation.
- trace_analyzer
- Execution trace analysis for detecting inefficiencies.
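As an illustration of what "baseline storage for regression detection" can mean in practice, here is a minimal, self-contained sketch. It does not use this crate's API: `MetricRegression`, `detect_regressions`, and the `tolerance` parameter are hypothetical names chosen for the example, and it assumes that higher scores are better.

```rust
use std::collections::HashMap;

/// Hypothetical record of a metric that got worse.
/// This is NOT the crate's `Regression` type, only an illustration.
#[derive(Debug, PartialEq)]
pub struct MetricRegression {
    pub metric: String,
    pub baseline: f64,
    pub current: f64,
}

/// Compare the scores of the current run against a stored baseline.
/// Assumes higher is better; a metric counts as regressed when it
/// drops by more than `tolerance`.
pub fn detect_regressions(
    baseline: &HashMap<String, f64>,
    current: &HashMap<String, f64>,
    tolerance: f64,
) -> Vec<MetricRegression> {
    let mut found: Vec<MetricRegression> = baseline
        .iter()
        .filter_map(|(name, &base)| {
            // Metrics missing from the current run are skipped.
            let now = *current.get(name)?;
            if base - now > tolerance {
                Some(MetricRegression {
                    metric: name.clone(),
                    baseline: base,
                    current: now,
                })
            } else {
                None
            }
        })
        .collect();
    // Sort by metric name so the output order is deterministic.
    found.sort_by(|a, b| a.metric.cmp(&b.metric));
    found
}
```

A tolerance is used so that small run-to-run noise in LLM-scored metrics is not reported as a regression; the right value depends on the metric.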
Structs§
- AnnotationRecord
- A single annotation record for human review.
- AnnotationStore
- Manages JSONL export and import for human annotation.
- Baseline
- Baseline file content containing metric snapshots.
- BaselineStore
- Manages baseline persistence and regression detection.
- ConversationMetrics
- Multi-turn conversation quality metrics.
- ConversationScorer
- Scores multi-turn conversations on quality metrics.
- ConversationScorerConfig
- Configuration for conversation scoring thresholds.
- CostMetrics
- Cost and latency metrics for a single evaluation turn.
- CostTracker
- Tracks cost and latency metrics from agent event streams.
- EvalCase
- A single evaluation case (test case)
- EvalCaseMetadata
- Metadata for generated eval cases.
- EvalSet
- An eval set references multiple test files
- EvaluationConfig
- Configuration for the evaluator
- EvaluationCriteria
- Collection of evaluation criteria
- EvaluationReport
- Complete evaluation report for a test file or eval set
- EvaluationResult
- Result for a single test case
- Evaluator
- The main evaluator struct
- Failure
- A single failure in evaluation
- GeneratorConfig
- Configuration for test case generation.
- HumanVerdict
- Human-provided verdict for an evaluation case.
- IntermediateData
- Intermediate data during a turn (tool calls, etc.)
- JudgeRubric
- Custom rubric for structured judging.
- LlmJudge
- LLM-based judge for semantic evaluation
- LlmJudgeConfig
- Configuration for the LLM judge
- ModelPricing
- Per-model pricing configuration.
- OptimizationResult
- Result of a prompt optimization run.
- OptimizerConfig
- Configuration for the prompt optimization loop.
- PromptOptimizer
- Iteratively improves an agent’s system instructions using an optimizer LLM and an evaluation set.
- Regression
- A regression detected between baseline and current run.
- ResponseMatchConfig
- Configuration for response matching
- ResponseScorer
- Scorer for response text similarity
- Rubric
- A single rubric for quality assessment
- RubricConfig
- Configuration for rubric-based evaluation
- RubricEvaluationResult
- Result of rubric-based evaluation
- RubricScore
- Score for a single rubric
- ScalePoint
- A single point on a rubric scoring scale.
- SemanticMatchResult
- Result of semantic similarity evaluation
- SessionInput
- Session input configuration
- StructuredJudge
- Structured LLM judge that produces typed verdicts.
- StructuredJudgeConfig
- Configuration for the structured judge.
- StructuredVerdict
- Verdict from the structured judge.
- TestFile
- A complete test file containing multiple evaluation cases
- TestGenerator
- Generates evaluation test cases from descriptions or event logs.
- ToolCallRecord
- A single tool call record for direct analysis without full Events.
- ToolTrajectoryConfig
- Configuration for tool trajectory matching
- ToolTrajectoryScorer
- Scorer for tool trajectory matching
- ToolUse
- A tool use (function call)
- TraceAnalysis
- Summary of trace analysis results.
- TraceAnalyzer
- Analyzes agent execution traces for inefficiencies.
- TraceDiagnostic
- A detected trace inefficiency.
- Turn
- A single turn in a conversation
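To make the idea of per-model pricing for cost estimation concrete, the following sketch shows the usual arithmetic. It is an assumption-laden illustration, not this crate's `ModelPricing` or `CostTracker` API: the type `PricePerMillion`, its fields, and the function `estimate_cost_usd` are hypothetical, and prices are assumed to be quoted in USD per one million tokens.

```rust
/// Hypothetical pricing record. The field names and the
/// "USD per million tokens" unit are assumptions for this example.
pub struct PricePerMillion {
    pub input_usd: f64,
    pub output_usd: f64,
}

/// Estimate the cost of one turn from its token counts.
/// cost = (input_tokens * input_price + output_tokens * output_price) / 1e6
pub fn estimate_cost_usd(
    price: &PricePerMillion,
    input_tokens: u64,
    output_tokens: u64,
) -> f64 {
    let input_cost = input_tokens as f64 * price.input_usd;
    let output_cost = output_tokens as f64 * price.output_usd;
    (input_cost + output_cost) / 1_000_000.0
}
```

Summing this value over every turn of a run gives a total cost estimate that can be reported alongside latency.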
Enums§
- EvalError
- Errors that can occur during evaluation
- TracePattern
- Types of trace inefficiency patterns.
- Verdict
- Categorical outcome of a structured judgment.
Type Aliases§
- Result
- Result type alias for evaluation operations
- TestCaseResult
- Result for a single test case (alias for backward compatibility)