adk-eval
Agent evaluation framework for the Rust Agent Development Kit (ADK-Rust).
Overview
adk-eval provides comprehensive tools for testing and validating agent behavior, enabling developers to ensure their agents perform correctly and consistently. Unlike traditional software testing, agent evaluation must account for the probabilistic nature of LLMs while still providing meaningful quality signals.
Features
- Test Definitions: Structured JSON format for defining test cases (`.test.json`)
- Trajectory Evaluation: Validate tool call sequences with exact or partial matching
- Response Quality: Assess final output quality using multiple algorithms
- LLM-Judged Evaluation: Semantic matching, rubric-based scoring, and safety checks
- Multiple Criteria: Ground truth, similarity-based, and configurable thresholds
- Detailed Reporting: Comprehensive results with failure analysis
- Structured LLM Judge: Typed verdicts (pass/fail/partial) with scores and reasoning
- Embedding Similarity: Cosine similarity between embedding vectors (feature: `embedding`)
- Cost & Latency Tracking: Token usage extraction, dollar cost estimation, latency recording
- Trace Analysis: Detect redundant tool calls, execution loops, compute efficiency scores
- Regression Baselines: Save/load metric snapshots, detect quality degradation
- JUnit XML Output: CI-friendly report generation (feature: `ci-helpers`)
- Human Annotation: JSONL export/import workflow for human review
- A/B Comparison: Statistical significance testing with Wilcoxon signed-rank (feature: `statistics`)
- Test Case Generation: LLM-driven or event-based eval case creation
- Conversation Metrics: Multi-turn scoring for context retention, goal completion, coherence, topic drift
- CLI Integration: `cargo adk eval` with baselines, regression checks, and parallel execution
Quick Start
use ;
use std::sync::Arc;
async
Test File Format
Test files use JSON format with the following structure:
Evaluation Criteria
Tool Trajectory Matching
Validates that the agent calls expected tools:
let criteria = EvaluationCriteria ;
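To make the difference between exact and partial matching concrete, here is a minimal standalone sketch. It does not use the adk-eval API: `ToolCall`, `exact_match`, and `in_order_match` are hypothetical names, and reading "partial" as "expected calls appear in order, with extra calls tolerated" is an assumption rather than the crate's documented behavior.

```rust
/// A recorded tool invocation: tool name plus serialized arguments.
/// (Hypothetical type for illustration; not part of adk-eval.)
#[derive(Debug, Clone, PartialEq)]
pub struct ToolCall {
    pub name: String,
    pub args: String,
}

impl ToolCall {
    pub fn new(name: &str, args: &str) -> Self {
        Self { name: name.to_string(), args: args.to_string() }
    }
}

/// Exact matching: the same calls, in the same order, with nothing extra.
pub fn exact_match(expected: &[ToolCall], actual: &[ToolCall]) -> bool {
    expected == actual
}

/// Partial (in-order) matching: every expected call appears in `actual`
/// in the same relative order; extra calls in between are tolerated.
pub fn in_order_match(expected: &[ToolCall], actual: &[ToolCall]) -> bool {
    let mut remaining = actual.iter();
    // `any` advances the shared iterator, so each expected call must be
    // found after the position where the previous one was found.
    expected.iter().all(|want| remaining.any(|got| got == want))
}
```

With this reading, an agent that inserts an additional lookup between two expected calls fails the exact check but passes the partial one, while an agent that calls the expected tools in the wrong order fails both.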
Response Similarity
Compare response text using various algorithms:
let criteria = EvaluationCriteria ;
Available similarity algorithms:
- `Exact` - Exact string match
- `Contains` - Substring check
- `Levenshtein` - Edit distance
- `Jaccard` - Word overlap (default)
- `Rouge1` - Unigram overlap
- `Rouge2` - Bigram overlap
- `RougeL` - Longest common subsequence
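As a rough illustration of two of these measures, the sketch below implements word-level Jaccard overlap and character-level Levenshtein distance using only the standard library. The function names are hypothetical, and details such as lowercasing and whitespace tokenization are assumptions; the crate's own implementations may normalize text differently.

```rust
use std::collections::HashSet;

/// Jaccard similarity over lowercase, whitespace-separated words:
/// size of the intersection divided by size of the union.
/// Two empty strings are treated as identical (1.0).
pub fn jaccard_similarity(a: &str, b: &str) -> f64 {
    let a_lower = a.to_lowercase();
    let b_lower = b.to_lowercase();
    let set_a: HashSet<&str> = a_lower.split_whitespace().collect();
    let set_b: HashSet<&str> = b_lower.split_whitespace().collect();
    let union = set_a.union(&set_b).count();
    if union == 0 {
        return 1.0;
    }
    set_a.intersection(&set_b).count() as f64 / union as f64
}

/// Levenshtein edit distance between two strings, counted in characters,
/// computed with the standard two-row dynamic programming table.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // Distance from the empty prefix of `a` to each prefix of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != *cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b_chars.len()]
}
```

Jaccard ignores word order and repetition, which makes it forgiving of rephrasing; Levenshtein is sensitive to every character, so it suits short, tightly specified outputs better than free-form answers.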
LLM-Judged Semantic Matching
Use an LLM to judge semantic equivalence between expected and actual responses:
use ;
use GeminiModel;
use std::sync::Arc;
// Create evaluator with LLM judge
let judge_model = new;
let config = with_criteria;
let evaluator = with_llm_judge;
Rubric-Based Evaluation
Evaluate responses against custom rubrics:
use ;
let criteria = default
.with_rubrics;
Safety and Hallucination Detection
Check responses for safety issues and hallucinations:
let criteria = EvaluationCriteria ;
Result Reporting
let report = evaluator.evaluate_file.await?;
// Summary
println!;
println!;
println!;
println!;
// Detailed failures
for result in report.failures
// Export to JSON
let json = report.to_json?;
Batch Evaluation
Evaluate multiple test cases in parallel:
let results = evaluator
.evaluate_cases_parallel // 4 concurrent
.await;
Evaluate all test files in a directory:
let reports = evaluator
.evaluate_directory
.await?;
Integration with cargo test
async
Advanced Features
Feature Flags
[dependencies]
adk-eval = { version = "1.0", features = ["embedding", "ci-helpers", "statistics"] }
| Feature | Dependency | Capability |
|---|---|---|
| `embedding` | `adk-memory` | Embedding-based semantic similarity |
| `ci-helpers` | `quick-xml` | JUnit XML report generation |
| `statistics` | `statrs` | Wilcoxon signed-rank for A/B comparison |
All other features (structured judge, cost tracker, trace analyzer, baselines, annotations, test generator, conversation scorer) work without extra feature flags.
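The `embedding` feature is described above as cosine similarity between embedding vectors. For readers unfamiliar with the measure, here is a small standalone sketch; `cosine_similarity` is a hypothetical helper written for this illustration and is not taken from the crate.

```rust
/// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
/// Returns `None` when the vectors are empty, differ in length, or either
/// has zero norm, since the measure is undefined in those cases.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Because the result depends only on direction, not magnitude, two vectors pointing the same way score close to 1.0 regardless of scale, and orthogonal vectors score 0.0.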
Structured LLM Judge
use StructuredJudge;
let judge = new;
let verdict = judge.judge.await?;
// → StructuredVerdict { score: 0.85, verdict: Partial, reasoning: "..." }
Cost and Latency Tracking
use CostTracker;
let tracker = new;
let cost = tracker.compute_cost; // → Some($0.013)
let metrics = tracker.extract_metrics;
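The arithmetic behind a dollar cost estimate can be sketched as follows. This is not the `CostTracker` implementation: `ModelPrice`, `estimate_cost`, the per-million-token pricing unit, and the model name used in the usage note are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Price in US dollars per one million tokens.
/// (Hypothetical type; real pricing tables will differ.)
#[derive(Debug, Clone, Copy)]
pub struct ModelPrice {
    pub input_per_million: f64,
    pub output_per_million: f64,
}

/// Estimates the dollar cost of one call from its token usage.
/// Returns `None` when no price is known for `model`, mirroring the
/// optional result shown in the snippet above.
pub fn estimate_cost(
    prices: &HashMap<String, ModelPrice>,
    model: &str,
    input_tokens: u64,
    output_tokens: u64,
) -> Option<f64> {
    let price = prices.get(model)?;
    let input_cost = input_tokens as f64 / 1_000_000.0 * price.input_per_million;
    let output_cost = output_tokens as f64 / 1_000_000.0 * price.output_per_million;
    Some(input_cost + output_cost)
}
```

For example, with a made-up price of $1.00 per million input tokens and $2.00 per million output tokens, a call using 1,000,000 input and 500,000 output tokens is estimated at $2.00. Returning an `Option` keeps unknown models from silently being reported as free.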
Execution Trace Analysis
use TraceAnalyzer;
let analyzer = new;
let analysis = analyzer.analyze;
println!;
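One plausible way to detect redundant tool calls is to count repeated (tool name, arguments) pairs, as in the sketch below. `TraceSummary` and `summarize_trace` are hypothetical names, and defining the efficiency score as the share of non-redundant calls is an assumption; the formula used by `TraceAnalyzer` is not stated here.

```rust
use std::collections::HashSet;

/// Summary of a tool-call trace. (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub struct TraceSummary {
    pub total_calls: usize,
    pub redundant_calls: usize,
    /// Share of calls that were not repeats, in the range 0.0 to 1.0.
    pub efficiency: f64,
}

/// Treats a call as redundant when the same (tool name, arguments) pair
/// has already appeared earlier in the trace.
pub fn summarize_trace(calls: &[(&str, &str)]) -> TraceSummary {
    let mut seen = HashSet::new();
    let mut redundant_calls = 0;
    for call in calls {
        // `insert` returns false when the pair was already present.
        if !seen.insert(*call) {
            redundant_calls += 1;
        }
    }
    let total_calls = calls.len();
    let efficiency = if total_calls == 0 {
        1.0
    } else {
        (total_calls - redundant_calls) as f64 / total_calls as f64
    };
    TraceSummary { total_calls, redundant_calls, efficiency }
}
```

A trace that issues the same search three times and one fetch has two redundant calls out of four, giving an efficiency of 0.5 under this definition.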
Regression Baselines
use BaselineStore;
let store = new;
store.save?;
let regressions = store.check_regressions?;
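The idea behind a regression check can be shown with a small comparison of metric snapshots. This sketch is independent of `BaselineStore`: `Regression`, `find_regressions`, the "higher is better" convention, and the tolerance parameter are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// A metric whose current value fell below its baseline.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq)]
pub struct Regression {
    pub metric: String,
    pub baseline: f64,
    pub current: f64,
}

/// Compares two metric snapshots, assuming higher values are better.
/// A metric regresses when it drops by more than `tolerance` below its
/// baseline. Metrics missing from `current` are skipped.
pub fn find_regressions(
    baseline: &BTreeMap<String, f64>,
    current: &BTreeMap<String, f64>,
    tolerance: f64,
) -> Vec<Regression> {
    baseline
        .iter()
        .filter_map(|(name, &base)| {
            let &now = current.get(name)?;
            (now < base - tolerance).then(|| Regression {
                metric: name.clone(),
                baseline: base,
                current: now,
            })
        })
        .collect()
}
```

A small tolerance absorbs the run-to-run noise that LLM-backed agents produce, so only drops larger than the expected variance are flagged.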
JUnit XML (CI Integration)
use JunitReporter; // requires ci-helpers feature
let xml = generate?;
Human Annotation Workflow
use AnnotationStore;
export?;
let = import?;
A/B Agent Comparison
use AbComparator; // requires statistics feature
let comparator = new;
let report = comparator.compare.await?;
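For context on the statistic itself, the following sketch computes the Wilcoxon signed-rank sums for paired per-case scores from two agents. It is a standalone illustration and not the `AbComparator` implementation: the function name is hypothetical, and dropping zero differences and averaging ranks for ties are common conventions assumed here. Turning the sums into a p-value (which the `statistics` feature presumably delegates to `statrs`) is left out.

```rust
/// Computes the Wilcoxon signed-rank sums (W+, W-) for paired samples.
/// Zero differences are dropped, and tied absolute differences share the
/// average of the ranks they span. Returns `None` if the inputs differ in
/// length or every pair is identical.
pub fn wilcoxon_signed_rank(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.len() != b.len() {
        return None;
    }
    let mut diffs: Vec<f64> = a
        .iter()
        .zip(b)
        .map(|(x, y)| x - y)
        .filter(|d| *d != 0.0)
        .collect();
    if diffs.is_empty() {
        return None;
    }
    // Rank the differences by absolute value, smallest first.
    diffs.sort_by(|x, y| x.abs().total_cmp(&y.abs()));

    let (mut w_plus, mut w_minus) = (0.0, 0.0);
    let mut i = 0;
    while i < diffs.len() {
        // Find the end of the run of tied absolute differences.
        let mut j = i;
        while j < diffs.len() && diffs[j].abs() == diffs[i].abs() {
            j += 1;
        }
        // One-based ranks i+1 through j share their average.
        let avg_rank = (i + 1 + j) as f64 / 2.0;
        for d in &diffs[i..j] {
            if *d > 0.0 {
                w_plus += avg_rank;
            } else {
                w_minus += avg_rank;
            }
        }
        i = j;
    }
    Some((w_plus, w_minus))
}
```

If agent A mostly outscores agent B, W+ is much larger than W-; when the two sums are close, the paired scores give little evidence that either agent is better.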
Auto-Generated Test Cases
use TestGenerator;
let gen = new;
let cases = gen.generate_from_description.await?;
let cases = gen.generate_from_events?;
Multi-Turn Conversation Metrics
use ConversationScorer;
let scorer = new;
let metrics = scorer.score.await?;
// → ConversationMetrics { context_retention, goal_completion, coherence, topic_drift }
CLI
License
Apache-2.0
Part of ADK-Rust
This crate is part of the ADK-Rust framework for building AI agents in Rust.