Agent evaluation framework.
Provides tools for measuring agent quality through repeatable test cases. Inspired by Google ADK’s built-in eval: tool trajectory comparison, response quality scoring, and composable scorers.
§Quick start
use heartbit::eval::{EvalCase, EvalRunner, EvalSummary, KeywordScorer, TrajectoryScorer};

// Each case pairs a prompt with expectations on the output and on tool use.
let cases = vec![
    EvalCase::new("greeting", "Say hello")
        .expect_output_contains("hello")
        .expect_no_tools(),
    EvalCase::new("file-read", "Read /tmp/test.txt")
        .expect_tool("read_file")
        .expect_output_contains("content"),
];

// Register the scorers to apply to each case.
let runner = EvalRunner::new()
    .scorer(TrajectoryScorer)
    .scorer(KeywordScorer);

// `agent` is not defined in this snippet; it must be constructed beforehand.
let results = runner.run(&agent, &cases).await;
let summary = EvalSummary::from_results(&results);
println!("{summary}");
Structs§
- CaseComparison - Comparison of a single case between baseline and candidate runs.
- CostScorer - Scores agent execution against a cost budget.
- EvalCase - A single evaluation test case.
- EvalComparison - Comparison of two eval runs for A/B testing and regression detection.
- EvalResult - Result of evaluating a single test case.
- EvalRunner - Runs evaluation cases against an agent and collects scored results.
- EvalSummary - Aggregate summary of multiple eval results.
- ExpectedToolCall - An expected tool call in a trajectory.
- KeywordScorer - Scores output against expected keyword presence/absence.
- LatencyScorer - Scores agent execution against a latency budget.
- SafetyScorer - Scores agent execution for guardrail safety.
- ScorerResult - Result from a single scorer.
- SimilarityScorer - Scores output similarity to a reference using unigram overlap (Rouge-1 F1).
- ToolCallCountScorer - Scores agent execution against a tool call count budget.
- TrajectoryScorer - Scores tool call trajectory against expected tool calls.
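The unigram-overlap metric named for SimilarityScorer can be illustrated in isolation. The following is a minimal standalone sketch of ROUGE-1 F1, not the crate's implementation: the function names `unigram_counts` and `rouge1_f1` are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption that may differ from what SimilarityScorer actually does.

```rust
use std::collections::HashMap;

/// Count lowercased, whitespace-separated tokens.
/// (Assumed tokenization; the real scorer may normalize differently.)
fn unigram_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// ROUGE-1 F1 between a candidate and a reference string.
/// Returns 0.0 when either side is empty or no unigram is shared.
fn rouge1_f1(candidate: &str, reference: &str) -> f64 {
    let cand = unigram_counts(candidate);
    let refc = unigram_counts(reference);
    let cand_total: usize = cand.values().sum();
    let ref_total: usize = refc.values().sum();
    if cand_total == 0 || ref_total == 0 {
        return 0.0;
    }

    // Clipped overlap: a token counts at most as often as it occurs in the reference.
    let overlap: usize = cand
        .iter()
        .map(|(tok, &n)| n.min(refc.get(tok).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }

    let precision = overlap as f64 / cand_total as f64;
    let recall = overlap as f64 / ref_total as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under these assumptions, `rouge1_f1("the cat", "the cat sat down")` has precision 1.0 (both candidate tokens appear in the reference) and recall 0.5 (two of four reference tokens are covered), giving an F1 of 2/3. Because F1 is the harmonic mean of the two, a short candidate that only partially covers the reference is penalized even when every token it contains is correct.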
Constants§
- KNOWN_SCORERS - All known scorer names supported by the eval framework.
Traits§
- EvalScorer - Pluggable scoring function for evaluation.
Functions§
- build_eval_agent - Build an eval-ready agent with event collection.
- clear_events - Clear all events from a collector.
Type Aliases§
- EventCollector - Shared event collector for eval tool call trajectory capture.