pub struct EvalRunner { /* private fields */ }
Runs evaluation cases against an agent and collects scored results.
The runner provides helpers (event_collector, event_callback) for wiring an
OnEvent callback that captures the tool call trajectory, and scores each case
using the configured scorers.
Implementations
impl EvalRunner
pub fn new() -> Self
Create a new eval runner with no scorers.
Add scorers with EvalRunner::scorer; each one runs against every
case’s actual output and produces a ScorerResult.
Example
use heartbit_core::eval::{EvalCase, EvalRunner, KeywordScorer};
let runner = EvalRunner::new().scorer(KeywordScorer);
let case = EvalCase::new("capital", "What is the capital of France?")
    .expect_output_contains("Paris");
// No real LLM call here — score the "actual output" directly.
let result = runner.score_result(&case, "The capital of France is Paris.", &[], None);
assert!(result.passed);
pub fn scorer(self, scorer: impl EvalScorer + 'static) -> Self
Add a scorer to the runner.
pub fn with_event_collector(self, collector: EventCollector) -> Self
Attach an event collector that EvalRunner::run clears before
each case. This is required when running two or more cases with event-aware
scorers (CostScorer, LatencyScorer, SafetyScorer) against
the same collector. Without it, events accumulate across cases and
per-case budgets are incorrect from the second case onward.
Pass the same collector you wired into the agent via
EvalRunner::event_callback / build_eval_agent.
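The accumulation problem can be illustrated with a minimal, self-contained sketch. It does not use heartbit_core: `Event`, `Collector`, `total_cost`, and `run_cases` are hypothetical stand-ins, and the assumption that a collector behaves like a shared, lock-protected vec may not match the real EventCollector.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an agent event that carries a cost
/// (not the real AgentEvent).
#[derive(Clone, Debug)]
struct Event {
    cost_cents: u32,
}

/// Assumption: a collector is a shared, lock-protected vec of events.
type Collector = Arc<Mutex<Vec<Event>>>;

/// Sum the cost of every event currently held by the collector.
fn total_cost(collector: &Collector) -> u32 {
    collector.lock().unwrap().iter().map(|e| e.cost_cents).sum()
}

/// Simulate running cases that each emit one event per listed cost, and
/// return the total cost a cost-aware scorer would observe after each case.
fn run_cases(collector: &Collector, case_costs: &[Vec<u32>], clear_between: bool) -> Vec<u32> {
    let mut observed = Vec::new();
    for costs in case_costs {
        if clear_between {
            // Reset the collector so this case only sees its own events.
            collector.lock().unwrap().clear();
        }
        for &cost_cents in costs {
            collector.lock().unwrap().push(Event { cost_cents });
        }
        observed.push(total_cost(collector));
    }
    observed
}
```

With two cases costing 3 + 4 and 5, the uncleared collector reports 7 and then 12 for the second case, while the cleared one reports 7 and then 5. That gap is what would make a per-case budget check misfire from the second case onward.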
pub async fn run<P: LlmProvider>(
    &self,
    agent: &AgentRunner<P>,
    cases: &[EvalCase],
) -> Vec<EvalResult>
Run all eval cases against an agent, returning results.
Each case runs the agent independently (fresh execution per case). When
an event collector is attached via EvalRunner::with_event_collector,
it is cleared before each case so event-aware scorers see only the
events generated by that case.
Limitation: This method cannot capture tool call trajectory data
because the agent’s OnEvent callback is set at build time. For
trajectory scoring, build the agent with build_eval_agent and use
score_result with the collected events.
pub fn score_result(
    &self,
    case: &EvalCase,
    output: &str,
    tool_calls: &[String],
    error: Option<String>,
) -> EvalResult
Score a case result with pre-collected tool calls.
Use this when you have tool call data from an external source
(e.g., OnEvent callback, audit trail, or manual testing).
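As a rough illustration of what scoring pre-collected data can look like, the sketch below checks an output string and a tool call list against expectations. It is not the crate's scoring logic: `Expectation` and `passes` are hypothetical names, and the pass rule (no error, output contains the expected text, expected tools appear in order) is an assumption chosen for the example. The real scorers may apply different rules.

```rust
/// Hypothetical stand-in for the expectations attached to an eval case.
struct Expectation<'a> {
    output_contains: &'a str,
    tools_in_order: &'a [&'a str],
}

/// Hypothetical scoring rule: pass only if there was no error, the output
/// contains the expected text, and the expected tools appear in order
/// (other tool calls may be interleaved between them).
fn passes(exp: &Expectation, output: &str, tool_calls: &[String], error: Option<&str>) -> bool {
    if error.is_some() || !output.contains(exp.output_contains) {
        return false;
    }
    // Subsequence check: the iterator is shared across expectations, so each
    // expected tool must be found after the previous match.
    let mut calls = tool_calls.iter();
    exp.tools_in_order
        .iter()
        .all(|want| calls.any(|got| got.as_str() == *want))
}
```

Under this rule, a trajectory of `search`, `fetch`, `summarize` satisfies an expectation of `search` followed by `summarize`, while a trajectory containing only `search` does not.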
pub fn event_collector() -> EventCollector
Create an event collector for capturing the tool call trajectory.
Wire it into AgentRunnerBuilder::on_event() (via EvalRunner::event_callback) before building the agent.
After execution, pass the collector to EvalRunner::collected_tool_calls to extract the tool call names.
pub fn event_callback(
    collector: &EventCollector,
) -> Arc<dyn Fn(AgentEvent) + Send + Sync>
Build an OnEvent callback that pushes events into the collector.
pub fn collected_tool_calls(collector: &EventCollector) -> Vec<String>
Extract tool call names from a collected event vec.
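The collector, callback, and extraction helpers fit together roughly as in the following self-contained sketch. It mirrors the shape of the three signatures above but does not use heartbit_core: `Event`, `Collector`, `new_collector`, `callback_for`, and `tool_call_names` are hypothetical stand-ins, and the shared-vec representation is an assumption about how such a collector could be built.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for AgentEvent; the real enum has its own variants.
#[derive(Clone, Debug, PartialEq)]
enum Event {
    ToolCall { name: String },
    Text(String),
}

/// Assumption: a collector is a shared, lock-protected vec of events.
type Collector = Arc<Mutex<Vec<Event>>>;

/// Create an empty collector.
fn new_collector() -> Collector {
    Arc::new(Mutex::new(Vec::new()))
}

/// Build a callback that pushes every event it receives into the collector.
fn callback_for(collector: &Collector) -> Arc<dyn Fn(Event) + Send + Sync> {
    let sink = Arc::clone(collector);
    Arc::new(move |event: Event| sink.lock().unwrap().push(event))
}

/// Extract tool call names in the order they were recorded,
/// skipping every other kind of event.
fn tool_call_names(collector: &Collector) -> Vec<String> {
    collector
        .lock()
        .unwrap()
        .iter()
        .filter_map(|event| match event {
            Event::ToolCall { name } => Some(name.clone()),
            _ => None,
        })
        .collect()
}
```

Because the callback holds its own `Arc` clone, the caller keeps the original collector handle and can read from it after the agent has finished running.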