//! Skill / agent evaluation primitives (feature `skills`).
//!
//! While [`crate::harness::RetrievalHarness`] scores a vector store against
//! a labelled query set, the [`SkillHarness`] in this module scores an
//! *agent* (or a single skill loaded into one) against a labelled task set.
//!
//! The shape follows the vocabulary established by Anthropic's
//! [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
//! and OpenAI's
//! [Testing Agent Skills Systematically with Evals](https://developers.openai.com/blog/eval-skills):
//!
//! - A [`SkillTask`] is one labelled prompt with a `should_trigger` flag and
//!   a set of grader ids to run against the resulting transcript.
//! - A [`Transcript`] is the captured output of a single trial: final text,
//!   tool calls, token usage, elapsed time, and an optional skill-selection
//!   marker.
//! - A [`Grader`] is a deterministic check over a [`Transcript`]. Concrete
//!   graders ship in this module (see [`ContainsGrader`], [`ToolCallGrader`],
//!   [`TranscriptBudget`], [`TriggerGrader`]).
//! - An [`AgentRunner`] is user-supplied: it owns whatever agent / harness
//!   you want to evaluate, and returns one [`Transcript`] per `(task, trial)`.
//! - [`SkillHarness`] drives the matrix `tasks × trials`, applies every
//!   grader, and aggregates results into a [`SkillEvalReport`] that reuses
//!   the existing [`MetricReport`](crate::report::MetricReport) and
//!   [`ReliabilityReport`](crate::report::ReliabilityReport) infrastructure.
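//!
//! The flow described above can be illustrated with a minimal, self-contained
//! sketch. Every type and function in it (`Runner`, `Contains`, `CalledTool`,
//! `EchoRunner`, `pass_rate`, and the two-field `Transcript`) is a
//! hypothetical stand-in written for this example; none of them are this
//! crate's actual signatures, and the real report types are replaced by a
//! single pass-rate number.
//!
//! ```rust
//! // Stand-in types for illustration only; the crate's real definitions differ.
//!
//! /// Captured output of one trial (simplified to two fields).
//! struct Transcript {
//!     final_text: String,
//!     tool_calls: Vec<String>,
//! }
//!
//! /// A deterministic check over a transcript.
//! trait Grader {
//!     fn passes(&self, transcript: &Transcript) -> bool;
//! }
//!
//! /// Passes when the final text contains `needle`.
//! struct Contains {
//!     needle: &'static str,
//! }
//!
//! impl Grader for Contains {
//!     fn passes(&self, transcript: &Transcript) -> bool {
//!         transcript.final_text.contains(self.needle)
//!     }
//! }
//!
//! /// Passes when a tool with the given name was called at least once.
//! struct CalledTool {
//!     name: &'static str,
//! }
//!
//! impl Grader for CalledTool {
//!     fn passes(&self, transcript: &Transcript) -> bool {
//!         transcript.tool_calls.iter().any(|c| c.as_str() == self.name)
//!     }
//! }
//!
//! /// User-supplied: produces one transcript per `(task, trial)`.
//! trait Runner {
//!     fn run(&self, prompt: &str, trial: usize) -> Transcript;
//! }
//!
//! /// Toy runner that echoes the prompt and records one tool call.
//! struct EchoRunner;
//!
//! impl Runner for EchoRunner {
//!     fn run(&self, prompt: &str, _trial: usize) -> Transcript {
//!         Transcript {
//!             final_text: format!("echo: {prompt}"),
//!             tool_calls: vec!["search".to_string()],
//!         }
//!     }
//! }
//!
//! /// Drives `tasks × trials` and returns the fraction of trials in which
//! /// every grader passed (0.0 when there are no trials at all).
//! fn pass_rate(
//!     runner: &dyn Runner,
//!     prompts: &[&str],
//!     trials: usize,
//!     graders: &[Box<dyn Grader>],
//! ) -> f64 {
//!     let mut passed = 0usize;
//!     let mut total = 0usize;
//!     for prompt in prompts {
//!         for trial in 0..trials {
//!             let transcript = runner.run(prompt, trial);
//!             total += 1;
//!             if graders.iter().all(|g| g.passes(&transcript)) {
//!                 passed += 1;
//!             }
//!         }
//!     }
//!     if total == 0 { 0.0 } else { passed as f64 / total as f64 }
//! }
//!
//! fn main() {
//!     let graders: Vec<Box<dyn Grader>> = vec![
//!         Box::new(Contains { needle: "refund" }),
//!         Box::new(CalledTool { name: "search" }),
//!     ];
//!     let prompts = ["process a refund", "say hello"];
//!     let rate = pass_rate(&EchoRunner, &prompts, 3, &graders);
//!     println!("pass rate: {rate}");
//! }
//! ```
//!
//! In this sketch a trial counts as passed only when all graders agree, which
//! is one possible aggregation policy; per-grader scores are an equally
//! reasonable choice.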
//!
//! ## Scope
//!
//! Phase 1 is deterministic-only. LLM-rubric judging is intentionally out of
//! scope for this module. If you need it today, pair the existing
//! [`crate::ragas`] judges with a custom [`Grader`] impl.
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use AgentRunner;
pub use ;
pub use ;