# Agent Eval System
Spec for measuring whether the PoE2 build agent uses its tools well — both individually and as a set.
## Problem
We keep adding query tools (9 today, 13+ planned) with no systematic way to answer:
- Does the agent call the **right** tools for a given question?
- Does it call **unnecessary** tools (wasting latency and tokens)?
- Does it **miss** tools it should have called?
- How many LLM round-trips does it take (each costs 1-3s)?
- How many tokens flow through the context window per question?
- Does adding a new tool degrade selection accuracy for existing questions?
The existing `profiling.rs` measures wall-clock time for a single hardcoded question. It tells us PoB isn't the bottleneck but says nothing about agent *behavior*.
## What we measure
### Per-question trace
Every eval run captures a **Trace** for each question:

| Field | Source | What it records |
|---|---|---|
| `tool_calls: Vec<(name, args, order)>` | AgentEvent stream | Which tools were called and in what sequence |
| `tool_rounds` | Count of loop iterations | How many LLM↔tool round-trips occurred |
| `tool_response_sizes: Vec<(name, bytes)>` | Tool execution output | How fat each tool response is in the context |
| `prompt_tokens` | OpenAI `usage` field | Total input tokens across all LLM calls |
| `completion_tokens` | OpenAI `usage` field | Total output tokens across all LLM calls |
| `total_tokens` | Sum | Full token cost of the question |
| `time_to_first_token` | Wall clock | Latency before the user sees anything |
| `total_time` | Wall clock | End-to-end latency |
| `final_answer` | Collected token stream | The text the user actually sees |
### Aggregated across the eval suite
| Metric | Definition |
|---|---|
| **Tool selection accuracy** | % of cases where called tools ⊇ expected AND called tools ∩ banned = ∅ |
| **Tool efficiency** | % of cases within max_tool_rounds budget |
| **Answer correctness** | % of cases where all required facts appear in the answer |
| **Avg tokens per question** | Mean total_tokens across all cases |
| **Avg latency** | Mean total_time across all cases |
| **Unnecessary call rate** | Mean count of tools called that weren't in expected set |
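The aggregate definitions above translate fairly directly into code. The following is a minimal sketch rather than the planned implementation: `CaseOutcome`, `Aggregate`, `fraction`, and `aggregate` are hypothetical names, `CaseOutcome` is a trimmed-down record holding only what the aggregates need, and rates are returned as fractions in `0.0..=1.0` instead of percentages.

```rust
/// Hypothetical per-case summary holding only what the aggregates need.
pub struct CaseOutcome {
    pub missed_expected: usize, // expected tools that were never called
    pub banned_called: usize,   // banned tools that were called
    pub extra_called: usize,    // tools called that were not in the expected set
    pub within_round_budget: bool,
    pub all_facts_found: bool,
    pub total_tokens: u64,
    pub total_time_secs: f64,
}

#[derive(Debug, PartialEq)]
pub struct Aggregate {
    pub tool_selection_accuracy: f64,
    pub tool_efficiency: f64,
    pub answer_correctness: f64,
    pub avg_tokens: f64,
    pub avg_latency_secs: f64,
    pub unnecessary_call_rate: f64,
}

/// Fraction of cases for which `pred` holds. Callers must pass a non-empty slice.
fn fraction(cases: &[CaseOutcome], pred: impl Fn(&CaseOutcome) -> bool) -> f64 {
    cases.iter().filter(|c| pred(*c)).count() as f64 / cases.len() as f64
}

/// Returns `None` for an empty suite so nothing divides by zero.
pub fn aggregate(cases: &[CaseOutcome]) -> Option<Aggregate> {
    if cases.is_empty() {
        return None;
    }
    let n = cases.len() as f64;
    Some(Aggregate {
        // called ⊇ expected AND called ∩ banned = ∅
        tool_selection_accuracy: fraction(cases, |c| {
            c.missed_expected == 0 && c.banned_called == 0
        }),
        tool_efficiency: fraction(cases, |c| c.within_round_budget),
        answer_correctness: fraction(cases, |c| c.all_facts_found),
        avg_tokens: cases.iter().map(|c| c.total_tokens).sum::<u64>() as f64 / n,
        avg_latency_secs: cases.iter().map(|c| c.total_time_secs).sum::<f64>() / n,
        unnecessary_call_rate: cases.iter().map(|c| c.extra_called).sum::<usize>() as f64 / n,
    })
}
```

Keeping the aggregation a pure function over per-case records means it can be unit-tested without an API key.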
## Architecture
Three layers, all in the test crate. The only library-side change is additive: the `AgentEvent` variants `ToolResult` and `Usage` described under Trace capture.
```
┌─────────────────────────────────┐
│  Eval Runner                    │  tests/eval.rs
│  Iterates cases, collects       │
│  traces, scores, reports        │
├─────────────────────────────────┤
│  Trace Capture                  │  Wraps ToolAgent::respond()
│  Consumes AgentEvent stream,    │  stream, measures everything
│  records tool calls, tokens,    │
│  timing, answer text            │
├─────────────────────────────────┤
│  Eval Cases                     │  tests/eval_cases.rs or inline
│  Question + expected behavior   │
│  + scoring thresholds           │
└─────────────────────────────────┘
```
### Trace capture
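The trace capture layer consumes the `AgentEvent` stream from `ToolAgent::respond()` and records everything. This works without modifying the agent internals, because we already get `ToolCall { name }` and `Token(text)` events. Note that `ToolCall` carries only the tool name, so the `args` part of `tool_calls` cannot be filled from the event stream unless the variant gains an arguments field.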
For token counts, `create_response()` returns `Usage` in `ApiResponse.usage`. For `create_response_stream()`, usage arrives in the `response.completed` SSE event and is stored in an `Arc<OnceLock<Usage>>` companion that the caller reads after the stream completes. The agent loop accumulates usage per question and exposes it via `AgentEvent::Usage(Usage)` emitted at the end of each response. This keeps usage flow explicit in the type system — no hidden mutable state, no reset-between-questions footgun.
For tool response sizes, the trace layer can measure the JSON output of each tool call. This requires a small hook — either:
- Expose `execute_tool` results to the trace layer (needs a richer `AgentEvent` variant like `ToolResult { name, size_bytes }`), or
- Measure tool responses independently by running the same queries through `PobParser` (wasteful), or
- Accept that we know tool response sizes from `profiling.rs` already and skip per-eval-run measurement.
**Recommendation**: Add `AgentEvent::ToolResult { name, size_bytes }` as a new event variant. Callers that match on `AgentEvent` with a wildcard arm are unaffected; an exhaustive `match` without one stops compiling (a hard error, not a warning) until it handles the new variant. Together with `AgentEvent::Usage`, this is the only library-side change.
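A sketch of how the trace layer might fold events into a trace is shown below. It rests on several assumptions: `Usage` and `AgentEvent` are local stand-ins that mirror the types named in this spec so the block compiles on its own, token counts are assumed to be `u64`, the real stream is presumably async while this sketch takes a plain iterator, and `Trace` and `capture` are hypothetical names. Timing and `tool_rounds` are left out because the variants listed in this spec carry no timestamps or round boundaries.

```rust
/// Stand-in for the library's usage type (field types are an assumption).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub total_tokens: u64,
}

/// Stand-in mirroring the event variants named in this spec.
pub enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
    Usage(Usage),
}

/// Hypothetical, reduced trace: no timing, no round count.
#[derive(Debug, Default)]
pub struct Trace {
    pub tool_calls: Vec<String>, // call order is the vector order
    pub tool_response_sizes: Vec<(String, usize)>,
    pub usage: Usage,
    pub final_answer: String,
}

/// Folds an event sequence into a `Trace`.
pub fn capture(events: impl IntoIterator<Item = AgentEvent>) -> Trace {
    let mut trace = Trace::default();
    for event in events {
        match event {
            AgentEvent::ToolCall { name } => trace.tool_calls.push(name),
            AgentEvent::ToolResult { name, size_bytes } => {
                trace.tool_response_sizes.push((name, size_bytes));
            }
            AgentEvent::Token(text) => trace.final_answer.push_str(&text),
            // Per the spec, usage arrives already accumulated at the end,
            // so the last value seen is kept rather than summed.
            AgentEvent::Usage(usage) => trace.usage = usage,
        }
    }
    trace
}
```

The `match` is deliberately exhaustive, so a future `AgentEvent` variant forces a decision about how the trace should record it.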
### Eval case format
```rust
struct EvalCase {
    /// Human-readable name for the report.
    name: &'static str,

    /// The question to ask the agent.
    question: &'static str,

    /// Tools that SHOULD be called. The agent may call them in any order.
    /// Scoring: each missing expected tool is a penalty.
    expected_tools: &'static [&'static str],

    /// Tools that MUST NOT be called. Each banned tool called is a penalty.
    banned_tools: &'static [&'static str],

    /// Maximum acceptable tool-calling rounds (LLM round-trips, not
    /// individual tool calls — one round can batch multiple calls).
    max_tool_rounds: usize,

    /// Substrings or patterns that MUST appear in the final answer.
    /// Case-insensitive matching. Use for factual assertions.
    answer_must_contain: &'static [&'static str],

    /// Maximum total tokens (prompt + completion) across all LLM calls.
    /// Acts as a budget ceiling — exceeding this is a warning, not a failure.
    /// Set generously at first, tighten as you learn typical costs.
    max_total_tokens: Option<usize>,
}
```
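As an illustration, the `basic_dps` case from the initial suite could be written as a constant of this type. The struct is repeated (without doc comments) so the block stands alone; `BASIC_DPS`, `CASES`, and the `conflicting_tools` sanity check are hypothetical additions, not part of the spec.

```rust
pub struct EvalCase {
    pub name: &'static str,
    pub question: &'static str,
    pub expected_tools: &'static [&'static str],
    pub banned_tools: &'static [&'static str],
    pub max_tool_rounds: usize,
    pub answer_must_contain: &'static [&'static str],
    pub max_total_tokens: Option<usize>,
}

pub const BASIC_DPS: EvalCase = EvalCase {
    name: "basic_dps",
    question: "What is my total DPS and what main skill am I using?",
    expected_tools: &["get_build_stats", "get_skill_list"],
    banned_tools: &["get_item", "get_passive_tree", "get_jewel", "query_passive_stats"],
    max_tool_rounds: 2,
    answer_must_contain: &["DPS"],
    max_total_tokens: None, // no token budget for this case
};

/// The suite is just a static slice, so adding a case is a one-line change.
pub const CASES: &[EvalCase] = &[BASIC_DPS];

/// Returns tools listed as both expected and banned, which would make
/// the case impossible to pass.
pub fn conflicting_tools(case: &EvalCase) -> Vec<&'static str> {
    case.expected_tools
        .iter()
        .copied()
        .filter(|t| case.banned_tools.contains(t))
        .collect()
}
```

Running a check like `conflicting_tools` over every case before any LLM call catches authoring mistakes without spending tokens.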
### Scoring
Each case produces a `CaseResult`:
```rust
use std::time::Duration;

struct CaseResult {
    name: String,

    // Tool selection
    expected_tools_called: Vec<String>, // ✓ expected and present
    expected_tools_missed: Vec<String>, // ✗ expected but absent
    banned_tools_called: Vec<String>,   // ✗ banned but called
    extra_tools_called: Vec<String>,    // ~ not expected, not banned

    // Efficiency
    tool_rounds: usize,
    rounds_over_budget: bool, // tool_rounds > max_tool_rounds

    // Correctness
    facts_found: Vec<String>,   // ✓ pattern found in answer
    facts_missing: Vec<String>, // ✗ pattern not found

    // Cost
    total_tokens: usize,
    tokens_over_budget: bool,

    // Timing
    total_time: Duration,
    time_to_first_token: Duration,
}
```
A case **passes** when:
- All expected tools were called
- No banned tools were called
- Tool rounds ≤ budget
- All answer facts found
A case **warns** when it meets every pass condition but:
- Extra (unexpected, unbanned) tools were called
- Token budget exceeded
A case **fails** when any pass condition is violated.
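The pass/warn/fail rules can be expressed as one small function. This is a sketch under assumptions: `Verdict`, `Expectation`, `Observed`, and `verdict` are hypothetical names, and the two input structs are reduced views of the case definition and the captured trace rather than the structs defined above.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Warn,
    Fail,
}

/// Hypothetical view of what a case demands.
pub struct Expectation<'a> {
    pub expected_tools: &'a [&'a str],
    pub banned_tools: &'a [&'a str],
    pub max_tool_rounds: usize,
    pub answer_must_contain: &'a [&'a str],
    pub max_total_tokens: Option<usize>,
}

/// Hypothetical view of what the agent actually did.
pub struct Observed<'a> {
    pub tools_called: &'a [&'a str],
    pub tool_rounds: usize,
    pub total_tokens: usize,
    pub answer: &'a str,
}

pub fn verdict(exp: &Expectation, obs: &Observed) -> Verdict {
    let called: BTreeSet<&str> = obs.tools_called.iter().copied().collect();
    let answer = obs.answer.to_lowercase();

    // Any violated pass condition is a failure, regardless of warnings.
    let missed = exp.expected_tools.iter().any(|t| !called.contains(t));
    let banned = exp.banned_tools.iter().any(|t| called.contains(t));
    let over_rounds = obs.tool_rounds > exp.max_tool_rounds;
    let fact_missing = exp
        .answer_must_contain
        .iter()
        .any(|f| !answer.contains(&f.to_lowercase()));
    if missed || banned || over_rounds || fact_missing {
        return Verdict::Fail;
    }

    // Warnings apply only to cases that otherwise pass.
    let extra = called.iter().any(|t| !exp.expected_tools.contains(t));
    let over_tokens = exp
        .max_total_tokens
        .is_some_and(|max| obs.total_tokens > max);
    if extra || over_tokens {
        Verdict::Warn
    } else {
        Verdict::Pass
    }
}
```

Checking failures first makes the precedence explicit: a case with an extra tool call and a blown round budget is reported as a failure, not a warning.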
## Required code changes
### 1. Token usage tracking (`src/llm.rs`) — done
Usage is already tracked:
- `create_response()` returns `ApiResponse` with `usage: Option<Usage>` containing `input_tokens`, `output_tokens`, `total_tokens`
- `create_response_stream()` returns an `Arc<OnceLock<Usage>>` companion that gets populated from the `response.completed` SSE event
### 2. Tool result size tracking (`src/agent.rs`) — done
`AgentEvent` already has all required variants:
```rust
pub enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
    Usage(Usage),
}
```
The agent loop accumulates `Usage` from each `create_response()` call and from the `create_response_stream()` companion `OnceLock`. After the stream completes, it yields the total via `AgentEvent::Usage(cumulative_usage)`.
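The accumulation described here might look roughly like the following. It is an illustrative sketch, not the code in `src/agent.rs`: the `Usage` struct is a local stand-in with assumed `u64` fields, and `cumulative_usage` is a hypothetical helper that takes the per-call results as a slice instead of living inside the agent loop.

```rust
use std::sync::{Arc, OnceLock};

/// Stand-in for the library's usage type (field types are an assumption).
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Usage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub total_tokens: u64,
}

impl Usage {
    /// Adds another call's usage into this running total.
    pub fn add(&mut self, other: Usage) {
        self.input_tokens += other.input_tokens;
        self.output_tokens += other.output_tokens;
        self.total_tokens += other.total_tokens;
    }
}

/// Sums the usage reported by each non-streaming call, plus the streaming
/// companion if it has been populated. Calls that reported no usage
/// (`None`) contribute nothing.
pub fn cumulative_usage(
    per_call: &[Option<Usage>],
    stream_companion: &Arc<OnceLock<Usage>>,
) -> Usage {
    let mut total = Usage::default();
    for usage in per_call.iter().flatten() {
        total.add(*usage);
    }
    // `get` returns None until the stream's final event has been processed,
    // so this must be read only after the stream completes.
    if let Some(usage) = stream_companion.get() {
        total.add(*usage);
    }
    total
}
```

Because `OnceLock` can be set only once, a companion created per stream cannot leak counts from one question into the next.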
### 3. Eval test file (`tests/eval.rs`) — new file
This is the bulk of the work. Contains:
- The `EvalCase` and `CaseResult` structs
- The trace capture function
- The eval case definitions
- The scoring logic
- The report printer
## Initial eval suite
These cases cover the current 9 tools across different question types. The expected/banned tools encode our understanding of *correct* agent behavior — this is the opinionated part.
### Category 1: Simple stat lookups
```
Case: "basic_dps"
  Q: "What is my total DPS and what main skill am I using?"
  Expected: [get_build_stats, get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: ["DPS"]

Case: "defensive_stats"
  Q: "How tanky is this build? What are my defenses?"
  Expected: [get_build_stats]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: ["life" or "energy shield", "armour" or "evasion"]

Case: "skill_gems"
  Q: "What support gems are linked to my main skill?"
  Expected: [get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: [] // hard to assert specific gem names without knowing fixture
```
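The `defensive_stats` case uses "or" alternatives, which a flat `answer_must_contain` list of required substrings cannot express as written. One possible encoding, shown here purely as a sketch, treats each required fact as a group of alternatives and considers the fact found when any alternative appears. `FactGroups` and `missing_fact_groups` are hypothetical names, not part of the `EvalCase` format above.

```rust
/// Each inner slice is one required fact; any alternative in it satisfies the fact.
pub type FactGroups<'a> = &'a [&'a [&'a str]];

/// Returns the index of every fact group for which no alternative occurs
/// in the answer. Matching is case-insensitive substring search.
pub fn missing_fact_groups(answer: &str, groups: FactGroups<'_>) -> Vec<usize> {
    let haystack = answer.to_lowercase();
    groups
        .iter()
        .enumerate()
        .filter(|(_, alts)| {
            !alts
                .iter()
                .any(|alt| haystack.contains(&alt.to_lowercase()))
        })
        .map(|(index, _)| index)
        .collect()
}
```

With this shape, `defensive_stats` would carry the groups `["life", "energy shield"]` and `["armour", "evasion"]`, and a plain required substring becomes a group with a single entry.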
### Category 2: Gear inspection
```
Case: "specific_item"
  Q: "What weapon am I using?"
  Expected: [get_item]
  Banned: [get_passive_tree, get_jewel, query_passive_stats, get_unallocated_ascendancy]
  Max rounds: 2
  Answer contains: ["Weapon"]

Case: "missing_gear"
  Q: "Am I missing any gear? What slots are empty?"
  Expected: [get_empty_slots]
  Banned: [get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: []

Case: "gear_overview_then_detail"
  Q: "What's my worst piece of gear and how could I upgrade it?"
  Expected: [get_empty_slots] // should scan first
  Banned: [get_jewel, query_passive_stats]
  Max rounds: 4 // may need multiple get_item calls
  Answer contains: ["upgrade"]
```
### Category 3: Passive tree
```
Case: "keystones"
  Q: "What keystones am I using?"
  Expected: [get_passive_tree]
  Banned: [get_item, get_config, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "jewel_inspection"
  Q: "What jewels do I have socketed in my passive tree?"
  Expected: [get_passive_tree, get_jewel]
  Banned: [get_item, get_config, query_passive_stats]
  Max rounds: 3 // tree first, then jewel calls
  Answer contains: []

Case: "stat_sourcing"
  Q: "How much fire damage am I getting from the passive tree, and is there more nearby?"
  Expected: [query_passive_stats]
  Banned: [get_item, get_jewel, get_config]
  Max rounds: 2
  Answer contains: ["fire damage"]
```
### Category 4: Ascendancy
```
Case: "ascendancy_recommendation"
  Q: "What ascendancy nodes should I take next?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "ascendancy_current"
  Q: "What ascendancy am I playing and what nodes do I have?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_config]
  Max rounds: 2
  Answer contains: []
```
### Category 5: Config and setup
```
Case: "build_config"
  Q: "What enemy level is my build configured for?"
  Expected: [get_config]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```
### Category 6: Multi-tool complex questions
```
Case: "full_build_review"
  Q: "Give me a quick overview of this build — what's working and what needs improvement?"
  Expected: [get_build_stats]
  Banned: [] // agent has latitude here
  Max rounds: 5
  Answer contains: []
  Max tokens: 8000

Case: "upgrade_priorities"
  Q: "What are the top 3 things I should upgrade on this build?"
  Expected: [get_build_stats]
  Banned: []
  Max rounds: 5
  Answer contains: []
  Max tokens: 10000
```
### Category 7: Edge cases
```
Case: "no_tools_needed"
  Q: "What is Path of Exile 2?"
  Expected: [] // should answer from knowledge, no tools
  Banned: [] // calling tools isn't wrong, just wasteful
  Max rounds: 1
  Answer contains: ["Path of Exile"]

Case: "ambiguous_item_slot"
  Q: "Show me my ring"
  Expected: [get_item] // should pick a ring slot
  Banned: [get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```
## Report format
The eval prints a scorecard to stderr (like `profiling.rs`):
```
=== Agent Eval Report (model: gpt-4.1-nano) ===
Fixture: ranger-with-gear.xml
Cases: 15

Passed: 12 (80%)
Warned: 2
Failed: 1

--- Per-case results ---
✓ basic_dps 2 rounds 1847 tok 3.2s
    tools: get_build_stats, get_skill_list
✓ defensive_stats 1 round 1203 tok 2.1s
    tools: get_build_stats
✗ specific_item 3 rounds 2891 tok 5.4s
    tools: get_build_stats(!), get_item
    WARN: unexpected tool get_build_stats (not banned, but extra)
    FAIL: 3 rounds > max 2
~ gear_overview_then_detail 4 rounds 4102 tok 7.8s
    tools: get_empty_slots, get_item, get_item, get_item
    WARN: token budget exceeded (4102 > 4000)

--- Aggregate ---
Tool selection accuracy: 87% (13/15 cases all expected tools called)
No-banned-tool rate: 100% (0/15 cases called a banned tool)
Efficiency rate: 93% (14/15 within round budget)
Answer correctness: 80% (12/15 all required facts present)
Avg total tokens: 2340
Avg latency: 3.8s
Unnecessary call rate: 0.3 tools/question
```
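The per-case header line could be produced by a small formatter such as the one below. This is a sketch: `Mark` and `case_line` are hypothetical names, fields are separated by single spaces with no column alignment, and the caller is assumed to pass the result to `eprintln!` so it lands on stderr.

```rust
use std::time::Duration;

/// Outcome marker shown at the start of each per-case line.
pub enum Mark {
    Pass,
    Warn,
    Fail,
}

/// Formats one per-case header line, e.g. for `eprintln!`.
pub fn case_line(mark: Mark, name: &str, rounds: usize, tokens: u64, time: Duration) -> String {
    let symbol = match mark {
        Mark::Pass => '✓',
        Mark::Warn => '~',
        Mark::Fail => '✗',
    };
    // The sample report uses the singular for exactly one round.
    let unit = if rounds == 1 { "round" } else { "rounds" };
    format!(
        "{symbol} {name} {rounds} {unit} {tokens} tok {:.1}s",
        time.as_secs_f64()
    )
}
```

Returning a `String` instead of printing directly keeps the formatter testable and leaves the choice of output stream to the runner.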
## Running the eval
```bash
# Full suite (requires API key, costs real money)
OPENAI_API_KEY=sk-... cargo test --test eval -- --nocapture
# Specific case
cargo test --test eval basic_dps -- --nocapture
# With a different model (compare behavior across models)
OPENAI_MODEL=gpt-5-mini cargo test --test eval -- --nocapture
# Quick subset for CI (pick 3-4 fast representative cases)
cargo test --test eval -- --nocapture quick
```
Expect the full suite to take 30-90 seconds and cost ~$0.01-0.05 depending on model.
## Gating new tools
Before adding a new tool to the agent:
1. **Run the eval suite** → record baseline scorecard
2. **Add the tool** (definition + execute_tool dispatch + system prompt guidance)
3. **Add 2-3 eval cases** for the new tool's intended use cases
4. **Run the eval suite again** → compare:
   - Did any *existing* cases regress? (Tool selection accuracy dropped, extra tools appearing)
   - Do the new cases pass? (Agent actually uses the tool correctly)
   - Did aggregate token cost increase significantly?
If existing cases regress, the tool description or system prompt needs work before merging. This gives you a principled answer to "should I add this tool?" instead of vibes.
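The baseline-versus-candidate comparison in step 4 can be automated with very little code. The sketch below assumes a scorecard reduced to pass/fail per case name; `Scorecard` and `regressions` are hypothetical, and a real comparison would likely also diff token cost and extra tool calls.

```rust
use std::collections::BTreeMap;

/// Pass/fail per case name from one eval run (hypothetical shape).
pub type Scorecard = BTreeMap<String, bool>;

/// Names of cases that passed in `baseline` but do not pass in `candidate`.
/// A case missing from `candidate` counts as a regression, since a silently
/// dropped case should not look like a clean run. Cases that exist only in
/// `candidate` (the new tool's own cases) are ignored here.
pub fn regressions(baseline: &Scorecard, candidate: &Scorecard) -> Vec<String> {
    baseline
        .iter()
        .filter(|(name, passed)| **passed && candidate.get(*name) != Some(&true))
        .map(|(name, _)| name.clone())
        .collect()
}
```

An empty result means no previously passing case broke; the new cases still need to be checked separately.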
## Future extensions
Things explicitly out of scope for v1 but worth considering later:
- **Multiple fixtures**: Run the same eval cases against different builds (melee, caster, minion) to test generalization. The fixture is just a parameter.
- **LLM-as-judge for answer quality**: Instead of substring matching, use a cheap model to score "did the answer correctly use the data from the tools?" This catches cases where the agent calls the right tools but misinterprets the results.
- **Regression tracking**: Save eval results to a JSON file per run, compare across runs automatically. Plot trends over time.
- **Conversation eval**: Multi-turn cases where the second question builds on the first. Tests whether the agent avoids redundant tool calls when it already has data.
- **A/B on system prompt changes**: Run eval before/after a prompt edit to quantify impact.