poe2-agent 0.5.0

# Agent Eval System

Spec for measuring whether the PoE2 build agent uses its tools well — both individually and as a set.

## Problem

We keep adding query tools (9 today, 13+ planned) with no systematic way to answer:

- Does the agent call the **right** tools for a given question?
- Does it call **unnecessary** tools (wasting latency and tokens)?
- Does it **miss** tools it should have called?
- How many LLM round-trips does it take (each costs 1-3s)?
- How many tokens flow through the context window per question?
- Does adding a new tool degrade selection accuracy for existing questions?

The existing `profiling.rs` measures wall-clock time for a single hardcoded question. It tells us PoB isn't the bottleneck but says nothing about agent *behavior*.

## What we measure

### Per-question trace

Every eval run captures a **Trace** for each question:

| Field | Source | What it tells us |
|---|---|---|
| `tool_calls: Vec<(name, args, order)>` | AgentEvent stream (`ToolCall` carries only `name` today; recording args needs an extra field) | Which tools were called and in what sequence |
| `tool_rounds` | Count of loop iterations | How many LLM↔tool round-trips occurred |
| `tool_response_sizes: Vec<(name, bytes)>` | Tool execution output | How fat each tool response is in the context |
| `prompt_tokens` | `usage.input_tokens` from the OpenAI response | Total input tokens across all LLM calls |
| `completion_tokens` | `usage.output_tokens` from the OpenAI response | Total output tokens across all LLM calls |
| `total_tokens` | Sum | Full token cost of the question |
| `time_to_first_token` | Wall clock | Latency before the user sees anything |
| `total_time` | Wall clock | End-to-end latency |
| `final_answer` | Collected token stream | The text the user actually sees |

### Aggregated across the eval suite

| Metric | Definition |
|---|---|
| **Tool selection accuracy** | % of cases where called tools ⊇ expected AND called tools ∩ banned = ∅ |
| **Tool efficiency** | % of cases within max_tool_rounds budget |
| **Answer correctness** | % of cases where all required facts appear in the answer |
| **Avg tokens per question** | Mean total_tokens across all cases |
| **Avg latency** | Mean total_time across all cases |
| **Unnecessary call rate** | Mean number of called tools per case that were not in the expected set |
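
A minimal sketch of how these aggregates could be computed from per-case outcomes. The `Outcome` and `Aggregate` structs and their field names are assumptions made for this example, not the final `CaseResult` shape; latency is left out because it follows the same mean pattern as tokens.

```rust
/// Per-case outcome, reduced to what the aggregates need (hypothetical shape).
struct Outcome {
    missed_expected: usize,   // expected tools that were never called
    banned_called: usize,     // banned tools that were called
    unexpected_called: usize, // called tools not in the expected set
    within_round_budget: bool,
    all_facts_found: bool,
    total_tokens: u64,
}

struct Aggregate {
    tool_selection_accuracy: f64, // percent of cases
    tool_efficiency: f64,         // percent of cases
    answer_correctness: f64,      // percent of cases
    avg_tokens: f64,
    unnecessary_call_rate: f64,   // tools per question
}

fn aggregate(outcomes: &[Outcome]) -> Aggregate {
    let n = outcomes.len();
    // Guard against an empty suite so we never divide by zero.
    let pct = |hits: usize| if n == 0 { 0.0 } else { 100.0 * hits as f64 / n as f64 };
    let mean = |sum: f64| if n == 0 { 0.0 } else { sum / n as f64 };

    let selected = outcomes
        .iter()
        .filter(|o| o.missed_expected == 0 && o.banned_called == 0)
        .count();
    let efficient = outcomes.iter().filter(|o| o.within_round_budget).count();
    let correct = outcomes.iter().filter(|o| o.all_facts_found).count();
    let tokens: u64 = outcomes.iter().map(|o| o.total_tokens).sum();
    let unexpected: usize = outcomes.iter().map(|o| o.unexpected_called).sum();

    Aggregate {
        tool_selection_accuracy: pct(selected),
        tool_efficiency: pct(efficient),
        answer_correctness: pct(correct),
        avg_tokens: mean(tokens as f64),
        unnecessary_call_rate: mean(unexpected as f64),
    }
}
```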

## Architecture

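Three layers, all in the test crate. The only library-side additions are the `AgentEvent::ToolResult` and `AgentEvent::Usage` variants described below; the rest of the public API is untouched.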

```
┌─────────────────────────────────┐
│         Eval Runner             │  tests/eval.rs
│  Iterates cases, collects       │
│  traces, scores, reports        │
├─────────────────────────────────┤
│         Trace Capture           │  Wraps ToolAgent::respond()
│  Consumes AgentEvent stream,    │  stream, measures everything
│  records tool calls, tokens,    │
│  timing, answer text            │
├─────────────────────────────────┤
│         Eval Cases              │  tests/eval_cases.rs or inline
│  Question + expected behavior   │
│  + scoring thresholds           │
└─────────────────────────────────┘
```

### Trace capture

The trace capture layer consumes the `AgentEvent` stream from `ToolAgent::respond()` and records everything. This works without modifying the agent internals — we already get `ToolCall { name }` and `Token(text)` events.

For token counts, `create_response()` returns `Usage` in `ApiResponse.usage`. For `create_response_stream()`, usage arrives in the `response.completed` SSE event and is stored in an `Arc<OnceLock<Usage>>` companion that the caller reads after the stream completes. The agent loop accumulates usage per question and exposes it via `AgentEvent::Usage(Usage)` emitted at the end of each response. This keeps usage flow explicit in the type system — no hidden mutable state, no reset-between-questions footgun.
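
The companion-`OnceLock` pattern described above can be illustrated with a standard-library-only sketch. `FakeStream` and `fake_response_stream` are hypothetical stand-ins for the real SSE stream, and the `Usage` field types are assumed; only the `Arc<OnceLock<Usage>>` hand-off mirrors the design.

```rust
use std::sync::{Arc, OnceLock};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    total_tokens: u64,
}

/// Stand-in for the streaming response: yields text chunks, then publishes usage.
struct FakeStream {
    chunks: std::vec::IntoIter<&'static str>,
    final_usage: Usage,
    usage_slot: Arc<OnceLock<Usage>>,
}

impl Iterator for FakeStream {
    type Item = &'static str;

    fn next(&mut self) -> Option<Self::Item> {
        match self.chunks.next() {
            Some(chunk) => Some(chunk),
            None => {
                // Plays the role of the `response.completed` event.
                // `set` only succeeds once; later attempts are ignored.
                let _ = self.usage_slot.set(self.final_usage);
                None
            }
        }
    }
}

/// Returns the stream plus the companion the caller reads after draining it.
fn fake_response_stream() -> (FakeStream, Arc<OnceLock<Usage>>) {
    let slot = Arc::new(OnceLock::new());
    let stream = FakeStream {
        chunks: vec!["Your ", "DPS ", "is..."].into_iter(),
        final_usage: Usage { input_tokens: 900, output_tokens: 40, total_tokens: 940 },
        usage_slot: Arc::clone(&slot),
    };
    (stream, slot)
}
```

The slot is empty until the stream is fully drained, so a caller that reads it early sees `None` rather than stale usage from a previous question.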

For tool response sizes, the trace layer can measure the JSON output of each tool call. This requires a small hook — either:
- Expose `execute_tool` results to the trace layer (needs a richer `AgentEvent` variant like `ToolResult { name, size_bytes }`), or
- Measure tool responses independently by running the same queries through `PobParser` (wasteful), or
- Accept that we know tool response sizes from `profiling.rs` already and skip per-eval-run measurement.

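**Recommendation**: Use the first option, an `AgentEvent::ToolResult { name, size_bytes }` event variant. Note that a new variant is only source-compatible for callers whose `match` on `AgentEvent` has a wildcard arm; an exhaustive `match` without one stops compiling until it handles the variant (marking the enum `#[non_exhaustive]` would make downstream crates keep a wildcard arm). Apart from usage reporting, this is the only library-side change.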
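
As a rough sketch of the capture step, the function below folds a sequence of events into a trace. It mirrors the `AgentEvent` variants locally so the example stands alone, assumes the `Usage` field types, and takes a plain iterator where the real stream is presumably async. Timing and round counting are omitted.

```rust
#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    total_tokens: u64,
}

/// Local mirror of the library enum, for illustration only.
enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
    Usage(Usage),
}

/// Hypothetical trace shape covering a subset of the per-question fields.
#[derive(Debug, Default)]
struct Trace {
    tool_calls: Vec<String>,                   // in call order
    tool_response_sizes: Vec<(String, usize)>, // (tool name, bytes)
    usage: Usage,
    final_answer: String,
}

fn capture(events: impl IntoIterator<Item = AgentEvent>) -> Trace {
    let mut trace = Trace::default();
    for event in events {
        match event {
            AgentEvent::ToolCall { name } => trace.tool_calls.push(name),
            AgentEvent::ToolResult { name, size_bytes } => {
                trace.tool_response_sizes.push((name, size_bytes));
            }
            // The final answer is the concatenation of all streamed tokens.
            AgentEvent::Token(text) => trace.final_answer.push_str(&text),
            // Emitted once at the end with the cumulative total.
            AgentEvent::Usage(usage) => trace.usage = usage,
        }
    }
    trace
}
```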

### Eval case format

```rust
struct EvalCase {
    /// Human-readable name for the report.
    name: &'static str,

    /// The question to ask the agent.
    question: &'static str,

    /// Tools that SHOULD be called. The agent may call them in any order.
    /// Scoring: each missing expected tool is a penalty.
    expected_tools: &'static [&'static str],

    /// Tools that MUST NOT be called. Each banned tool called is a penalty.
    banned_tools: &'static [&'static str],

    /// Maximum acceptable tool-calling rounds (LLM round-trips, not
    /// individual tool calls — one round can batch multiple calls).
    max_tool_rounds: usize,

    /// Substrings or patterns that MUST appear in the final answer.
    /// Case-insensitive matching. Use for factual assertions.
    answer_must_contain: &'static [&'static str],

    /// Maximum total tokens (prompt + completion) across all LLM calls.
    /// Acts as a budget ceiling — exceeding this is a warning, not a failure.
    /// Set generously at first, tighten as you learn typical costs.
    max_total_tokens: Option<usize>,
}
```
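
For illustration, here is how the `basic_dps` case from the initial suite might look as a constant of this type. The struct is repeated so the example compiles on its own; the `CASES` table name and the choice of `None` for the token budget are assumptions.

```rust
struct EvalCase {
    name: &'static str,
    question: &'static str,
    expected_tools: &'static [&'static str],
    banned_tools: &'static [&'static str],
    max_tool_rounds: usize,
    answer_must_contain: &'static [&'static str],
    max_total_tokens: Option<usize>,
}

const BASIC_DPS: EvalCase = EvalCase {
    name: "basic_dps",
    question: "What is my total DPS and what main skill am I using?",
    expected_tools: &["get_build_stats", "get_skill_list"],
    banned_tools: &["get_item", "get_passive_tree", "get_jewel", "query_passive_stats"],
    max_tool_rounds: 2,
    answer_must_contain: &["DPS"],
    // No token ceiling yet; set one after observing typical costs.
    max_total_tokens: None,
};

/// Hypothetical case table the runner would iterate over.
const CASES: &[EvalCase] = &[BASIC_DPS];
```

Keeping every field `'static` means the whole suite can live in constants, with no allocation or setup code before the runner starts.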

### Scoring

Each case produces a `CaseResult`:

```rust
struct CaseResult {
    name: String,
    // Tool selection
    expected_tools_called: Vec<String>,    // ✓ expected and present
    expected_tools_missed: Vec<String>,    // ✗ expected but absent
    banned_tools_called: Vec<String>,      // ✗ banned but called
    extra_tools_called: Vec<String>,       // ~ not expected, not banned

    // Efficiency
    tool_rounds: usize,
    rounds_over_budget: bool,              // tool_rounds > max_tool_rounds

    // Correctness
    facts_found: Vec<String>,             // ✓ pattern found in answer
    facts_missing: Vec<String>,           // ✗ pattern not found

    // Cost
    total_tokens: usize,
    tokens_over_budget: bool,

    // Timing
    total_time: Duration,
    time_to_first_token: Duration,
}
```

A case **passes** when:
- All expected tools were called
- No banned tools were called
- Tool rounds ≤ budget
- All answer facts found

A case **warns** when it meets every pass condition but:
- Extra (unexpected, unbanned) tools were called
- Token budget exceeded

A case **fails** when any pass condition is violated.
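
The pass/warn/fail rules above could be sketched as follows. `ToolScore`, `Verdict`, `score_tools`, and `verdict` are names invented for this example; the real scoring would fill a full `CaseResult`.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Warn,
    Fail,
}

struct ToolScore {
    missed: Vec<String>, // expected but never called
    banned: Vec<String>, // banned but called
    extra: Vec<String>,  // called, neither expected nor banned
}

fn score_tools<'a>(called: &[&'a str], expected: &[&'a str], banned: &[&'a str]) -> ToolScore {
    // Deduplicate: calling get_item three times counts as one tool.
    let called: BTreeSet<&'a str> = called.iter().copied().collect();
    let missed = expected
        .iter()
        .copied()
        .filter(|t| !called.contains(t))
        .map(String::from)
        .collect();
    let banned_hit = called
        .iter()
        .copied()
        .filter(|t| banned.contains(t))
        .map(String::from)
        .collect();
    let extra = called
        .iter()
        .copied()
        .filter(|t| !expected.contains(t) && !banned.contains(t))
        .map(String::from)
        .collect();
    ToolScore { missed, banned: banned_hit, extra }
}

fn verdict(
    score: &ToolScore,
    rounds_over_budget: bool,
    facts_missing: usize,
    tokens_over_budget: bool,
) -> Verdict {
    let failed = !score.missed.is_empty()
        || !score.banned.is_empty()
        || rounds_over_budget
        || facts_missing > 0;
    if failed {
        Verdict::Fail
    } else if !score.extra.is_empty() || tokens_over_budget {
        Verdict::Warn
    } else {
        Verdict::Pass
    }
}
```

Failure is checked first, so a case that both fails and has warnings is reported as a failure.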

## Required code changes

### 1. Token usage tracking (`src/llm.rs`) — done

Usage is already tracked:

- `create_response()` returns `ApiResponse` with `usage: Option<Usage>` containing `input_tokens`, `output_tokens`, `total_tokens`
- `create_response_stream()` returns an `Arc<OnceLock<Usage>>` companion that gets populated from the `response.completed` SSE event

### 2. Tool result size tracking (`src/agent.rs`) — done

`AgentEvent` already has all required variants:

```rust
pub enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
    Usage(Usage),
}
```

The agent loop accumulates `Usage` from each `create_response()` call and from the `create_response_stream()` companion `OnceLock`. After the stream completes, it yields the total via `AgentEvent::Usage(cumulative_usage)`.
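
A small sketch of the accumulation step, assuming `Usage` holds plain integer counters (the field types and the `cumulative_usage` helper are assumptions). Calls that reported no usage are skipped, matching the `Option<Usage>` on `ApiResponse`.

```rust
use std::ops::AddAssign;

#[derive(Debug, Clone, Copy, Default, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    total_tokens: u64,
}

impl AddAssign for Usage {
    fn add_assign(&mut self, other: Usage) {
        self.input_tokens += other.input_tokens;
        self.output_tokens += other.output_tokens;
        self.total_tokens += other.total_tokens;
    }
}

/// Sums usage over every LLM call made while answering one question.
fn cumulative_usage(per_call: &[Option<Usage>]) -> Usage {
    let mut total = Usage::default();
    // `flatten` drops the calls that returned no usage.
    for usage in per_call.iter().flatten() {
        total += *usage;
    }
    total
}
```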

### 3. Eval test file (`tests/eval.rs`) — new file

This is the bulk of the work. Contains:
- The `EvalCase` and `CaseResult` structs
- The trace capture function
- The eval case definitions
- The scoring logic
- The report printer

## Initial eval suite

These cases cover the current 9 tools across different question types. The expected/banned tools encode our understanding of *correct* agent behavior — this is the opinionated part.

### Category 1: Simple stat lookups

```
Case: "basic_dps"
  Q: "What is my total DPS and what main skill am I using?"
  Expected: [get_build_stats, get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: ["DPS"]

Case: "defensive_stats"
  Q: "How tanky is this build? What are my defenses?"
  Expected: [get_build_stats]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: ["life" or "energy shield", "armour" or "evasion"]

Case: "skill_gems"
  Q: "What support gems are linked to my main skill?"
  Expected: [get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []  // hard to assert specific gem names without knowing fixture
```
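
The `"life" or "energy shield"` alternatives in `defensive_stats` do not map directly onto a flat `answer_must_contain` list. One possible convention, sketched below purely as an assumption, is to let `|` separate acceptable alternatives within a single entry; `missing_facts` is a hypothetical helper name.

```rust
/// Returns the required facts that are absent from `answer`.
/// Matching is case-insensitive. Assumed convention: `|` inside an entry
/// separates alternatives, and any one of them satisfies the entry.
fn missing_facts(answer: &str, required: &[&str]) -> Vec<String> {
    let haystack = answer.to_lowercase();
    required
        .iter()
        .filter(|fact| {
            !fact
                .split('|')
                .any(|alt| haystack.contains(alt.trim().to_lowercase().as_str()))
        })
        .map(|fact| fact.to_string())
        .collect()
}
```

With this convention the case would be written as `["life|energy shield", "armour|evasion"]`. An entry with an empty alternative (for example a trailing `|`) would always match, so case definitions should avoid that.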

### Category 2: Gear inspection

```
Case: "specific_item"
  Q: "What weapon am I using?"
  Expected: [get_item]
  Banned: [get_passive_tree, get_jewel, query_passive_stats, get_unallocated_ascendancy]
  Max rounds: 2
  Answer contains: ["Weapon"]

Case: "missing_gear"
  Q: "Am I missing any gear? What slots are empty?"
  Expected: [get_empty_slots]
  Banned: [get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: []

Case: "gear_overview_then_detail"
  Q: "What's my worst piece of gear and how could I upgrade it?"
  Expected: [get_empty_slots]  // should scan first
  Banned: [get_jewel, query_passive_stats]
  Max rounds: 4  // may need multiple get_item calls
  Answer contains: ["upgrade"]
```

### Category 3: Passive tree

```
Case: "keystones"
  Q: "What keystones am I using?"
  Expected: [get_passive_tree]
  Banned: [get_item, get_config, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "jewel_inspection"
  Q: "What jewels do I have socketed in my passive tree?"
  Expected: [get_passive_tree, get_jewel]
  Banned: [get_item, get_config, query_passive_stats]
  Max rounds: 3  // tree first, then jewel calls
  Answer contains: []

Case: "stat_sourcing"
  Q: "How much fire damage am I getting from the passive tree, and is there more nearby?"
  Expected: [query_passive_stats]
  Banned: [get_item, get_jewel, get_config]
  Max rounds: 2
  Answer contains: ["fire damage"]
```

### Category 4: Ascendancy

```
Case: "ascendancy_recommendation"
  Q: "What ascendancy nodes should I take next?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "ascendancy_current"
  Q: "What ascendancy am I playing and what nodes do I have?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_config]
  Max rounds: 2
  Answer contains: []
```

### Category 5: Config and setup

```
Case: "build_config"
  Q: "What enemy level is my build configured for?"
  Expected: [get_config]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```

### Category 6: Multi-tool complex questions

```
Case: "full_build_review"
  Q: "Give me a quick overview of this build — what's working and what needs improvement?"
  Expected: [get_build_stats]
  Banned: []  // agent has latitude here
  Max rounds: 5
  Answer contains: []
  Max tokens: 8000

Case: "upgrade_priorities"
  Q: "What are the top 3 things I should upgrade on this build?"
  Expected: [get_build_stats]
  Banned: []
  Max rounds: 5
  Answer contains: []
  Max tokens: 10000
```

### Category 7: Edge cases

```
Case: "no_tools_needed"
  Q: "What is Path of Exile 2?"
  Expected: []  // should answer from knowledge, no tools
  Banned: []    // calling tools isn't wrong, just wasteful
  Max rounds: 1
  Answer contains: ["Path of Exile"]

Case: "ambiguous_item_slot"
  Q: "Show me my ring"
  Expected: [get_item]  // should pick a ring slot
  Banned: [get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```

## Report format

The eval prints a scorecard to stderr (like `profiling.rs`):

```
=== Agent Eval Report (model: gpt-4.1-nano) ===
  Fixture: ranger-with-gear.xml
  Cases:   15
  Passed:  12 (80%)
  Warned:  2
  Failed:  1

  --- Per-case results ---

  ✓ basic_dps                    2 rounds   1847 tok   3.2s
    tools: get_build_stats, get_skill_list
  ✓ defensive_stats              1 round    1203 tok   2.1s
    tools: get_build_stats
  ✗ specific_item                3 rounds   2891 tok   5.4s
    tools: get_build_stats(!), get_item
    WARN: unexpected tool get_build_stats (not banned, but extra)
    FAIL: 3 rounds > max 2
  ~ gear_overview_then_detail    4 rounds   4102 tok   7.8s
    tools: get_empty_slots, get_item, get_item, get_item
    WARN: token budget exceeded (4102 > 4000)

  --- Aggregate ---

  Tool selection accuracy:  87%  (13/15 cases: all expected tools called, none banned)
  No-banned-tool rate:     100%  (0/15 cases called a banned tool)
  Efficiency rate:          93%  (14/15 within round budget)
  Answer correctness:       80%  (12/15 all required facts present)
  Avg total tokens:        2340
  Avg latency:             3.8s
  Unnecessary call rate:    0.3 tools/question
```
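
A sketch of how one per-case header line of this scorecard might be formatted. The column widths are read off the sample above, and `case_line` is a hypothetical helper rather than an existing function.

```rust
/// Formats one per-case header line, e.g. the `basic_dps` row above.
fn case_line(mark: char, name: &str, rounds: usize, tokens: usize, secs: f64) -> String {
    // Singular/plural so "1 round" and "2 rounds" both read naturally.
    let rounds = if rounds == 1 {
        "1 round".to_string()
    } else {
        format!("{rounds} rounds")
    };
    // Name padded to 29 columns, round count to 8, latency to one decimal.
    format!("  {mark} {name:<29}{rounds:<8}   {tokens} tok   {secs:.1}s")
}
```

The runner would emit each line with `eprintln!` so the scorecard stays on stderr and shows up under `--nocapture`.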

## Running the eval

```bash
# Full suite (requires API key, costs real money)
OPENAI_API_KEY=sk-... cargo test --test eval -- --nocapture

# Specific case
cargo test --test eval basic_dps -- --nocapture

# With a different model (compare behavior across models)
OPENAI_MODEL=gpt-5-mini cargo test --test eval -- --nocapture

# Quick subset for CI (pick 3-4 fast representative cases)
cargo test --test eval -- --nocapture quick
```

Expect the full suite to take 30-90 seconds and cost ~$0.01-0.05 depending on model.
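
Because the suite needs a key and spends money, the runner can resolve its configuration up front and skip cleanly when no key is set. The sketch below assumes a default model of `gpt-4.1-nano` (taken from the sample report) and uses hypothetical names `EvalConfig`, `config_from`, and `config_from_env`.

```rust
struct EvalConfig {
    api_key: String,
    model: String,
}

/// Builds the config from any key lookup, so the logic is testable
/// without touching the real process environment.
fn config_from(lookup: impl Fn(&str) -> Option<String>) -> Option<EvalConfig> {
    // No key (or an empty one) means the eval should be skipped.
    let api_key = lookup("OPENAI_API_KEY").filter(|key| !key.is_empty())?;
    // Assumed default; override with OPENAI_MODEL to compare models.
    let model = lookup("OPENAI_MODEL").unwrap_or_else(|| "gpt-4.1-nano".to_string());
    Some(EvalConfig { api_key, model })
}

fn config_from_env() -> Option<EvalConfig> {
    config_from(|key| std::env::var(key).ok())
}
```

A test body could start with `let Some(cfg) = config_from_env() else { return; };` so that a plain `cargo test` without a key does not fail.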

## Gating new tools

Before adding a new tool to the agent:

1. **Run the eval suite** → record baseline scorecard
2. **Add the tool** (definition + execute_tool dispatch + system prompt guidance)
3. **Add 2-3 eval cases** for the new tool's intended use cases
4. **Run the eval suite again** → compare:
   - Did any *existing* cases regress? (Tool selection accuracy dropped, extra tools appearing)
   - Do the new cases pass? (Agent actually uses the tool correctly)
   - Did aggregate token cost increase significantly?

If existing cases regress, the tool description or system prompt needs work before merging. This gives you a principled answer to "should I add this tool?" instead of vibes.
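
The before/after comparison in step 4 could be automated along these lines. This is a sketch under assumptions: verdicts are keyed by case name, `Verdict` is ordered worst to best, and `regressions` is a hypothetical helper. Cases that exist only in the new run (the ones added for the new tool) are ignored here and judged on their own.

```rust
use std::collections::BTreeMap;

/// Ordered worst to best so that `after < before` means "got worse".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Verdict {
    Fail,
    Warn,
    Pass,
}

/// Names of existing cases whose verdict got worse after adding the tool.
fn regressions<'a>(
    baseline: &BTreeMap<&'a str, Verdict>,
    candidate: &BTreeMap<&'a str, Verdict>,
) -> Vec<&'a str> {
    baseline
        .iter()
        .filter(|(name, before)| {
            matches!(candidate.get(*name), Some(after) if after < *before)
        })
        .map(|(name, _)| *name)
        .collect()
}
```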

## Future extensions

Things explicitly out of scope for v1 but worth considering later:

- **Multiple fixtures**: Run the same eval cases against different builds (melee, caster, minion) to test generalization. The fixture is just a parameter.
- **LLM-as-judge for answer quality**: Instead of substring matching, use a cheap model to score "did the answer correctly use the data from the tools?" This catches cases where the agent calls the right tools but misinterprets the results.
- **Regression tracking**: Save eval results to a JSON file per run, compare across runs automatically. Plot trends over time.
- **Conversation eval**: Multi-turn cases where the second question builds on the first. Tests whether the agent avoids redundant tool calls when it already has data.
- **A/B on system prompt changes**: Run eval before/after a prompt edit to quantify impact.