poe2-agent 0.5.0

AI agent for Path of Exile 2 build analysis
# Agent Eval System

Spec for measuring whether the PoE2 build agent uses its tools well — both individually and as a set.

## Problem

We keep adding query tools (9 today, 13+ planned) with no systematic way to answer:

- Does the agent call the **right** tools for a given question?
- Does it call **unnecessary** tools (wasting latency and tokens)?
- Does it **miss** tools it should have called?
- How many LLM round-trips does it take (each costs 1-3s)?
- How many tokens flow through the context window per question?
- Does adding a new tool degrade selection accuracy for existing questions?

The existing `profiling.rs` measures wall-clock time for a single hardcoded question. It tells us PoB isn't the bottleneck but says nothing about agent *behavior*.

## What we measure

### Per-question trace

Every eval run captures a **Trace** for each question:

| Field | Source | What it tells us |
|---|---|---|
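| `tool_calls: Vec<(name, args, order)>` | AgentEvent stream | Which tools were called and in what sequence (`args` requires `ToolCall` to carry arguments; the variant shown in this spec exposes only `name`) |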
| `tool_rounds` | Count of loop iterations | How many LLM↔tool round-trips occurred |
| `tool_response_sizes: Vec<(name, bytes)>` | Tool execution output | How fat each tool response is in the context |
| `prompt_tokens` | OpenAI `usage` field | Total input tokens across all LLM calls |
| `completion_tokens` | OpenAI `usage` field | Total output tokens across all LLM calls |
| `total_tokens` | Sum | Full token cost of the question |
| `time_to_first_token` | Wall clock | Latency before the user sees anything |
| `total_time` | Wall clock | End-to-end latency |
| `final_answer` | Collected token stream | The text the user actually sees |

### Aggregated across the eval suite

| Metric | Definition |
|---|---|
| **Tool selection accuracy** | % of cases where called tools ⊇ expected AND called tools ∩ banned = ∅ |
| **Tool efficiency** | % of cases within max_tool_rounds budget |
| **Answer correctness** | % of cases where all required facts appear in the answer |
| **Avg tokens per question** | Mean total_tokens across all cases |
| **Avg latency** | Mean total_time across all cases |
| **Unnecessary call rate** | Mean count of tools called that weren't in expected set |
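
The aggregate definitions above reduce to a few folds over per-case outcomes. The following is a minimal sketch, not the project's implementation: `Outcome`, `Aggregate`, `fraction`, and `aggregate` are hypothetical names, and rates are returned as fractions rather than percentages.

```rust
/// Per-case outcome; a hypothetical subset of what the scorer records.
struct Outcome {
    missed_expected: usize, // expected tools that were never called
    banned_called: usize,   // banned tools that were called
    extra_called: usize,    // tools neither expected nor banned
    within_round_budget: bool,
    all_facts_found: bool,
    total_tokens: usize,
    total_time_secs: f64,
}

/// Suite-level metrics. Rates are fractions in 0.0..=1.0.
#[derive(Debug)]
struct Aggregate {
    tool_selection_accuracy: f64,
    tool_efficiency: f64,
    answer_correctness: f64,
    avg_tokens: f64,
    avg_latency_secs: f64,
    unnecessary_call_rate: f64, // mean extra tools per question
}

/// Fraction of outcomes satisfying `pred`. Callers guarantee non-empty input.
fn fraction(outcomes: &[Outcome], pred: impl Fn(&Outcome) -> bool) -> f64 {
    outcomes.iter().filter(|o| pred(*o)).count() as f64 / outcomes.len() as f64
}

fn aggregate(outcomes: &[Outcome]) -> Option<Aggregate> {
    if outcomes.is_empty() {
        return None; // no cases, no meaningful averages
    }
    let n = outcomes.len() as f64;
    Some(Aggregate {
        // Selection is correct only if nothing expected was missed
        // and nothing banned was called.
        tool_selection_accuracy: fraction(outcomes, |o| {
            o.missed_expected == 0 && o.banned_called == 0
        }),
        tool_efficiency: fraction(outcomes, |o| o.within_round_budget),
        answer_correctness: fraction(outcomes, |o| o.all_facts_found),
        avg_tokens: outcomes.iter().map(|o| o.total_tokens as f64).sum::<f64>() / n,
        avg_latency_secs: outcomes.iter().map(|o| o.total_time_secs).sum::<f64>() / n,
        unnecessary_call_rate: outcomes.iter().map(|o| o.extra_called as f64).sum::<f64>() / n,
    })
}
```

Returning `None` for an empty suite avoids reporting `NaN` averages when a filter matches no cases.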

## Architecture

Three layers, all in the test crate. The only library-side change is the additive `AgentEvent::ToolResult` variant recommended under Trace capture.

```
┌─────────────────────────────────┐
│         Eval Runner             │  tests/eval.rs
│  Iterates cases, collects       │
│  traces, scores, reports        │
├─────────────────────────────────┤
│         Trace Capture           │  Wraps ToolAgent::respond()
│  Consumes AgentEvent stream,    │  stream, measures everything
│  records tool calls, tokens,    │
│  timing, answer text            │
├─────────────────────────────────┤
│         Eval Cases              │  tests/eval_cases.rs or inline
│  Question + expected behavior   │
│  + scoring thresholds           │
└─────────────────────────────────┘
```

### Trace capture

The trace capture layer consumes the `AgentEvent` stream from `ToolAgent::respond()` and records everything. This works without modifying the agent internals — we already get `ToolCall { name }` and `Token(text)` events.

For token counts, `create_response()` returns `Usage` in `ApiResponse.usage`. For `create_response_stream()`, usage arrives in the `response.completed` SSE event and is stored in an `Arc<OnceLock<Usage>>` companion that the caller reads after the stream completes. The agent loop accumulates usage per question and exposes it via `AgentEvent::Usage(Usage)` emitted at the end of each response. This keeps usage flow explicit in the type system — no hidden mutable state, no reset-between-questions footgun.

For tool response sizes, the trace layer needs the size of each tool call's JSON output. There are three options:
- Expose `execute_tool` results to the trace layer (needs a richer `AgentEvent` variant like `ToolResult { name, size_bytes }`), or
- Measure tool responses independently by running the same queries through `PobParser` (wasteful), or
- Accept that we know tool response sizes from `profiling.rs` already and skip per-eval-run measurement.

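**Recommendation**: Add `AgentEvent::ToolResult { name, size_bytes }` as a new event variant. Callers whose `match` on `AgentEvent` has a wildcard arm are unaffected, but an exhaustive `match` without one stops compiling (a non-exhaustive-patterns error, not a warning), so the addition is backward-compatible only for the former. Marking the enum `#[non_exhaustive]` would make future additions non-breaking for downstream crates. This is the only library-side change.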
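
The capture step is essentially a fold over the event stream. Below is a minimal sketch under two loud assumptions: the `AgentEvent` enum is a local stand-in that includes the recommended `ToolResult` variant, and the events arrive through a plain synchronous iterator, whereas the real `ToolAgent::respond()` stream is presumably async. `Trace` and `capture` are hypothetical names.

```rust
use std::time::{Duration, Instant};

/// Local stand-in for the crate's `AgentEvent`; only the variants named
/// in this spec are assumed.
enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
}

/// What the eval records for one question.
#[derive(Default)]
struct Trace {
    tool_calls: Vec<String>,                   // Vec order is call order
    tool_response_sizes: Vec<(String, usize)>, // (tool name, bytes)
    time_to_first_token: Option<Duration>,     // None if no token arrived
    total_time: Duration,
    final_answer: String,
}

/// Drain the event source and record everything the eval needs.
fn capture(events: impl IntoIterator<Item = AgentEvent>) -> Trace {
    let start = Instant::now();
    let mut trace = Trace::default();
    for event in events {
        match event {
            AgentEvent::ToolCall { name } => trace.tool_calls.push(name),
            AgentEvent::ToolResult { name, size_bytes } => {
                trace.tool_response_sizes.push((name, size_bytes));
            }
            AgentEvent::Token(text) => {
                // Latency is measured at the first token only.
                if trace.time_to_first_token.is_none() {
                    trace.time_to_first_token = Some(start.elapsed());
                }
                trace.final_answer.push_str(&text);
            }
        }
    }
    trace.total_time = start.elapsed();
    trace
}
```

Note that `tool_rounds` cannot be derived from these events alone, since one round may batch several tool calls; counting rounds would need a round-boundary signal from the agent loop.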

### Eval case format

```rust
struct EvalCase {
    /// Human-readable name for the report.
    name: &'static str,

    /// The question to ask the agent.
    question: &'static str,

    /// Tools that SHOULD be called. The agent may call them in any order.
    /// Scoring: each missing expected tool is a penalty.
    expected_tools: &'static [&'static str],

    /// Tools that MUST NOT be called. Each banned tool called is a penalty.
    banned_tools: &'static [&'static str],

    /// Maximum acceptable tool-calling rounds (LLM round-trips, not
    /// individual tool calls — one round can batch multiple calls).
    max_tool_rounds: usize,

    /// Substrings or patterns that MUST appear in the final answer.
    /// Case-insensitive matching. Use for factual assertions.
    answer_must_contain: &'static [&'static str],

    /// Maximum total tokens (prompt + completion) across all LLM calls.
    /// Acts as a budget ceiling — exceeding this is a warning, not a failure.
    /// Set generously at first, tighten as you learn typical costs.
    max_total_tokens: Option<usize>,
}
```

### Scoring

Each case produces a `CaseResult`:

```rust
struct CaseResult {
    name: String,
    // Tool selection
    expected_tools_called: Vec<String>,    // ✓ expected and present
    expected_tools_missed: Vec<String>,    // ✗ expected but absent
    banned_tools_called: Vec<String>,      // ✗ banned but called
    extra_tools_called: Vec<String>,       // ~ not expected, not banned

    // Efficiency
    tool_rounds: usize,
    rounds_over_budget: bool,              // tool_rounds > max_tool_rounds

    // Correctness
    facts_found: Vec<String>,             // ✓ pattern found in answer
    facts_missing: Vec<String>,           // ✗ pattern not found

    // Cost
    total_tokens: usize,
    tokens_over_budget: bool,

    // Timing
    total_time: Duration,
    time_to_first_token: Duration,
}
```

A case **passes** when:
- All expected tools were called
- No banned tools were called
- Tool rounds ≤ budget
- All answer facts found

A case **warns** when:
- Extra (unexpected, unbanned) tools were called
- Token budget exceeded

A case **fails** when any pass condition is violated.
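
The pass/warn/fail rules above can be sketched as a single function. This is an illustration rather than the final scorer: `Case`, `Observed`, `Verdict`, and `verdict` are hypothetical names, and the structs carry only the fields the rules need.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Warn,
    Fail,
}

/// The parts of an eval case that the rules consult.
struct Case<'a> {
    expected_tools: &'a [&'a str],
    banned_tools: &'a [&'a str],
    max_tool_rounds: usize,
    answer_must_contain: &'a [&'a str],
    max_total_tokens: Option<usize>,
}

/// What was observed when the agent answered the question.
struct Observed<'a> {
    tools_called: &'a [&'a str],
    tool_rounds: usize,
    total_tokens: usize,
    final_answer: &'a str,
}

fn verdict(case: &Case, seen: &Observed) -> Verdict {
    let called: HashSet<&str> = seen.tools_called.iter().copied().collect();
    let answer = seen.final_answer.to_lowercase();

    // Any violated pass condition is a hard failure.
    let missed = case.expected_tools.iter().any(|t| !called.contains(t));
    let banned = case.banned_tools.iter().any(|t| called.contains(t));
    let over_rounds = seen.tool_rounds > case.max_tool_rounds;
    let facts_missing = case
        .answer_must_contain
        .iter()
        .any(|f| !answer.contains(f.to_lowercase().as_str()));
    if missed || banned || over_rounds || facts_missing {
        return Verdict::Fail;
    }

    // Extra tools and token overruns only downgrade a pass to a warning.
    let extra = called
        .iter()
        .any(|t| !case.expected_tools.contains(t) && !case.banned_tools.contains(t));
    let over_tokens = case
        .max_total_tokens
        .map_or(false, |max| seen.total_tokens > max);
    if extra || over_tokens {
        Verdict::Warn
    } else {
        Verdict::Pass
    }
}
```

Checking the fail conditions first means a case that both fails and would warn is reported as a failure, which matches the rule that any violated pass condition fails the case.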

## Required code changes

### 1. Token usage tracking (`src/llm.rs`) — done

Usage is already tracked:

- `create_response()` returns `ApiResponse` with `usage: Option<Usage>` containing `input_tokens`, `output_tokens`, `total_tokens`
- `create_response_stream()` returns an `Arc<OnceLock<Usage>>` companion that gets populated from the `response.completed` SSE event

### 2. Tool result size tracking (`src/agent.rs`) — done

`AgentEvent` already has all required variants:

```rust
pub enum AgentEvent {
    ToolCall { name: String },
    ToolResult { name: String, size_bytes: usize },
    Token(String),
    Usage(Usage),
}
```

The agent loop accumulates `Usage` from each `create_response()` call and from the `create_response_stream()` companion `OnceLock`. After the stream completes, it yields the total via `AgentEvent::Usage(cumulative_usage)`.
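
As an illustration of that accumulation, here is a minimal sketch. The `Usage` struct is a local stand-in using the field names listed in section 1 (the real field types are not stated in this spec, so `u64` is an assumption), and `stream_usage` and `cumulative_usage` are hypothetical helpers.

```rust
use std::ops::AddAssign;
use std::sync::{Arc, OnceLock};

/// Stand-in for the crate's `Usage`; integer widths are assumed.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    total_tokens: u64,
}

impl AddAssign for Usage {
    fn add_assign(&mut self, rhs: Usage) {
        self.input_tokens += rhs.input_tokens;
        self.output_tokens += rhs.output_tokens;
        self.total_tokens += rhs.total_tokens;
    }
}

/// Read the companion slot after the stream has finished. Returns `None`
/// if the `response.completed` event never populated it.
fn stream_usage(slot: &Arc<OnceLock<Usage>>) -> Option<Usage> {
    slot.get().copied()
}

/// Sum usage over every LLM call made for one question. Calls that
/// reported no usage (`None`) contribute nothing.
fn cumulative_usage(per_call: impl IntoIterator<Item = Option<Usage>>) -> Usage {
    let mut total = Usage::default();
    for usage in per_call.into_iter().flatten() {
        total += usage;
    }
    total
}
```

Because the total is built fresh for each question and returned by value, there is no shared counter to reset between questions.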

### 3. Eval test file (`tests/eval.rs`) — new file

This is the bulk of the work. Contains:
- The `EvalCase` and `CaseResult` structs
- The trace capture function
- The eval case definitions
- The scoring logic
- The report printer

## Initial eval suite

These cases cover the current 9 tools across different question types. The expected/banned tools encode our understanding of *correct* agent behavior — this is the opinionated part.

### Category 1: Simple stat lookups

```
Case: "basic_dps"
  Q: "What is my total DPS and what main skill am I using?"
  Expected: [get_build_stats, get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: ["DPS"]

Case: "defensive_stats"
  Q: "How tanky is this build? What are my defenses?"
  Expected: [get_build_stats]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: ["life" or "energy shield", "armour" or "evasion"]

Case: "skill_gems"
  Q: "What support gems are linked to my main skill?"
  Expected: [get_skill_list]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []  // hard to assert specific gem names without knowing fixture
```
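
The `"life" or "energy shield"` notation in `defensive_stats` does not fit a flat `&[&str]` of required substrings. One possible encoding, offered as an assumption rather than the chosen design, is to make each required fact a list of alternatives and treat it as found when any alternative appears. `facts_missing` is a hypothetical helper.

```rust
/// Return every fact for which none of the alternatives occurs in the
/// answer. Matching is a case-insensitive substring test.
fn facts_missing<'a>(answer: &str, facts: &[&'a [&'a str]]) -> Vec<&'a [&'a str]> {
    let answer = answer.to_lowercase();
    facts
        .iter()
        .copied()
        .filter(|alts| {
            // A fact is missing only if no alternative matches.
            !alts
                .iter()
                .any(|alt| answer.contains(alt.to_lowercase().as_str()))
        })
        .collect()
}
```

With this shape, a plain required substring such as `"DPS"` is simply a fact with one alternative.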

### Category 2: Gear inspection

```
Case: "specific_item"
  Q: "What weapon am I using?"
  Expected: [get_item]
  Banned: [get_passive_tree, get_jewel, query_passive_stats, get_unallocated_ascendancy]
  Max rounds: 2
  Answer contains: ["Weapon"]

Case: "missing_gear"
  Q: "Am I missing any gear? What slots are empty?"
  Expected: [get_empty_slots]
  Banned: [get_passive_tree, get_jewel, query_passive_stats]
  Max rounds: 2
  Answer contains: []

Case: "gear_overview_then_detail"
  Q: "What's my worst piece of gear and how could I upgrade it?"
  Expected: [get_empty_slots]  // should scan first
  Banned: [get_jewel, query_passive_stats]
  Max rounds: 4  // may need multiple get_item calls
  Answer contains: ["upgrade"]
```

### Category 3: Passive tree

```
Case: "keystones"
  Q: "What keystones am I using?"
  Expected: [get_passive_tree]
  Banned: [get_item, get_config, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "jewel_inspection"
  Q: "What jewels do I have socketed in my passive tree?"
  Expected: [get_passive_tree, get_jewel]
  Banned: [get_item, get_config, query_passive_stats]
  Max rounds: 3  // tree first, then jewel calls
  Answer contains: []

Case: "stat_sourcing"
  Q: "How much fire damage am I getting from the passive tree, and is there more nearby?"
  Expected: [query_passive_stats]
  Banned: [get_item, get_jewel, get_config]
  Max rounds: 2
  Answer contains: ["fire damage"]
```

### Category 4: Ascendancy

```
Case: "ascendancy_recommendation"
  Q: "What ascendancy nodes should I take next?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_jewel]
  Max rounds: 2
  Answer contains: []

Case: "ascendancy_current"
  Q: "What ascendancy am I playing and what nodes do I have?"
  Expected: [get_unallocated_ascendancy]
  Banned: [get_item, get_config]
  Max rounds: 2
  Answer contains: []
```

### Category 5: Config and setup

```
Case: "build_config"
  Q: "What enemy level is my build configured for?"
  Expected: [get_config]
  Banned: [get_item, get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```

### Category 6: Multi-tool complex questions

```
Case: "full_build_review"
  Q: "Give me a quick overview of this build — what's working and what needs improvement?"
  Expected: [get_build_stats]
  Banned: []  // agent has latitude here
  Max rounds: 5
  Answer contains: []
  Max tokens: 8000

Case: "upgrade_priorities"
  Q: "What are the top 3 things I should upgrade on this build?"
  Expected: [get_build_stats]
  Banned: []
  Max rounds: 5
  Answer contains: []
  Max tokens: 10000
```

### Category 7: Edge cases

```
Case: "no_tools_needed"
  Q: "What is Path of Exile 2?"
  Expected: []  // should answer from knowledge, no tools
  Banned: []    // calling tools isn't wrong, just wasteful
  Max rounds: 1
  Answer contains: ["Path of Exile"]

Case: "ambiguous_item_slot"
  Q: "Show me my ring"
  Expected: [get_item]  // should pick a ring slot
  Banned: [get_passive_tree, get_jewel]
  Max rounds: 2
  Answer contains: []
```

## Report format

The eval prints a scorecard to stderr (like `profiling.rs`). Sample output, with illustrative numbers:

```
=== Agent Eval Report (model: gpt-4.1-nano) ===
  Fixture: ranger-with-gear.xml
  Cases:   15
  Passed:  12 (80%)
  Warned:  2
  Failed:  1

  --- Per-case results ---

  ✓ basic_dps                    2 rounds   1847 tok   3.2s
    tools: get_build_stats, get_skill_list
  ✓ defensive_stats              1 round    1203 tok   2.1s
    tools: get_build_stats
  ✗ specific_item                3 rounds   2891 tok   5.4s
    tools: get_build_stats(!), get_item
    WARN: unexpected tool get_build_stats (not banned, but extra)
    FAIL: 3 rounds > max 2
  ~ gear_overview_then_detail    4 rounds   4102 tok   7.8s
    tools: get_empty_slots, get_item, get_item, get_item
    WARN: token budget exceeded (4102 > 4000)

  --- Aggregate ---

  Tool selection accuracy:  87%  (13/15 cases: all expected tools called, none banned)
  No-banned-tool rate:     100%  (0/15 cases called a banned tool)
  Efficiency rate:          93%  (14/15 within round budget)
  Answer correctness:       80%  (12/15 all required facts present)
  Avg total tokens:        2340
  Avg latency:             3.8s
  Unnecessary call rate:    0.3 tools/question
```
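
A per-case line in that layout can be produced with fixed-width format specifiers. The sketch below is an assumption about the printer: `case_line` is a hypothetical helper and the column widths are read off the sample report, not taken from existing code.

```rust
/// Format one per-case scorecard line. `mark` is '✓', '~' or '✗'.
/// Widths (29 for the name, 8 for the rounds column) are assumptions
/// inferred from the sample report.
fn case_line(mark: char, name: &str, rounds: usize, tokens: usize, secs: f64) -> String {
    // Singular for exactly one round, plural otherwise.
    let unit = if rounds == 1 { "round" } else { "rounds" };
    let rounds_text = format!("{rounds} {unit}");
    format!("  {mark} {name:<29}{rounds_text:<8}   {tokens} tok   {secs:.1}s")
}
```

Printing the result with `eprintln!` sends it to stderr, so it shows up under `--nocapture` alongside the rest of the report.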

## Running the eval

```bash
# Full suite (requires API key, costs real money)
OPENAI_API_KEY=sk-... cargo test --test eval -- --nocapture

# Specific case
cargo test --test eval basic_dps -- --nocapture

# With a different model (compare behavior across models)
OPENAI_MODEL=gpt-5-mini cargo test --test eval -- --nocapture

# Quick subset for CI (pick 3-4 fast representative cases)
cargo test --test eval -- --nocapture quick
```

Expect the full suite to take 30-90 seconds and cost ~$0.01-0.05 depending on model.

## Gating new tools

Before adding a new tool to the agent:

1. **Run the eval suite** → record baseline scorecard
2. **Add the tool** (definition + execute_tool dispatch + system prompt guidance)
3. **Add 2-3 eval cases** for the new tool's intended use cases
4. **Run the eval suite again** → compare:
   - Did any *existing* cases regress? (Tool selection accuracy dropped, extra tools appearing)
   - Do the new cases pass? (Agent actually uses the tool correctly)
   - Did aggregate token cost increase significantly?

If existing cases regress, the tool description or system prompt needs work before merging. This gives you a principled answer to "should I add this tool?" instead of vibes.
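
The before/after comparison in step 4 can be mechanized once each run is reduced to a verdict per case. A minimal sketch, assuming verdicts are keyed by case name; `Verdict` and `regressions` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Ordered from worst to best so that `<` means "got worse".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Verdict {
    Fail,
    Warn,
    Pass,
}

/// Names of baseline cases whose verdict got worse, or that disappeared,
/// in the new run. Cases that exist only in `after` (the new tool's own
/// cases) are not regressions.
fn regressions<'a>(
    before: &BTreeMap<&'a str, Verdict>,
    after: &BTreeMap<&'a str, Verdict>,
) -> Vec<&'a str> {
    before
        .iter()
        .filter(|&(name, old)| after.get(name).map_or(true, |new| new < old))
        .map(|(name, _)| *name)
        .collect()
}
```

An empty result means no existing case regressed. The new cases still need to pass on their own, and aggregate token cost should be compared separately.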

## Future extensions

Things explicitly out of scope for v1 but worth considering later:

- **Multiple fixtures**: Run the same eval cases against different builds (melee, caster, minion) to test generalization. The fixture is just a parameter.
- **LLM-as-judge for answer quality**: Instead of substring matching, use a cheap model to score "did the answer correctly use the data from the tools?" This catches cases where the agent calls the right tools but misinterprets the results.
- **Regression tracking**: Save eval results to a JSON file per run, compare across runs automatically. Plot trends over time.
- **Conversation eval**: Multi-turn cases where the second question builds on the first. Tests whether the agent avoids redundant tool calls when it already has data.
- **A/B on system prompt changes**: Run eval before/after a prompt edit to quantify impact.