pub struct LlmJudgeEvaluation {
pub judge_config: AgentLoopConfig,
pub system_prompt: Option<String>,
}
Uses a separate LLM call to judge which branch response is best.
§Judge prompt construction
The judge sees only clean, relevant content — never raw tool calls or intermediate steps from inside a branch:
- Prior conversation context (when present): the conversation history before
  the user query, formatted as a human-readable transcript. Only
  Content::Text survives; tool call arguments and images are stripped. Omitted when empty.
- Original query: text extracted from user messages in prompts
  (agent_loop mode), or from the last Message::User in
  context.messages[..original_context_len] (agent_loop_continue mode).
- Per-branch response: the final assistant text from the last
  Message::Assistant in outcome.new_messages. Tool calls, tool results, and all multi-turn exchanges within a branch are stripped. The judge evaluates outcomes, not the reasoning trace.
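The per-branch extraction described above might look roughly like the following sketch. The Content and Message enums here are hypothetical stand-ins written for illustration only; the crate's real types very likely have different variants and fields, and final_assistant_text is not a function the crate is known to expose.

```rust
// Hypothetical stand-ins for the crate's message types (assumption:
// the real definitions carry more variants and fields).
#[derive(Debug, Clone, PartialEq)]
pub enum Content {
    Text(String),
    ToolCall { name: String, arguments: String },
    Image(Vec<u8>),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Message {
    User(Vec<Content>),
    Assistant(Vec<Content>),
    ToolResult(String),
}

/// Joins only the `Content::Text` parts, dropping tool calls and images.
fn text_only(parts: &[Content]) -> String {
    parts
        .iter()
        .filter_map(|c| match c {
            Content::Text(t) => Some(t.as_str()),
            _ => None,
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Returns the text of the last assistant message in a branch, if any.
/// Earlier assistant turns, tool calls, and tool results are ignored.
pub fn final_assistant_text(new_messages: &[Message]) -> Option<String> {
    new_messages.iter().rev().find_map(|m| match m {
        Message::Assistant(parts) => Some(text_only(parts)),
        _ => None,
    })
}
```

Searching from the end means intermediate assistant turns that only issued tool calls never reach the judge, which matches the "outcomes, not the reasoning trace" intent.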
§agent_loop_continue mode
When prompts is empty (continue mode), the judge locates the last
Message::User in context.messages[..original_context_len] as the query.
Everything before that message becomes the prior conversation context.
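As a minimal sketch of the continue-mode lookup, assuming a simplified, hypothetical Msg enum in place of the crate's Message type, the split into prior context and query could be written as:

```rust
// Hypothetical, simplified message type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Msg {
    User(String),
    Assistant(String),
}

/// Splits the original context into (prior context, query text).
/// Only `messages[..original_context_len]` is searched, so messages
/// appended by a branch are never mistaken for the query.
pub fn split_at_last_user(
    messages: &[Msg],
    original_context_len: usize,
) -> Option<(&[Msg], &str)> {
    // Clamp defensively so an oversized length cannot panic (assumption).
    let end = original_context_len.min(messages.len());
    let original = &messages[..end];
    // Index of the last user message within the original context.
    let idx = original.iter().rposition(|m| matches!(m, Msg::User(_)))?;
    match &original[idx] {
        Msg::User(text) => Some((&original[..idx], text.as_str())),
        _ => None,
    }
}
```

Returning None when no user message exists leaves the caller to decide how to handle a context without a query; the source does not say what the real implementation does in that case.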
§Judge’s comprehension criteria
All N branch final responses (plus prior context) must fit in the judge model’s
context window simultaneously for a fair comparison. The token budget is
derived from judge_config.context_config.max_context_tokens (if set).
When no context limit is configured, all content is passed through as-is.
§2-iteration compaction strategy
When combined content exceeds the budget, compaction is applied in two iterations:
Iteration 1: compact prior context only, outputs intact. The prior context is reduced through 3 progressive tiers while branch outputs are preserved verbatim:
- Tier 1: keep only the last 80 lines.
- Tier 2: keep first paragraph + last paragraph only.
- Tier 3: hard char limit derived from remaining budget.
Iteration 2: compact both independently (if iteration 1 is insufficient). Context stays at tier-3 form; branch outputs are now compacted independently through the same 3-tier pipeline.
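The three tiers can be illustrated with the sketch below. It rests on several assumptions not confirmed by the source: the budget is measured in characters rather than tokens, paragraphs are separated by blank lines, tier 2 is applied to the original text rather than to the tier-1 output, and tier 3 keeps the head of the text. All function names are hypothetical.

```rust
/// Tier 1: keep only the last `n` lines.
pub fn keep_last_lines(text: &str, n: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Tier 2: keep the first and last paragraph.
/// Assumption: paragraphs are separated by a blank line.
pub fn keep_first_and_last_paragraph(text: &str) -> String {
    let paras: Vec<&str> = text
        .split("\n\n")
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect();
    match paras.as_slice() {
        [] => String::new(),
        [only] => only.to_string(),
        [first, .., last] => format!("{first}\n\n{last}"),
    }
}

/// Tier 3: hard character limit, cut on a char boundary.
/// Assumption: the head of the text is kept.
pub fn hard_char_limit(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Tries each tier in turn and stops at the first result that fits.
/// Assumption: `budget_chars` stands in for the real token budget.
pub fn compact(text: &str, budget_chars: usize) -> String {
    let fits = |s: &str| s.chars().count() <= budget_chars;
    if fits(text) {
        return text.to_string();
    }
    let tier1 = keep_last_lines(text, 80);
    if fits(&tier1) {
        return tier1;
    }
    let tier2 = keep_first_and_last_paragraph(text);
    if fits(&tier2) {
        return tier2;
    }
    hard_char_limit(&tier2, budget_chars)
}
```

Under this reading, iteration 1 would call compact on the prior context alone, and iteration 2 would additionally call it once per branch output with that branch's share of the remaining budget.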
An AgentEvent::ProgressMessage warning is emitted to tx if the budget
cannot be satisfied after both iterations.
The judge’s decision applies to the original (uncompacted) branch responses.
ParallelLoopResult::selected_messages always contains the uncompacted winner.
§Response parsing
The judge’s reply is scanned for the first numeric token (e.g., “1”, “2”, “Response 2”). Falls back to index 0 if no number is found or parsing fails.
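A minimal sketch of this parsing rule follows. Two points are assumptions, since the source does not specify them: that the judge numbers responses starting from 1 (so "Response 2" maps to index 1), and that an out-of-range number is treated like a parse failure. The function name parse_judge_choice is hypothetical.

```rust
/// Returns a 0-based branch index parsed from the judge's reply.
/// Assumptions: responses are labeled 1..=branch_count, and an
/// out-of-range or unparsable number falls back to index 0.
pub fn parse_judge_choice(reply: &str, branch_count: usize) -> usize {
    // Collect the first contiguous run of ASCII digits in the reply.
    let digits: String = reply
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit())
        .collect();
    match digits.parse::<usize>() {
        Ok(n) if (1..=branch_count).contains(&n) => n - 1,
        _ => 0,
    }
}
```

Because only the first numeric token is read, a reply such as "Response 2 is better than 1" selects the second branch.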
§Session traceability
The judge loop inherits the session_id from the branches so all events
(including the judge’s AgentStart) are visible in the same session trace.
Fields§
§judge_config: AgentLoopConfig
Config for the judge LLM call. Set context_config.max_context_tokens
to enable the comprehension-criteria compaction check.
§system_prompt: Option<String>
Optional system prompt override. When None, a built-in evaluation prompt is used.