Module eval

Pareto evaluation harness for [DerivePolicy].

§Role (Phase B step 23)

Liu et al. (arXiv:2512.22087), Lu et al. (arXiv:2510.06727), and ClawVM (arXiv:2604.10352) agree on one methodological point: a policy is never switched on aggregate accuracy alone. A Pareto comparison over (accuracy, context-cost, reuse-rate, oracle-gap) is the minimum bar. This module is the comparator half of that evaluation loop. The runner half — actually executing [derive_with_policy] over a fixture corpus and collecting PolicyRunResults — is a follow-up commit.
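The dominance rule behind such a comparator can be sketched standalone. This is a hypothetical illustration, not the module's actual implementation: it assumes lower is better for context_tokens, fault_count, and oracle_gap, higher is better for reuse_rate, and the `Sample`, `dominates`, and `frontier` names are invented for the sketch.

```rust
/// Hypothetical Pareto sample over the four axes named above.
#[derive(Debug, Clone, PartialEq)]
struct Sample {
    context_tokens: u64, // lower is better
    fault_count: u32,    // lower is better
    oracle_gap: u32,     // lower is better
    reuse_rate: f64,     // higher is better
}

/// `a` dominates `b` if it is no worse on every axis and strictly
/// better on at least one.
fn dominates(a: &Sample, b: &Sample) -> bool {
    let no_worse = a.context_tokens <= b.context_tokens
        && a.fault_count <= b.fault_count
        && a.oracle_gap <= b.oracle_gap
        && a.reuse_rate >= b.reuse_rate;
    let strictly_better = a.context_tokens < b.context_tokens
        || a.fault_count < b.fault_count
        || a.oracle_gap < b.oracle_gap
        || a.reuse_rate > b.reuse_rate;
    no_worse && strictly_better
}

/// Keep only the samples not dominated by any other sample.
fn frontier(samples: &[Sample]) -> Vec<&Sample> {
    samples
        .iter()
        .filter(|s| !samples.iter().any(|other| dominates(other, s)))
        .collect()
}

fn main() {
    let legacy = Sample { context_tokens: 24_000, fault_count: 12, oracle_gap: 4, reuse_rate: 0.50 };
    let reset = Sample { context_tokens: 6_500, fault_count: 5, oracle_gap: 2, reuse_rate: 0.62 };
    let points = vec![legacy, reset.clone()];
    let f = frontier(&points);
    // `reset` is no worse on every axis and strictly better on all four,
    // so it dominates `legacy` and is the whole frontier.
    assert_eq!(f.len(), 1);
    assert_eq!(*f[0], reset);
    println!("frontier size: {}", f.len());
}
```

Note the quadratic filter: every sample is checked against every other, which is fine at fixture-corpus scale; an asymptotically better frontier algorithm would only matter for far larger result sets.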

§Scope

The ClawVM Tier-1 fault regression suite is a sibling concern; those tests live as #[test] functions on the subsystems they exercise (compression, journal, session_recall) rather than as harness inputs.

§Examples

use codetether_agent::session::eval::{PolicyRunResult, pareto_frontier, reuse_rate};

let results = vec![
    PolicyRunResult {
        policy: "legacy",
        kept_messages: 30,
        context_tokens: 24_000,
        fault_count: 12,
        oracle_gap: 4,
        reuse_rate: 0.50,
    },
    PolicyRunResult {
        policy: "reset",
        kept_messages: 8,
        context_tokens: 6_500,
        fault_count: 5,
        oracle_gap: 2,
        reuse_rate: 0.62,
    },
];
let frontier = pareto_frontier(&results);
assert!(frontier.iter().any(|r| r.policy == "reset"));

// 4 of 8 context entries were already warm from the prior turn.
assert!((reuse_rate(&[4, 8]) - 0.5).abs() < 1e-9);

Structs§

PolicyRunResult
One Pareto sample for a single derivation policy run.

Functions§

pareto_frontier
Drop dominated points and return references to the frontier.
reuse_rate
Compute the reuse rate from (warm_hits, total).
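The `(warm_hits, total)` description above, together with the `reuse_rate(&[4, 8])` call in the example, can be reproduced as a one-liner. A minimal sketch under those assumptions, not the module's actual code; the zero-total guard is an invented convention:

```rust
/// Hypothetical sketch of a reuse-rate helper: fraction of context
/// entries that were already warm, given `[warm_hits, total]`.
fn reuse_rate(counts: &[u64; 2]) -> f64 {
    let [warm_hits, total] = *counts;
    if total == 0 {
        0.0 // assumed convention: an empty context reuses nothing
    } else {
        warm_hits as f64 / total as f64
    }
}

fn main() {
    // 4 of 8 context entries were warm from the prior turn.
    assert!((reuse_rate(&[4, 8]) - 0.5).abs() < 1e-9);
    println!("ok");
}
```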