Module eval

Evaluation dataset and scorer metadata.

These types make each experiment record explicit about what was measured and how it was scored. The current scorer is intentionally simple, but this module gives the optimizer a stable place to add policy-aligned and LLM-judge scoring.
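As a rough illustration of the idea, the sketch below shows one way dataset and scorer metadata could be attached to an experiment record. Every name in it (`EvalDataset`, `ScorerKind`, `ScorerMetadata`, `ExperimentRecord`, `exact_match_score`) is hypothetical and is not taken from this module's actual API; the exact-match scorer only stands in for "intentionally simple".

```rust
// NOTE: all types and functions here are hypothetical illustrations,
// not the real items exported by this module.

/// Describes which dataset an experiment was evaluated on.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalDataset {
    pub name: String,
    pub version: String,
    pub num_examples: usize,
}

/// Kinds of scorer a record can refer to. Only `ExactMatch` is
/// implemented in this sketch; the other variants mark where
/// policy-aligned or LLM-judge scoring could be recorded.
#[derive(Debug, Clone, PartialEq)]
pub enum ScorerKind {
    ExactMatch,
    PolicyAligned { policy_id: String },
    LlmJudge { model: String },
}

/// Describes how a score was produced.
#[derive(Debug, Clone, PartialEq)]
pub struct ScorerMetadata {
    pub kind: ScorerKind,
    pub version: u32,
}

/// An experiment record that states what was measured and how.
#[derive(Debug, Clone, PartialEq)]
pub struct ExperimentRecord {
    pub dataset: EvalDataset,
    pub scorer: ScorerMetadata,
    pub score: f64,
}

/// Fraction of predictions that equal the expected answer after
/// trimming whitespace. Returns 0.0 for empty or mismatched inputs.
pub fn exact_match_score(predictions: &[&str], expected: &[&str]) -> f64 {
    if expected.is_empty() || predictions.len() != expected.len() {
        return 0.0;
    }
    let hits = predictions
        .iter()
        .zip(expected)
        .filter(|(p, e)| p.trim() == e.trim())
        .count();
    hits as f64 / expected.len() as f64
}
```

Keeping the scorer kind and version next to the score means two records can be compared only when their metadata matches, which is the property the description above is after.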