Evaluation dataset and scorer metadata.
These types make each experiment record explicit about what was measured (the evaluation dataset) and how it was scored (the scorer). The current scorer is intentionally simple, but the metadata gives the optimizer a stable place to add policy-aligned and LLM-judge scoring later.
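As a rough illustration of how such metadata might sit alongside a score, here is a minimal sketch. Every name in it (`EvalDatasetInfo`, `ScorerKind`, `ScorerInfo`, `ExperimentRecord`, `exact_match_score`, `example_record`) is hypothetical and chosen for this example only; none of them is taken from this module's actual API, and the exact-match scorer merely stands in for "intentionally simple".

```rust
/// Hypothetical: identifies the dataset an experiment was evaluated on.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalDatasetInfo {
    pub name: String,
    pub version: String,
    pub num_examples: usize,
}

/// Hypothetical: which kind of scorer produced a score.
/// The last two variants only show where richer scorers could slot in.
#[derive(Debug, Clone, PartialEq)]
pub enum ScorerKind {
    ExactMatch,
    PolicyAligned { policy_id: String },
    LlmJudge { model: String },
}

/// Hypothetical: scorer metadata stored next to every score.
#[derive(Debug, Clone, PartialEq)]
pub struct ScorerInfo {
    pub kind: ScorerKind,
    pub version: u32,
}

/// Hypothetical: one experiment record, stating what was measured and how.
#[derive(Debug, Clone, PartialEq)]
pub struct ExperimentRecord {
    pub dataset: EvalDatasetInfo,
    pub scorer: ScorerInfo,
    pub score: f64,
}

/// A deliberately simple scorer: the fraction of predictions that equal
/// the expected output at the same index. Returns 0.0 for an empty dataset.
pub fn exact_match_score(predictions: &[&str], expected: &[&str]) -> f64 {
    if expected.is_empty() {
        return 0.0;
    }
    let hits = predictions
        .iter()
        .zip(expected.iter())
        .filter(|(p, e)| p == e)
        .count();
    hits as f64 / expected.len() as f64
}

/// Builds a sample record from made-up data (illustration only).
pub fn example_record() -> ExperimentRecord {
    let expected = ["4", "9", "16"];
    let predictions = ["4", "9", "15"];
    ExperimentRecord {
        dataset: EvalDatasetInfo {
            name: "toy-squares".to_string(),
            version: "v1".to_string(),
            num_examples: expected.len(),
        },
        scorer: ScorerInfo { kind: ScorerKind::ExactMatch, version: 1 },
        score: exact_match_score(&predictions, &expected),
    }
}
```

Under this sketch, moving to a different scorer would change only the `ScorerInfo` stored in the record, so scores from different scorers or scorer versions stay distinguishable after the fact.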