Axis 9: schema / format conformance rate.
Intent-gated on the baseline side. A pair is counted (and scored) if the baseline response has EITHER:
- JSON-text intent: its text starts with `{` or `[` (after fence-strip). Both sides are scored on whether their text parses.
- Tool-use intent: it emits at least one `tool_use` block with a dict-shaped `input`. Both sides are scored on whether their (first) `tool_use`'s input keys match the baseline's. This covers the common agent pattern where the "structured final answer" is a tool call (e.g. `submit_answer(...)`) rather than JSON text: the same conformance question in a different syntax.
Pairs where baseline has neither JSON intent nor tool_use intent are excluded (we don’t penalise the candidate for not structuring when nothing asked for structure).
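The gating and scoring rules above can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation: `Response`, `Intent`, `strip_fence`, `baseline_intent`, `score_pair`, and `conformance_rates` are hypothetical names, and the real signature of `compute` is not shown in this page. The sketch also assumes that JSON-text intent is checked before tool-use intent, that the candidate's text is fence-stripped before parsing, and that "keys match" means set equality. The JSON parser is supplied by the caller (in practice this could be something like `serde_json::from_str::<serde_json::Value>`).

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified response model (not the crate's real types).
pub struct Response {
    pub text: String,
    /// Input keys of each tool_use block with a dict-shaped input, in emission order.
    pub tool_inputs: Vec<BTreeSet<String>>,
}

#[derive(Debug, PartialEq)]
pub enum Intent {
    JsonText,
    ToolUse,
}

/// Strip one surrounding Markdown code fence, if present.
pub fn strip_fence(text: &str) -> &str {
    let t = text.trim();
    match t.strip_prefix("```") {
        Some(rest) => {
            // Drop the info string (e.g. "json") up to the first newline.
            let body = match rest.find('\n') {
                Some(i) => &rest[i + 1..],
                None => rest,
            };
            let body = body.trim_end();
            body.strip_suffix("```").unwrap_or(body).trim()
        }
        None => t,
    }
}

/// Decide whether the baseline shows structured intent.
/// Assumption: JSON-text intent takes precedence when both are present.
pub fn baseline_intent(baseline: &Response) -> Option<Intent> {
    let t = strip_fence(&baseline.text);
    if t.starts_with('{') || t.starts_with('[') {
        Some(Intent::JsonText)
    } else if !baseline.tool_inputs.is_empty() {
        Some(Intent::ToolUse)
    } else {
        None
    }
}

/// Returns None for excluded pairs, else (baseline_ok, candidate_ok).
pub fn score_pair(
    baseline: &Response,
    candidate: &Response,
    parses: impl Fn(&str) -> bool,
) -> Option<(bool, bool)> {
    match baseline_intent(baseline)? {
        Intent::JsonText => Some((
            parses(strip_fence(&baseline.text)),
            parses(strip_fence(&candidate.text)),
        )),
        Intent::ToolUse => {
            // Compare only the first tool_use on each side (set equality assumed).
            let want = baseline.tool_inputs.first()?;
            Some((true, candidate.tool_inputs.first() == Some(want)))
        }
    }
}

/// Conformance rate per side over counted pairs; None if nothing was counted.
pub fn conformance_rates(scores: &[Option<(bool, bool)>]) -> Option<(f64, f64)> {
    let counted: Vec<(bool, bool)> = scores.iter().flatten().copied().collect();
    if counted.is_empty() {
        return None;
    }
    let n = counted.len() as f64;
    let base = counted.iter().filter(|s| s.0).count() as f64;
    let cand = counted.iter().filter(|s| s.1).count() as f64;
    Some((base / n, cand / n))
}
```

Returning `None` for excluded pairs keeps them out of the denominator, so a candidate is never penalised on pairs where the baseline itself showed no structured intent.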
Functions
- `compute`: Compute the schema-conformance axis.