Reliability: does a program parse/run without ambiguity, and when it fails, is the failure actionable?
Two things drive an agent’s retry-token blowup. First, ambiguity: a form the
model mis-emits and has to redo. Second, dead-end errors: prose like
"error near line 3" tells the model nothing to branch on, so it guesses.
This module aggregates the outcomes of running a program over a set of
representative invocations into a pass rate and an actionable-failure rate
(failures that carry a structured, machine-branchable code + hint).
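The aggregation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's actual definitions: the field names (`passed`, `total`, `actionable_failures`), the helper `aggregate`, and the choice to report 1.0 when there is nothing to measure are all hypothetical. Only the `structured_error` flag on `Outcome` is taken from the documentation.

```rust
/// Hypothetical shape of one classified invocation.
/// `structured_error` is meaningful only when `passed` is false.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Outcome {
    pub passed: bool,
    pub structured_error: bool,
}

/// Hypothetical aggregate over a set of invocations.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ReliabilityReport {
    pub total: usize,
    pub passed: usize,
    pub actionable_failures: usize,
}

impl ReliabilityReport {
    /// Fraction of invocations that passed.
    /// Assumption: an empty set reports 1.0 rather than NaN.
    pub fn pass_rate(&self) -> f64 {
        if self.total == 0 {
            1.0
        } else {
            self.passed as f64 / self.total as f64
        }
    }

    /// Fraction of failures that carried a structured code + hint.
    /// Assumption: with no failures at all, the rate is 1.0 vacuously.
    pub fn actionable_failure_rate(&self) -> f64 {
        let failures = self.total - self.passed;
        if failures == 0 {
            1.0
        } else {
            self.actionable_failures as f64 / failures as f64
        }
    }
}

/// Fold a slice of outcomes into counts; the rates are derived on demand.
pub fn aggregate(outcomes: &[Outcome]) -> ReliabilityReport {
    let passed = outcomes.iter().filter(|o| o.passed).count();
    let actionable_failures = outcomes
        .iter()
        .filter(|o| !o.passed && o.structured_error)
        .count();
    ReliabilityReport {
        total: outcomes.len(),
        passed,
        actionable_failures,
    }
}
```

Storing counts rather than precomputed ratios keeps the report exact and lets a caller merge reports from several runs before computing rates.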
Structs
- ErrorQuality - A graded assessment of how actionable a failure is, refining the binary `structured_error` of `Outcome`. Each component an agent can use to self-correct contributes equally to the score.
- ErrorQualityReport - Aggregate `ErrorQuality` over a set of failures.
- Outcome - The outcome of one invocation, as classified by the caller.
- ReliabilityReport - Aggregate reliability over a set of invocations.
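To make the equal-weighting idea concrete, here is a minimal sketch of a graded error-quality score. The four components shown (`has_code`, `has_hint`, `has_location`, `has_expected`) are hypothetical stand-ins chosen for illustration; the documentation above does not list the components the real struct tracks, only that each one contributes equally.

```rust
/// Hypothetical graded view of one failure. Each flag records whether the
/// error carried a piece of information an agent could branch on.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct ErrorQuality {
    pub has_code: bool,     // assumed component: a stable error code
    pub has_hint: bool,     // assumed component: a suggested fix
    pub has_location: bool, // assumed component: where the failure occurred
    pub has_expected: bool, // assumed component: what was expected instead
}

impl ErrorQuality {
    /// Equal-weight score in [0.0, 1.0]: components present / components total.
    pub fn score(&self) -> f64 {
        let parts = [
            self.has_code,
            self.has_hint,
            self.has_location,
            self.has_expected,
        ];
        let present = parts.iter().filter(|&&p| p).count();
        present as f64 / parts.len() as f64
    }
}
```

Under this sketch, a prose-only message such as "error near line 3" would score 0.0, while an error with a code and a hint but nothing else would score 0.5.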
Functions
- assess_error_quality - Grade each failure with `grade` and aggregate. Pass only the failing cases (the successes carry no error to grade); an empty set scores 1.0 vacuously.
- assess_reliability - Assess reliability by running `run` over each case and aggregating outcomes.
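The grading pass might look something like the sketch below. The signature is an assumption: here `grade` is a caller-supplied closure returning a score in [0.0, 1.0], and the report fields `graded` and `mean_score` are hypothetical. What the sketch does take from the documentation is the contract that only failing cases are passed in and that an empty set scores 1.0 vacuously.

```rust
/// Hypothetical aggregate of per-failure scores.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ErrorQualityReport {
    pub graded: usize,
    pub mean_score: f64,
}

/// Grade each failure with `grade` and average the results.
/// Callers pass only failing cases; an empty slice scores 1.0 vacuously,
/// since there is no unactionable failure to penalize.
pub fn assess_error_quality<F, G>(failures: &[F], grade: G) -> ErrorQualityReport
where
    G: Fn(&F) -> f64,
{
    if failures.is_empty() {
        return ErrorQualityReport {
            graded: 0,
            mean_score: 1.0,
        };
    }
    let total: f64 = failures.iter().map(|f| grade(f)).sum();
    ErrorQualityReport {
        graded: failures.len(),
        mean_score: total / failures.len() as f64,
    }
}
```

Taking `grade` as a closure keeps the aggregation independent of how a particular tool formats its errors, so the same function could score raw strings, parsed diagnostics, or any other failure type.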