LLM-as-judge evaluator for benchmark datasets.
The Evaluator runs each benchmark case against a subject model, then scores the
responses in parallel using a separate judge model. Token-budget enforcement and
concurrency limits are applied per Evaluator::evaluate invocation.
Structs
- CaseScore - Score for a single benchmark case.
- EvalReport - Aggregate evaluation report returned by Evaluator::evaluate.
- Evaluator - Evaluates a subject model against a benchmark dataset using an LLM judge.
- JudgeOutput - Structured output returned by the judge LLM.
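The per-invocation concurrency limit described above can be sketched in plain Rust. The snippet below is a minimal illustration, not the crate's implementation: the CaseScore and EvalReport field names, the score_all function, and the placeholder scoring formula are all assumptions; a real judge-model call would replace the placeholder.

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-ins for CaseScore / EvalReport; field names are assumed.
#[derive(Debug, Clone)]
struct CaseScore {
    case_id: usize,
    score: f64,
}

#[derive(Debug)]
struct EvalReport {
    scores: Vec<CaseScore>,
    mean_score: f64,
}

// Score all cases using at most `max_concurrency` worker threads,
// mimicking the per-invocation concurrency limit described above.
fn score_all(cases: Vec<usize>, max_concurrency: usize) -> EvalReport {
    let (tx, rx) = mpsc::channel();
    // Split the cases into one chunk per worker.
    let chunk_size = ((cases.len() + max_concurrency - 1) / max_concurrency).max(1);
    let mut handles = Vec::new();
    for chunk in cases.chunks(chunk_size).map(|c| c.to_vec()) {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for case_id in chunk {
                // Placeholder score: a real judge-model call goes here.
                let score = 1.0 / (case_id as f64 + 1.0);
                tx.send(CaseScore { case_id, score }).unwrap();
            }
        }));
    }
    drop(tx); // close the channel so the receiver loop terminates
    let mut scores: Vec<CaseScore> = rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    scores.sort_by_key(|s| s.case_id);
    let mean_score = scores.iter().map(|s| s.score).sum::<f64>() / scores.len() as f64;
    EvalReport { scores, mean_score }
}

fn main() {
    let report = score_all((0..4).collect(), 2);
    println!("scored {} cases, mean {:.3}", report.scores.len(), report.mean_score);
}
```

A token budget would be enforced similarly: each worker decrements a shared counter (for example an AtomicUsize) before issuing a judge call and stops when the budget is exhausted.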