pub struct Evaluator { /* private fields */ }
Evaluates a subject model against a benchmark dataset using an LLM judge.
Evaluator runs each BenchmarkCase against a subject model to obtain a
response, then scores all responses in parallel using a separate judge model.
The judge is prompted to return a JudgeOutput with a score in [1, 10].
§Token Budget
A cumulative token budget is enforced across all judge calls in a single
evaluate invocation. When the budget is exceeded the report has
is_partial = true and the remaining futures are drained (any that already
completed successfully are included in the scores).
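The budget rule above can be sketched as a running counter that is charged after each judge call. This is a sequential simplification under stated assumptions, not the crate's implementation: `TokenBudget`, `Summary`, and `score_within_budget` are hypothetical names, and each call is modeled as a `(score, tokens)` pair.

```rust
/// Hypothetical tracker for a cumulative token budget shared by all judge calls.
pub struct TokenBudget {
    limit: u64,
    used: u64,
}

impl TokenBudget {
    pub fn new(limit: u64) -> Self {
        Self { limit, used: 0 }
    }

    /// Records `tokens` and returns `true` while the running total stays within the limit.
    pub fn charge(&mut self, tokens: u64) -> bool {
        self.used = self.used.saturating_add(tokens);
        self.used <= self.limit
    }
}

/// Hypothetical summary mirroring the `is_partial` flag described above.
pub struct Summary {
    pub scores: Vec<f64>,
    pub is_partial: bool,
}

/// Scores calls in order until the budget is exceeded.
/// A call that already completed keeps its score even if it pushed the
/// total over the limit; calls after that point are never started.
pub fn score_within_budget(calls: &[(f64, u64)], limit: u64) -> Summary {
    let mut budget = TokenBudget::new(limit);
    let mut scores = Vec::new();
    let mut is_partial = false;

    for &(score, tokens) in calls {
        if is_partial {
            break; // budget already exceeded: skip the remaining calls
        }
        scores.push(score);
        if !budget.charge(tokens) {
            is_partial = true;
        }
    }

    Summary { scores, is_partial }
}
```

With a limit of 100 tokens and four calls costing 40 tokens each, the third call crosses the limit, so the sketch keeps three scores and marks the summary as partial.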
§Concurrency
Subject calls are sequential; judge calls are parallelized up to
parallel_evals (default: 3) via a tokio semaphore.
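The bounded parallelism described above can be illustrated with a std-only analog. The sketch below swaps the tokio semaphore for a small counting semaphore built from `Mutex` and `Condvar`, and swaps async tasks for scoped threads; `Semaphore` and `judge_all` are hypothetical names, and the judge is modeled as a plain closure.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (std-only stand-in for a tokio semaphore).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `judge` over every response with at most `limit` calls in flight.
/// Returns the scores in input order and the peak concurrency observed.
/// Note: a panicking `judge` would leak its permit; a real implementation
/// would release through a guard.
pub fn judge_all<F>(responses: &[String], limit: usize, judge: F) -> (Vec<f64>, usize)
where
    F: Fn(&str) -> f64 + Sync,
{
    let semaphore = Semaphore::new(limit.max(1));
    let in_flight = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    let scores = thread::scope(|scope| {
        let handles: Vec<_> = responses
            .iter()
            .map(|response| {
                let (semaphore, in_flight, peak, judge) =
                    (&semaphore, &in_flight, &peak, &judge);
                scope.spawn(move || {
                    semaphore.acquire();
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    let score = judge(response.as_str());
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                    semaphore.release();
                    score
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect::<Vec<f64>>()
    });

    (scores, peak.load(Ordering::SeqCst))
}
```

The peak counter exists only to make the bound observable; with `limit` set to 3 it should never report more than three judge calls running at once.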
§Examples
let judge = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![
    r#"{"score": 8.0, "reason": "mostly correct"}"#.into(),
])));
let subject = AnyProvider::Mock(MockProvider::with_responses(vec!["42".into()]));
let benchmark = BenchmarkSet {
    cases: vec![BenchmarkCase {
        prompt: "What is 6×7?".into(),
        context: None,
        reference: Some("42".into()),
        tags: None,
    }],
};
let evaluator = Evaluator::new(judge, benchmark, 50_000)?;
let report = evaluator.evaluate(&subject).await?;
assert_eq!(report.cases_scored, 1);
Implementations§
impl Evaluator
pub fn new(
    judge: Arc<AnyProvider>,
    benchmark: BenchmarkSet,
    budget_tokens: u64,
) -> Result<Self, EvalError>
pub fn with_parallel_evals(self, n: usize) -> Self
Override the default concurrency limit for judge calls.
The default is 3. A value of 0 is silently promoted to 1, so at least one judge call can always run.
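The clamping rule can be sketched with a builder that takes `n.max(1)`. `EvalConfig` is a hypothetical stand-in used only to isolate the concurrency setting; the real `Evaluator` has private fields that are not shown in this documentation.

```rust
/// Hypothetical stand-in holding only the concurrency limit.
pub struct EvalConfig {
    parallel_evals: usize,
}

impl Default for EvalConfig {
    fn default() -> Self {
        // Documented default concurrency limit for judge calls.
        Self { parallel_evals: 3 }
    }
}

impl EvalConfig {
    /// Overrides the limit; 0 is promoted to 1 so progress is always possible.
    pub fn with_parallel_evals(mut self, n: usize) -> Self {
        self.parallel_evals = n.max(1);
        self
    }

    pub fn parallel_evals(&self) -> usize {
        self.parallel_evals
    }
}
```

Clamping rather than returning an error keeps the builder chainable, at the cost of hiding a likely misconfiguration from the caller.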
§Examples
let judge = Arc::new(AnyProvider::Mock(MockProvider::with_responses(vec![])));
let benchmark = BenchmarkSet {
    cases: vec![BenchmarkCase {
        prompt: "hi".into(), context: None, reference: None, tags: None,
    }],
};
let evaluator = Evaluator::new(judge, benchmark, 10_000)?.with_parallel_evals(5);
pub async fn evaluate(
    &self,
    subject: &AnyProvider,
) -> Result<EvalReport, EvalError>
Run the full benchmark against subject, returning aggregate scores.
Subject calls are sequential; judge calls are parallelized up to
parallel_evals concurrent tasks. A per-invocation token budget is
enforced across all judge calls.
§Errors
Returns EvalError::Llm if any subject call fails fatally.
Budget exhaustion and judge errors do not fail the call; the affected cases are excluded from the scores.
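The error policy above can be sketched as an aggregation step that counts only successfully judged cases. `JudgeOutcome`, `Aggregate`, and `mean_score` are hypothetical names for illustration; only `cases_scored` appears in the documented report.

```rust
/// Hypothetical per-case result of a judge call.
pub enum JudgeOutcome {
    Scored(f64),
    JudgeError(String),
    BudgetExhausted,
}

/// Hypothetical aggregate; `cases_scored` mirrors the field used in the example above.
pub struct Aggregate {
    pub cases_scored: usize,
    pub mean_score: Option<f64>,
}

/// Averages the successful scores and skips judge errors and budget-exhausted cases.
pub fn aggregate(outcomes: &[JudgeOutcome]) -> Aggregate {
    let scores: Vec<f64> = outcomes
        .iter()
        .filter_map(|outcome| match outcome {
            JudgeOutcome::Scored(score) => Some(*score),
            JudgeOutcome::JudgeError(_) | JudgeOutcome::BudgetExhausted => None,
        })
        .collect();

    let mean_score = if scores.is_empty() {
        None // nothing was scored, so no mean is reported
    } else {
        Some(scores.iter().sum::<f64>() / scores.len() as f64)
    };

    Aggregate { cases_scored: scores.len(), mean_score }
}
```

Returning `None` for the mean when nothing was scored avoids a division by zero and lets the caller distinguish "no data" from a low score.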
Auto Trait Implementations§
impl Freeze for Evaluator
impl !RefUnwindSafe for Evaluator
impl Send for Evaluator
impl Sync for Evaluator
impl Unpin for Evaluator
impl UnsafeUnpin for Evaluator
impl !UnwindSafe for Evaluator
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.
impl<T> IntoRequest<T> for T
fn into_request(self) -> Request<T>
Wraps the input message T in a tonic::Request.