scouter_evaluate::evaluate::compare

Function compare_results

pub fn compare_results(
    baseline: &GenAIEvalResults,
    comparison: &GenAIEvalResults,
    regression_threshold: f64,
) -> Result<ComparisonResults, EvaluationError>

Compares two GenAIEvalResults datasets and produces a ComparisonResults summary.

Each workflow in comparison is checked against its counterpart in baseline. The comparison identifies:

- Tasks that passed/failed in both datasets
- Tasks that changed status between baseline and comparison
- Tasks missing in either dataset
- Overall pass rate deltas and regression detection

§Arguments

baseline - The baseline evaluation results to compare against
comparison - The evaluation results being compared
regression_threshold - Pass rate delta threshold (f64) used to decide whether a workflow is flagged as a regression
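
The page does not say how regression_threshold is applied. One plausible reading, shown here only as a sketch, is that a workflow regresses when its pass rate drops by more than the threshold. The helper is_regression and its parameter names are hypothetical and not part of the crate. Whether the real check is strict or inclusive at the boundary is also an assumption.

```rust
/// Hypothetical helper, not part of scouter_evaluate.
/// Assumption: pass rates are fractions in [0.0, 1.0] and a regression is
/// a drop in pass rate strictly larger than `regression_threshold`.
fn is_regression(
    baseline_pass_rate: f64,
    comparison_pass_rate: f64,
    regression_threshold: f64,
) -> bool {
    // A negative delta means the comparison run did worse than the baseline.
    let delta = comparison_pass_rate - baseline_pass_rate;
    delta < -regression_threshold
}
```

Under this reading, a threshold of 0.05 tolerates a drop from 0.90 to 0.88 but flags a drop from 0.90 to 0.80.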

§Returns

A ComparisonResults struct containing workflow comparisons, task-level changes, and aggregate statistics.

§Errors

Returns EvaluationError if comparison processing fails.

§Algorithm

1. Map baseline and comparison results by record_uid, filtering for successful runs
2. For each record present in both datasets:
   - Build task maps keyed by task_id
   - Compare task pass/fail status for all matched tasks
   - Track tasks only in baseline or comparison
3. Aggregate workflow-level statistics (pass rates, deltas, regressions)
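
The steps above can be sketched with simplified stand-in types. Record, WorkflowDiff, index_by_uid, pass_rate, and compare_sketch are all hypothetical names invented for this illustration; the real GenAIEvalResults and ComparisonResults types are not shown on this page and will differ. The sketch also assumes that each pass rate is computed over all tasks in its own dataset and that a regression is a pass-rate drop larger than the threshold.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one workflow record in an evaluation run.
#[derive(Debug, Clone)]
struct Record {
    record_uid: String,
    success: bool,              // whether the run itself completed
    tasks: Vec<(String, bool)>, // (task_id, passed)
}

/// Hypothetical per-workflow summary.
#[derive(Debug, Default, PartialEq)]
struct WorkflowDiff {
    both_passed: usize,
    both_failed: usize,
    changed: Vec<String>,
    baseline_only: Vec<String>,
    comparison_only: Vec<String>,
    pass_rate_delta: f64,
    is_regression: bool,
}

/// Step 1: index records by record_uid, keeping successful runs only.
fn index_by_uid(records: &[Record]) -> BTreeMap<&str, &Record> {
    records
        .iter()
        .filter(|r| r.success)
        .map(|r| (r.record_uid.as_str(), r))
        .collect()
}

/// Fraction of passing tasks; an empty task map counts as 0.0 (assumption).
fn pass_rate(tasks: &BTreeMap<&str, bool>) -> f64 {
    if tasks.is_empty() {
        return 0.0;
    }
    tasks.values().filter(|p| **p).count() as f64 / tasks.len() as f64
}

fn compare_sketch(
    baseline: &[Record],
    comparison: &[Record],
    regression_threshold: f64,
) -> BTreeMap<String, WorkflowDiff> {
    let base = index_by_uid(baseline);
    let comp = index_by_uid(comparison);
    let mut out = BTreeMap::new();

    // Step 2: only records present in both datasets are compared.
    for (uid, b) in &base {
        let Some(c) = comp.get(uid) else { continue };

        // Task maps keyed by task_id.
        let bt: BTreeMap<&str, bool> =
            b.tasks.iter().map(|(id, p)| (id.as_str(), *p)).collect();
        let ct: BTreeMap<&str, bool> =
            c.tasks.iter().map(|(id, p)| (id.as_str(), *p)).collect();

        let mut diff = WorkflowDiff::default();
        for (id, bp) in &bt {
            match ct.get(id) {
                Some(cp) if *bp && *cp => diff.both_passed += 1,
                Some(cp) if !*bp && !*cp => diff.both_failed += 1,
                Some(_) => diff.changed.push(id.to_string()),
                None => diff.baseline_only.push(id.to_string()),
            }
        }
        diff.comparison_only = ct
            .keys()
            .filter(|id| !bt.contains_key(*id))
            .map(|id| id.to_string())
            .collect();

        // Step 3: workflow-level statistics.
        diff.pass_rate_delta = pass_rate(&ct) - pass_rate(&bt);
        diff.is_regression = diff.pass_rate_delta < -regression_threshold;
        out.insert(uid.to_string(), diff);
    }
    out
}
```

Ordered maps (BTreeMap) are used here so the output order is deterministic, which makes the result easier to diff and test. A hash map would work equally well if ordering does not matter.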