pub fn compare_results(
baseline: &GenAIEvalResults,
comparison: &GenAIEvalResults,
regression_threshold: f64,
) -> Result<ComparisonResults, EvaluationError>Expand description
Compares two GenAIEvalResults datasets and produces a ComparisonResults summary.
Every workflow is compared against the baseline. The comparison identifies:
- Tasks that passed/failed in both datasets
- Tasks that changed status between baseline and comparison
- Tasks missing in either dataset
- Overall pass rate deltas and regression detection
§Arguments
baseline- The baseline evaluation results to compare againstcomparison- The evaluation results being comparedregression_threshold- Pass rate delta threshold to flag as regression
§Returns
A ComparisonResults struct containing workflow comparisons, task-level changes,
and aggregate statistics.
§Errors
Returns EvaluationError if comparison processing fails.
§Algorithm
- Map baseline and comparison results by
record_uid, filtering for successful runs - For each record present in both datasets:
- Build task maps keyed by
task_id - Compare task pass/fail status for all matched tasks
- Track tasks only in baseline or comparison
- Build task maps keyed by
- Aggregate workflow-level statistics (pass rates, deltas, regressions)