Function compare_results 

pub fn compare_results(
    baseline: &GenAIEvalResults,
    comparison: &GenAIEvalResults,
    regression_threshold: f64,
) -> Result<ComparisonResults, EvaluationError>

Compares two GenAIEvalResults datasets and produces a ComparisonResults summary.

Each workflow in the comparison dataset is matched against its counterpart in the baseline. The function identifies:

  • Tasks that passed/failed in both datasets
  • Tasks that changed status between baseline and comparison
  • Tasks missing in either dataset
  • Overall pass rate deltas and regression detection
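As an illustration of the task-level categories listed above, here is a minimal sketch. The `TaskChange` enum and `classify` function are hypothetical names invented for this example and are not part of the crate's API; the sketch assumes each task carries a simple pass/fail flag, with `None` standing for a task absent from a dataset.

```rust
/// Hypothetical classification of one task across the two datasets.
#[derive(Debug, PartialEq, Eq)]
enum TaskChange {
    PassedInBoth,
    FailedInBoth,
    /// Passed in the baseline, failed in the comparison.
    Regressed,
    /// Failed in the baseline, passed in the comparison.
    Improved,
    MissingInComparison,
    MissingInBaseline,
}

/// `Some(passed)` carries a task's status; `None` means the task is absent
/// from that dataset. Returns `None` when the task appears in neither.
fn classify(baseline: Option<bool>, comparison: Option<bool>) -> Option<TaskChange> {
    match (baseline, comparison) {
        (Some(true), Some(true)) => Some(TaskChange::PassedInBoth),
        (Some(false), Some(false)) => Some(TaskChange::FailedInBoth),
        (Some(true), Some(false)) => Some(TaskChange::Regressed),
        (Some(false), Some(true)) => Some(TaskChange::Improved),
        (Some(_), None) => Some(TaskChange::MissingInComparison),
        (None, Some(_)) => Some(TaskChange::MissingInBaseline),
        (None, None) => None,
    }
}
```

Matching on the pair of `Option<bool>` values makes the compiler check that every combination of present, absent, passed, and failed is handled.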

§Arguments

  • baseline - The baseline evaluation results to compare against
  • comparison - The evaluation results to compare against the baseline
  • regression_threshold - Pass rate delta threshold beyond which a change is flagged as a regression
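To make the role of regression_threshold concrete, the sketch below shows one plausible reading. It assumes pass rates are fractions in [0.0, 1.0] and that a regression means the pass rate dropped by more than the threshold; the source does not spell out the exact rule, and `pass_rate` and `is_regression` are hypothetical helpers, not functions of this crate.

```rust
/// Pass rate as a fraction in [0.0, 1.0]; an empty task set yields 0.0.
fn pass_rate(passed: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        passed as f64 / total as f64
    }
}

/// Assumed rule: flag a regression when the comparison pass rate falls
/// below the baseline pass rate by more than `regression_threshold`.
fn is_regression(baseline_rate: f64, comparison_rate: f64, regression_threshold: f64) -> bool {
    let delta = comparison_rate - baseline_rate;
    delta < -regression_threshold
}
```

Under this reading, a drop from 9 of 10 passing tasks to 7 of 10 with a threshold of 0.05 is flagged, while a drop from 0.90 to 0.88 is not, and improvements are never flagged.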

§Returns

A ComparisonResults struct containing workflow comparisons, task-level changes, and aggregate statistics.

§Errors

Returns EvaluationError if comparison processing fails.

§Algorithm

  1. Map baseline and comparison results by record_uid, keeping only successful runs
  2. For each record present in both datasets:
    • Build task maps keyed by task_id
    • Compare task pass/fail status for all matched tasks
    • Track tasks only in baseline or comparison
  3. Aggregate workflow-level statistics (pass rates, deltas, regressions)
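
The steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not the crate's implementation: `Record`, `TaskResult`, `WorkflowDiff`, and every function here are hypothetical stand-ins, and only the `record_uid` and `task_id` keys come from the description above.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the real result types.
struct TaskResult { task_id: String, passed: bool }
struct Record { record_uid: String, success: bool, tasks: Vec<TaskResult> }

#[derive(Debug, Default, PartialEq)]
struct WorkflowDiff {
    matched: usize,
    baseline_passed: usize,
    comparison_passed: usize,
    changed: Vec<String>,
    only_in_baseline: Vec<String>,
    only_in_comparison: Vec<String>,
}

impl WorkflowDiff {
    /// Step 3: pass rate delta over matched tasks (comparison minus baseline).
    fn pass_rate_delta(&self) -> f64 {
        if self.matched == 0 { return 0.0; }
        let n = self.matched as f64;
        self.comparison_passed as f64 / n - self.baseline_passed as f64 / n
    }
}

/// Step 1: index records by record_uid, keeping only successful runs.
fn index_successful(records: &[Record]) -> HashMap<&str, &Record> {
    records.iter().filter(|r| r.success).map(|r| (r.record_uid.as_str(), r)).collect()
}

/// Step 2a: build a task map keyed by task_id.
fn task_map(record: &Record) -> HashMap<&str, bool> {
    record.tasks.iter().map(|t| (t.task_id.as_str(), t.passed)).collect()
}

/// Steps 2b and 2c: compare matched tasks and track unmatched ones.
fn diff_record(baseline: &Record, comparison: &Record) -> WorkflowDiff {
    let (base, comp) = (task_map(baseline), task_map(comparison));
    let mut diff = WorkflowDiff::default();
    for (id, &b_passed) in &base {
        match comp.get(id) {
            Some(&c_passed) => {
                diff.matched += 1;
                diff.baseline_passed += b_passed as usize;
                diff.comparison_passed += c_passed as usize;
                if b_passed != c_passed { diff.changed.push(id.to_string()); }
            }
            None => diff.only_in_baseline.push(id.to_string()),
        }
    }
    for id in comp.keys() {
        if !base.contains_key(id) { diff.only_in_comparison.push(id.to_string()); }
    }
    // HashMap iteration order is unspecified, so sort for stable output.
    diff.changed.sort();
    diff.only_in_baseline.sort();
    diff.only_in_comparison.sort();
    diff
}

/// Diff every record present in both datasets, sorted by record_uid.
fn compare(baseline: &[Record], comparison: &[Record]) -> Vec<(String, WorkflowDiff)> {
    let (base, comp) = (index_successful(baseline), index_successful(comparison));
    let mut out: Vec<(String, WorkflowDiff)> = base
        .iter()
        .filter_map(|(uid, b)| comp.get(uid).map(|c| (uid.to_string(), diff_record(b, c))))
        .collect();
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}
```

Keying both sides by hash map keeps each lookup constant time on average, so the cost of the join grows with the total number of records and tasks.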