Module evaluation

Module evaluation 

Source
Expand description

Evaluation and benchmarking utilities.

§Evaluation Module

Provides quality metrics for AI REASONING EVALUATION.

ReasonKit measures whether AI THINKS BETTER with ThinkTool protocols.

§Core Metrics (Tier 1 - Reasoning Quality)

  • Accuracy: Correct answers on reasoning benchmarks (GSM8K, MATH, ARC-C)
  • Improvement Delta: Performance with vs without ThinkTools
  • Self-Consistency: Same answer across multiple runs
  • Calibration: Confidence matches accuracy

§ThinkTool Metrics (Tier 2)

  • GigaThink: Perspective count, coverage, novelty
  • LaserLogic: Validity rate, fallacy detection
  • BedRock: Decomposition depth, axiom validity
  • ProofGuard: Triangulation rate, contradiction detection
  • BrutalHonesty: Flaw detection rate, improvement suggestions

§Supporting Metrics (Tier 5)

  • Recall@K: For source retrieval (ProofGuard support)
  • Latency: Performance measurements

§Usage

use reasonkit::evaluation::{ReasoningMetrics, BenchmarkResult};

// Run GSM8K benchmark with --balanced profile
let baseline = run_benchmark("gsm8k", None);
let treatment = run_benchmark("gsm8k", Some("balanced"));

let improvement = treatment.accuracy - baseline.accuracy;
println!("Improvement: {:.1}%", improvement * 100.0);

Re-exports§

pub use metrics::average_precision;
pub use metrics::mean_average_precision;
pub use metrics::mean_reciprocal_rank;
pub use metrics::ndcg_at_k;
pub use metrics::precision_at_k;
pub use metrics::recall_at_k;
pub use metrics::EvaluationResult;
pub use metrics::QueryResult;
pub use metrics::RetrievalMetrics;
pub use reasoning::BedRockMetrics;
pub use reasoning::BenchmarkResult;
pub use reasoning::BrutalHonestyMetrics;
pub use reasoning::CalibrationMetrics;
pub use reasoning::ConsistencyMetrics;
pub use reasoning::GigaThinkMetrics;
pub use reasoning::LaserLogicMetrics;
pub use reasoning::Profile;
pub use reasoning::ProofGuardMetrics;
pub use reasoning::ReasoningMetrics;
pub use reasoning::ThinkToolMetrics;

Modules§

metrics
Core retrieval evaluation metrics.
reasoning
Reasoning Quality Metrics

Structs§

ReasoningEvalConfig
Reasoning evaluation configuration
ReasoningEvalSummary
Summary of reasoning evaluation
ReasoningTargets
Reasoning quality targets for release gates
RetrievalEvalConfig
Retrieval evaluation configuration (Tier 5 - Supporting)
TargetResult
Result of target check

Functions§

evaluate_reasoning
Run full reasoning evaluation