Evaluation and benchmarking utilities.
§Evaluation Module
Provides quality metrics for evaluating AI reasoning.
ReasonKit measures whether a model reasons better with ThinkTool protocols than without them.
§Core Metrics (Tier 1 - Reasoning Quality)
- Accuracy: Correct answers on reasoning benchmarks (GSM8K, MATH, ARC-C)
- Improvement Delta: Performance with vs without ThinkTools
- Self-Consistency: Same answer across multiple runs
- Calibration: Confidence matches accuracy
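The four Tier 1 metrics above can be sketched in a few lines each. This is a minimal illustration, not the crate's implementation: every function name below is hypothetical, answers are assumed to be compared by exact string match, self-consistency is assumed to mean agreement with the majority answer, and calibration is assumed to be measured as expected calibration error (ECE) over equal-width confidence bins.

```rust
use std::collections::HashMap;

/// Fraction of predictions that exactly match the reference answers.
/// (Hypothetical helper; exact string match is an assumption.)
fn accuracy(predictions: &[&str], gold: &[&str]) -> f64 {
    assert_eq!(predictions.len(), gold.len(), "inputs must have equal length");
    if gold.is_empty() {
        return 0.0;
    }
    let correct = predictions.iter().zip(gold).filter(|(p, g)| p == g).count();
    correct as f64 / gold.len() as f64
}

/// Accuracy with ThinkTools minus accuracy without them.
/// Positive values mean the treatment helped.
fn improvement_delta(baseline_accuracy: f64, treatment_accuracy: f64) -> f64 {
    treatment_accuracy - baseline_accuracy
}

/// Share of runs that agree with the most frequent answer.
fn self_consistency(answers: &[&str]) -> f64 {
    if answers.is_empty() {
        return 0.0;
    }
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &a in answers {
        *counts.entry(a).or_insert(0) += 1;
    }
    let top = counts.values().copied().max().unwrap_or(0);
    top as f64 / answers.len() as f64
}

/// Expected calibration error: the sample-weighted mean gap between
/// average confidence and observed accuracy within each confidence bin.
fn expected_calibration_error(confidences: &[f64], correct: &[bool], bins: usize) -> f64 {
    assert_eq!(confidences.len(), correct.len(), "inputs must have equal length");
    assert!(bins > 0, "at least one bin is required");
    let n = confidences.len();
    if n == 0 {
        return 0.0;
    }
    // Per bin: (sum of confidences, number of correct answers, sample count).
    let mut acc = vec![(0.0_f64, 0.0_f64, 0_usize); bins];
    for (&c, &ok) in confidences.iter().zip(correct) {
        let idx = ((c.max(0.0).min(1.0) * bins as f64) as usize).min(bins - 1);
        acc[idx].0 += c;
        acc[idx].1 += if ok { 1.0 } else { 0.0 };
        acc[idx].2 += 1;
    }
    let mut ece = 0.0;
    for &(conf_sum, hits, count) in &acc {
        if count > 0 {
            let k = count as f64;
            ece += (k / n as f64) * (conf_sum / k - hits / k).abs();
        }
    }
    ece
}
```

A lower ECE means confidence tracks accuracy more closely. A model that claims 0.9 confidence and is right 90% of the time contributes nothing to the error for that bin.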
§ThinkTool Metrics (Tier 2)
- GigaThink: Perspective count, coverage, novelty
- LaserLogic: Validity rate, fallacy detection
- BedRock: Decomposition depth, axiom validity
- ProofGuard: Triangulation rate, contradiction detection
- BrutalHonesty: Flaw detection rate, improvement suggestions
§Supporting Metrics (Tier 5)
- Recall@K: For source retrieval (ProofGuard support)
- Latency: Performance measurements
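As a rough illustration of the two supporting metrics, the sketch below computes Recall@K for one query and times an arbitrary closure. The names `recall_at_k_sketch` and `timed` are hypothetical and were chosen to avoid any claim about the signature of the crate's own `recall_at_k`. Documents are assumed to be identified by string IDs.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

/// Recall@K: the fraction of all relevant documents that appear in the
/// top `k` retrieved results. Duplicate IDs in `retrieved` count once.
/// (Hypothetical helper, not the crate's `recall_at_k`.)
fn recall_at_k_sketch(retrieved: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits: HashSet<&str> = retrieved
        .iter()
        .take(k)
        .copied()
        .filter(|d| relevant.contains(d))
        .collect();
    hits.len() as f64 / relevant.len() as f64
}

/// Runs `f` once and returns its result with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

With four relevant documents and two of them in the top three results, Recall@3 is 0.5. A single timing is noisy, so a real latency measurement would normally repeat the call and report a median or percentile.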
§Usage
use reasonkit::evaluation::{ReasoningMetrics, BenchmarkResult};
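// Run GSM8K twice: without ThinkTools (baseline) and with the `balanced` profile (treatment).
// NOTE: `run_benchmark` is not among the items listed on this page; treat it as a placeholder.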
let baseline = run_benchmark("gsm8k", None);
let treatment = run_benchmark("gsm8k", Some("balanced"));
let improvement = treatment.accuracy - baseline.accuracy;
println!("Improvement: {:.1}%", improvement * 100.0);
Re-exports§
pub use metrics::average_precision;
pub use metrics::mean_average_precision;
pub use metrics::mean_reciprocal_rank;
pub use metrics::ndcg_at_k;
pub use metrics::precision_at_k;
pub use metrics::recall_at_k;
pub use metrics::EvaluationResult;
pub use metrics::QueryResult;
pub use metrics::RetrievalMetrics;
pub use reasoning::BedRockMetrics;
pub use reasoning::BenchmarkResult;
pub use reasoning::BrutalHonestyMetrics;
pub use reasoning::CalibrationMetrics;
pub use reasoning::ConsistencyMetrics;
pub use reasoning::GigaThinkMetrics;
pub use reasoning::LaserLogicMetrics;
pub use reasoning::Profile;
pub use reasoning::ProofGuardMetrics;
pub use reasoning::ReasoningMetrics;
pub use reasoning::ThinkToolMetrics;
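Two of the re-exported retrieval metrics, precision at K and mean reciprocal rank, follow standard definitions that can be sketched as below. The signatures of the crate's own `precision_at_k` and `mean_reciprocal_rank` are not shown on this page, so the `_sketch` functions here are hypothetical stand-ins. They assume string document IDs with no duplicates in a result list.

```rust
use std::collections::HashSet;

/// Precision@K: the fraction of the top `k` slots filled by relevant
/// documents. The denominator is `k` even if fewer results were returned.
fn precision_at_k_sketch(retrieved: &[&str], relevant: &HashSet<&str>, k: usize) -> f64 {
    if k == 0 {
        return 0.0;
    }
    let hits = retrieved
        .iter()
        .take(k)
        .filter(|d| relevant.contains(*d))
        .count();
    hits as f64 / k as f64
}

/// Reciprocal rank: 1 / (1-based position of the first relevant result),
/// or 0.0 when no relevant document was retrieved.
fn reciprocal_rank_sketch(retrieved: &[&str], relevant: &HashSet<&str>) -> f64 {
    retrieved
        .iter()
        .position(|d| relevant.contains(d))
        .map_or(0.0, |i| 1.0 / (i as f64 + 1.0))
}

/// Mean reciprocal rank over a set of (retrieved, relevant) query pairs.
fn mean_reciprocal_rank_sketch(queries: &[(Vec<&str>, HashSet<&str>)]) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    let total: f64 = queries
        .iter()
        .map(|(retrieved, relevant)| reciprocal_rank_sketch(retrieved, relevant))
        .sum();
    total / queries.len() as f64
}
```

Precision@K rewards a clean top of the ranking, while reciprocal rank only looks at how early the first relevant result appears, so the two can disagree on the same result list.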
Modules§
Structs§
- ReasoningEvalConfig - Reasoning evaluation configuration
- ReasoningEvalSummary - Summary of reasoning evaluation
- ReasoningTargets - Reasoning quality targets for release gates
- RetrievalEvalConfig - Retrieval evaluation configuration (Tier 5 - Supporting)
- TargetResult - Result of target check
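The idea of quality targets acting as release gates can be illustrated as follows. This page does not show the fields of `ReasoningTargets` or `TargetResult`, so the types, field names, and thresholds below are all hypothetical. The sketch only shows the general pattern: compare each measured metric with a minimum target and pass the gate only if every check passes.

```rust
/// Hypothetical minimum thresholds (not the crate's `ReasoningTargets`).
#[derive(Debug, Clone, Copy)]
struct GateTargets {
    min_accuracy: f64,
    min_improvement_delta: f64,
    min_self_consistency: f64,
}

/// Hypothetical measured values from one evaluation run.
#[derive(Debug, Clone, Copy)]
struct Measured {
    accuracy: f64,
    improvement_delta: f64,
    self_consistency: f64,
}

/// Outcome of comparing one metric with its target
/// (hypothetical; not the crate's `TargetResult`).
#[derive(Debug, Clone, PartialEq)]
struct GateResult {
    metric: &'static str,
    actual: f64,
    target: f64,
    passed: bool,
}

/// Checks every metric against its minimum and returns one row per metric.
fn check_targets(measured: &Measured, targets: &GateTargets) -> Vec<GateResult> {
    let rows = [
        ("accuracy", measured.accuracy, targets.min_accuracy),
        ("improvement_delta", measured.improvement_delta, targets.min_improvement_delta),
        ("self_consistency", measured.self_consistency, targets.min_self_consistency),
    ];
    rows.iter()
        .map(|&(metric, actual, target)| GateResult {
            metric,
            actual,
            target,
            passed: actual >= target,
        })
        .collect()
}

/// The gate passes only when every individual check passed.
fn gate_passes(results: &[GateResult]) -> bool {
    results.iter().all(|r| r.passed)
}
```

Returning one row per metric, instead of a single boolean, keeps the failing metric and its shortfall visible in a release report.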
Functions§
- evaluate_reasoning - Run full reasoning evaluation