Module benchmark

reasonkit::thinktool


§Benchmark Harness for ThinkTools Evaluation

Provides infrastructure for measuring reasoning-quality improvements against established benchmarks (GSM8K, MATH, TruthfulQA, etc.).

§Supported Benchmarks

Benchmark	Type	Metric	Target
GSM8K	Math reasoning	Accuracy	85.9%
MATH	Advanced math	Accuracy	36.5%
TruthfulQA	Factuality	MC1/MC2	72%
Game of 24	Creative	Success rate	60%+
ARC-C	Science	Accuracy	90%
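
Most rows in the table use accuracy as the metric. As a rough illustration of how such a score could be computed and compared with a target, here is a minimal sketch. The function names `accuracy_percent` and `meets_target` are hypothetical and are not part of this module's API; exact string matching after trimming is an assumption, and the real harness may normalize answers differently.

```rust
/// Hypothetical helper (not part of this crate): percentage of predictions
/// that exactly match the reference answers after trimming whitespace.
fn accuracy_percent(predicted: &[&str], expected: &[&str]) -> f64 {
    assert_eq!(predicted.len(), expected.len(), "one prediction per problem");
    if expected.is_empty() {
        return 0.0;
    }
    let correct = predicted
        .iter()
        .zip(expected)
        .filter(|(p, e)| p.trim() == e.trim())
        .count();
    100.0 * correct as f64 / expected.len() as f64
}

/// Hypothetical check against a target percentage from the table,
/// e.g. 85.9 for GSM8K.
fn meets_target(accuracy: f64, target: f64) -> bool {
    accuracy >= target
}
```

For example, three correct answers out of four give an accuracy of 75%, which would fall short of the 85.9% GSM8K target.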

Modules§

gsm8k: GSM8K-specific loader

Structs§

BenchmarkProblem: Benchmark problem from evaluation set
BenchmarkResults: Aggregate benchmark results
BenchmarkRunner: Benchmark runner
CalibrationMetrics: Calibration metrics for confidence assessment
ComparisonReport
ConfidenceBin
EvaluationResult: Result of evaluating a single problem
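
The fields of `CalibrationMetrics` and `ConfidenceBin` are not listed on this page. To illustrate what binned calibration typically involves, the sketch below computes expected calibration error (ECE), one common calibration measure, over equal-width confidence bins. `SketchBin` and `expected_calibration_error` are hypothetical names, and whether this crate computes ECE specifically is an assumption.

```rust
/// Hypothetical stand-in for one confidence bin; NOT the crate's `ConfidenceBin`.
#[derive(Debug, Clone, Default)]
struct SketchBin {
    count: usize,
    confidence_sum: f64,
    correct: usize,
}

/// Expected calibration error over `num_bins` equal-width bins on [0, 1].
/// Each sample is (stated confidence, whether the answer was correct).
fn expected_calibration_error(samples: &[(f64, bool)], num_bins: usize) -> f64 {
    if samples.is_empty() || num_bins == 0 {
        return 0.0;
    }
    let mut bins = vec![SketchBin::default(); num_bins];
    for &(confidence, correct) in samples {
        let c = confidence.clamp(0.0, 1.0);
        // A confidence of exactly 1.0 is placed in the last bin.
        let idx = ((c * num_bins as f64) as usize).min(num_bins - 1);
        bins[idx].count += 1;
        bins[idx].confidence_sum += c;
        if correct {
            bins[idx].correct += 1;
        }
    }
    let total = samples.len() as f64;
    // Weighted average of |bin accuracy - bin mean confidence|.
    bins.iter()
        .filter(|b| b.count > 0)
        .map(|b| {
            let n = b.count as f64;
            let gap = (b.correct as f64 / n - b.confidence_sum / n).abs();
            (n / total) * gap
        })
        .sum()
}
```

A value near zero means stated confidence tracks observed accuracy; four answers given at 0.9 confidence with three correct yield an ECE of about 0.15.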

Enums§

Answer: Answer type - handles different benchmark formats
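
The variants of `Answer` are not shown on this page. As a sketch of why an enum suits this job, the example below models three answer shapes that the listed benchmarks plausibly need (a number for GSM8K, a choice letter for ARC-C, free text otherwise). `SketchAnswer`, its variants, and the tolerance of `1e-6` are all assumptions made for illustration, not the crate's definition.

```rust
/// Hypothetical illustration; NOT the crate's `Answer` enum.
#[derive(Debug, Clone, PartialEq)]
enum SketchAnswer {
    /// Numeric result, e.g. a grade-school math answer.
    Numeric(f64),
    /// Multiple-choice letter, e.g. 'A' through 'D'.
    Choice(char),
    /// Free-form text answer.
    Text(String),
}

impl SketchAnswer {
    /// Compares two answers using a rule suited to each format.
    /// Answers of different formats never match.
    fn matches(&self, other: &SketchAnswer) -> bool {
        match (self, other) {
            // Small absolute tolerance for floating-point results.
            (SketchAnswer::Numeric(a), SketchAnswer::Numeric(b)) => (a - b).abs() < 1e-6,
            // Choice letters compare case-insensitively.
            (SketchAnswer::Choice(a), SketchAnswer::Choice(b)) => a.eq_ignore_ascii_case(b),
            // Text compares after trimming, ignoring ASCII case.
            (SketchAnswer::Text(a), SketchAnswer::Text(b)) => {
                a.trim().eq_ignore_ascii_case(b.trim())
            }
            _ => false,
        }
    }
}
```

Keeping the comparison rule next to each variant means a runner can score mixed benchmark formats through a single `matches` call.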