Module benchmark

§Benchmark Harness for ThinkTools Evaluation

Provides infrastructure for measuring improvements in reasoning quality against established benchmarks (GSM8K, MATH, TruthfulQA, etc.).

§Supported Benchmarks

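| Benchmark  | Type           | Metric       | Target |
|------------|----------------|--------------|--------|
| GSM8K      | Math reasoning | Accuracy     | 85.9%  |
| MATH       | Advanced math  | Accuracy     | 36.5%  |
| TruthfulQA | Factuality     | MC1/MC2      | 72%    |
| Game of 24 | Creative       | Success rate | 60%+   |
| ARC-C      | Science        | Accuracy     | 90%    |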
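
To make the Target column concrete, here is a minimal sketch of how per-problem outcomes could be turned into an accuracy percentage and compared with a target. The helpers `accuracy_percent` and `meets_target` are hypothetical names chosen for illustration; they are not part of this module's documented API, and the real harness may score runs differently.

```rust
/// Hypothetical helper (not part of this module): percentage of correct
/// outcomes in a run. Returns `None` for an empty run so callers never
/// divide by zero.
fn accuracy_percent(outcomes: &[bool]) -> Option<f64> {
    if outcomes.is_empty() {
        return None;
    }
    let correct = outcomes.iter().filter(|&&ok| ok).count();
    Some(100.0 * correct as f64 / outcomes.len() as f64)
}

/// Hypothetical helper (not part of this module): true when the measured
/// accuracy reaches or exceeds the target, both given as percentages.
fn meets_target(measured_percent: f64, target_percent: f64) -> bool {
    measured_percent >= target_percent
}
```

With these helpers, a run of three correct answers out of four scores 75%, which would fall short of the 85.9% GSM8K target listed above.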

Modules§

gsm8k
GSM8K-specific loader

Structs§

BenchmarkProblem
Benchmark problem from evaluation set
BenchmarkResults
Aggregate benchmark results
BenchmarkRunner
Benchmark runner
CalibrationMetrics
Calibration metrics for confidence assessment
ComparisonReport
ConfidenceBin
EvaluationResult
Result of evaluating a single problem
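
The fields of `CalibrationMetrics` and `ConfidenceBin` are not shown on this page. As a rough illustration of what binned calibration can look like, the sketch below computes expected calibration error (ECE), one common calibration measure, over a hypothetical `Bin` type. Both `Bin` and `expected_calibration_error` are assumptions made for this example, and the module may use a different metric or layout.

```rust
/// Hypothetical stand-in for a confidence bin; the real `ConfidenceBin`
/// fields are not documented here.
struct Bin {
    /// Number of predictions that fell into this bin.
    count: usize,
    /// Mean stated confidence of those predictions, in [0, 1].
    mean_confidence: f64,
    /// Fraction of those predictions that were correct, in [0, 1].
    accuracy: f64,
}

/// Expected calibration error: the count-weighted average of
/// |accuracy - mean_confidence| across bins. Returns 0.0 when there are
/// no predictions at all.
fn expected_calibration_error(bins: &[Bin]) -> f64 {
    let total: usize = bins.iter().map(|b| b.count).sum();
    if total == 0 {
        return 0.0;
    }
    bins.iter()
        .map(|b| {
            let weight = b.count as f64 / total as f64;
            weight * (b.accuracy - b.mean_confidence).abs()
        })
        .sum()
}
```

A lower value means stated confidence tracks actual accuracy more closely; a perfectly calibrated run scores 0.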

Enums§

Answer
Answer type - handles different benchmark formats
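
The variants of `Answer` are not listed on this page. The sketch below shows one way an enum can cover several benchmark answer formats (numeric answers as in GSM8K, multiple-choice letters as in ARC-C or TruthfulQA, and free text). `SketchAnswer`, its variants, and `matches` are hypothetical names for illustration only and should not be read as the actual definition.

```rust
/// Hypothetical answer type; the real `Answer` enum may have different
/// variants and comparison rules.
#[derive(Debug, Clone, PartialEq)]
enum SketchAnswer {
    /// Numeric answer, e.g. the final value of a math word problem.
    Numeric(f64),
    /// Multiple-choice label such as 'A' or 'B'.
    Choice(char),
    /// Free-form text answer.
    Text(String),
}

impl SketchAnswer {
    /// Compares a predicted answer with the expected one. Numbers use a
    /// small absolute tolerance, choices and text ignore ASCII case, and
    /// text also ignores surrounding whitespace. Answers of different
    /// kinds never match.
    fn matches(&self, expected: &SketchAnswer) -> bool {
        match (self, expected) {
            (SketchAnswer::Numeric(a), SketchAnswer::Numeric(b)) => (a - b).abs() < 1e-6,
            (SketchAnswer::Choice(a), SketchAnswer::Choice(b)) => a.eq_ignore_ascii_case(b),
            (SketchAnswer::Text(a), SketchAnswer::Text(b)) => {
                a.trim().eq_ignore_ascii_case(b.trim())
            }
            _ => false,
        }
    }
}
```

Keeping the comparison rules inside the enum means a runner can score every benchmark through one call, without branching on the benchmark name.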