§Benchmark Harness for ThinkTools Evaluation
Provides infrastructure for measuring improvements in reasoning quality against established benchmarks (GSM8K, MATH, TruthfulQA, Game of 24, ARC-C).
§Supported Benchmarks
| Benchmark | Type | Metric | Target |
|---|---|---|---|
| GSM8K | Math reasoning | Accuracy | 85.9% |
| MATH | Advanced math | Accuracy | 36.5% |
| TruthfulQA | Factuality | MC1/MC2 | 72% |
| Game of 24 | Creative | Success rate | 60%+ |
| ARC-C | Science | Accuracy | 90% |
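As a rough illustration of how a measured score might be checked against the targets in the table, here is a minimal sketch. The function names `target_accuracy`, `accuracy`, and `meets_target` are hypothetical and are not part of this module's API. Only the three rows whose metric is plain accuracy are covered; TruthfulQA (MC1/MC2) and Game of 24 (success rate) are left out on purpose.

```rust
// Hypothetical sketch: none of these names come from this module's API.

/// Target accuracy (as a fraction) for the table rows that report plain accuracy.
fn target_accuracy(benchmark: &str) -> Option<f64> {
    match benchmark {
        "GSM8K" => Some(0.859),
        "MATH" => Some(0.365),
        "ARC-C" => Some(0.90),
        // TruthfulQA and Game of 24 use other metrics and are not handled here.
        _ => None,
    }
}

/// Fraction of problems answered correctly; 0.0 for an empty run.
fn accuracy(correct: usize, total: usize) -> f64 {
    if total == 0 {
        0.0
    } else {
        correct as f64 / total as f64
    }
}

/// Whether a run meets the table's target.
/// Returns `None` when the benchmark has no accuracy target in this sketch.
fn meets_target(benchmark: &str, correct: usize, total: usize) -> Option<bool> {
    target_accuracy(benchmark).map(|target| accuracy(correct, total) >= target)
}
```

For example, `meets_target("GSM8K", 860, 1000)` yields `Some(true)` because 86.0% is at or above the 85.9% target, while an unknown benchmark name yields `None`.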
Modules§
- gsm8k - GSM8K-specific loader
Structs§
- BenchmarkProblem - Benchmark problem from evaluation set
- BenchmarkResults - Aggregate benchmark results
- BenchmarkRunner - Benchmark runner
- CalibrationMetrics - Calibration metrics for confidence assessment
- ComparisonReport
- ConfidenceBin
- EvaluationResult - Result of evaluating a single problem
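The struct list mentions calibration metrics and confidence bins without saying how they are computed. One common calibration measure is expected calibration error (ECE) over equal-width confidence bins. The sketch below shows that technique only; it is an assumption that the calibration structs here compute anything like it, and the function name `expected_calibration_error` and its input shape are hypothetical.

```rust
// Hypothetical sketch of expected calibration error (ECE).
// Not taken from this module; names and signature are illustrative only.

/// Computes ECE for `(confidence, was_correct)` pairs using `num_bins`
/// equal-width bins over [0, 1]. Returns 0.0 for empty input or zero bins.
fn expected_calibration_error(preds: &[(f64, bool)], num_bins: usize) -> f64 {
    if preds.is_empty() || num_bins == 0 {
        return 0.0;
    }

    let mut conf_sum = vec![0.0_f64; num_bins]; // sum of confidences per bin
    let mut correct = vec![0_usize; num_bins]; // correct predictions per bin
    let mut count = vec![0_usize; num_bins]; // predictions per bin

    for &(conf, ok) in preds {
        let c = conf.clamp(0.0, 1.0);
        // A confidence of exactly 1.0 falls into the last bin.
        let idx = ((c * num_bins as f64) as usize).min(num_bins - 1);
        conf_sum[idx] += c;
        count[idx] += 1;
        if ok {
            correct[idx] += 1;
        }
    }

    let n = preds.len() as f64;
    let mut ece = 0.0;
    for ((&m, &s), &k) in count.iter().zip(&conf_sum).zip(&correct) {
        if m == 0 {
            continue; // empty bins contribute nothing
        }
        let avg_conf = s / m as f64;
        let acc = k as f64 / m as f64;
        // Weight each bin's |confidence - accuracy| gap by its share of samples.
        ece += (m as f64 / n) * (avg_conf - acc).abs();
    }
    ece
}
```

A model that reports 0.9 confidence on four answers and gets two right has an average confidence of 0.9 and an accuracy of 0.5 in that bin, so the ECE is 0.4; a lower value means stated confidence tracks actual accuracy more closely.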
Enums§
- Answer - Answer type; handles different benchmark formats