Model evaluation and benchmarking framework (spec §7.10)
Model Evaluation and Benchmarking Framework (aprender::bench)
Provides multi-model comparison for evaluating .apr models on custom tasks.
Unlike QA (single-model validation), this module compares multiple models
to find the smallest model that meets a performance threshold.
§Toyota Way Alignment
- Pull Systems (P3): Pareto frontier pulls smallest viable model
- Muda Elimination: Avoid overprovisioning with right-sized models
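The "pull the smallest viable model" idea can be sketched in plain Rust. The types below (`Candidate`, `smallest_viable`) are illustrative stand-ins, not the aprender API: given candidates with a size and an accuracy, select the smallest one that clears a threshold.

```rust
// Illustrative sketch (hypothetical types, not the aprender::bench API):
// pick the smallest model whose accuracy meets a performance threshold.

#[derive(Debug)]
struct Candidate {
    name: &'static str,
    params_m: u64, // model size in millions of parameters
    accuracy: f64, // task accuracy in [0, 1]
}

/// Return the smallest candidate with accuracy >= threshold, if any.
fn smallest_viable(candidates: &[Candidate], threshold: f64) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.accuracy >= threshold)
        .min_by_key(|c| c.params_m)
}

fn main() {
    let models = [
        Candidate { name: "tiny", params_m: 125, accuracy: 0.71 },
        Candidate { name: "small", params_m: 350, accuracy: 0.84 },
        Candidate { name: "large", params_m: 1300, accuracy: 0.91 },
    ];
    // With a 0.80 threshold, "small" wins: it is the smallest model that qualifies.
    let pick = smallest_viable(&models, 0.80).expect("no model meets threshold");
    println!("{}", pick.name); // prints "small"
}
```

Overprovisioning (muda) is avoided by filtering on the threshold first, then minimizing size, rather than ranking by accuracy alone.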
§References
- Deb et al. (2002) “NSGA-II” for Pareto optimization
§Example
use aprender::bench::{EvalResult, ModelComparison};
let comparison = ModelComparison::new("python-to-rust");
assert!(comparison.results.is_empty());
Structs§
- EvalResult - Result of evaluating a single model on a single task
- EvalSuiteConfig - Evaluation suite configuration
- Example - Example input for evaluation
- ExampleResult - Result for a single example
- ModelComparison - Compare multiple models on the same task
- ParetoPoint - Point on the Pareto frontier
- Recommendation - Recommendation for a specific scenario
Enums§
- Difficulty - Difficulty tier for stratified analysis
- ExampleStatus - Status of an example evaluation
Traits§
- EvalTask - Custom evaluation task trait
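A Pareto frontier over (size, error) keeps only models not dominated on both axes, which is what the smallest-viable search operates over. The sketch below is illustrative only: `Point` is a hypothetical stand-in for the crate's `ParetoPoint`, and the axes are assumed, not taken from the spec.

```rust
// Hypothetical sketch of Pareto-frontier filtering over (size, error) points;
// `Point` is an illustrative stand-in, not the aprender `ParetoPoint` struct.
#[derive(Clone, Debug, PartialEq)]
struct Point {
    size_mb: f64, // model size on disk
    error: f64,   // task error rate (lower is better)
}

/// `a` dominates `b` if it is no worse on both axes and strictly better on at least one.
fn dominates(a: &Point, b: &Point) -> bool {
    a.size_mb <= b.size_mb
        && a.error <= b.error
        && (a.size_mb < b.size_mb || a.error < b.error)
}

/// Keep only non-dominated points (the Pareto frontier).
fn pareto_frontier(points: &[Point]) -> Vec<Point> {
    points
        .iter()
        .filter(|p| !points.iter().any(|q| dominates(q, p)))
        .cloned()
        .collect()
}

fn main() {
    let points = vec![
        Point { size_mb: 100.0, error: 0.10 }, // accurate but big
        Point { size_mb: 50.0, error: 0.20 },  // small but less accurate
        Point { size_mb: 120.0, error: 0.20 }, // dominated by the first point
    ];
    let frontier = pareto_frontier(&points);
    println!("{} points on the frontier", frontier.len()); // prints "2 points on the frontier"
}
```

Note that `dominates(p, p)` is false (the strict-inequality clause fails), so each point comparing against itself in the inner loop is harmless. NSGA-II (Deb et al., 2002, cited above) uses this same dominance relation with faster sorting for large populations.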