# oxibonsai-eval

Model evaluation harness for OxiBonsai — ROUGE, perplexity, accuracy, throughput.

Provides perplexity measurement, MMLU-style multiple-choice accuracy, ROUGE-N/L/S scoring, exact-match scoring, throughput benchmarking, JSONL dataset loading, and JSON/Markdown report generation.

Part of the OxiBonsai project.
## Status

Stable (v0.1.3) — 151 tests passing.
## Features

- `PerplexityEvaluator` — from log-probs or logits; bits-per-byte metric
- `McEvaluator` — MMLU-style multiple-choice with per-subject breakdown
- `ExactMatchEvaluator` — text-match evaluation; `exact_match`/`f1_score` QA scoring
- ROUGE scoring: `RougeNScore` (ROUGE-1/2), `RougeLScore`, `RougeSScore`, `CorpusRouge`
- BLEU scoring: `BleuScore`, `sentence_bleu`, `corpus_bleu`
- ChrF (character n-gram F-score) metric
- METEOR metric
- Bootstrap confidence intervals for all metrics
- `ThroughputBenchmark` — tokens/s, prefill/decode latency, p95/p99
- `EvalDataset` — JSONL loading, train/test splits, deterministic sampling
- `EvalReport` — JSON and Markdown report generation
- Zero external API dependencies — pure Rust
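The perplexity and bits-per-byte metrics listed above reduce to simple arithmetic over per-token log-probabilities. A minimal, self-contained sketch of that math (independent of the crate's actual API; the function names and values here are illustrative only):

```rust
// Illustrative math behind perplexity-style scoring, not the crate's API.
// perplexity = exp(-mean(ln p_i)) over per-token natural-log probabilities;
// bits-per-byte = total negative log2-probability divided by byte count.

fn perplexity(log_probs: &[f64]) -> f64 {
    let mean = log_probs.iter().sum::<f64>() / log_probs.len() as f64;
    (-mean).exp()
}

fn bits_per_byte(log_probs: &[f64], num_bytes: usize) -> f64 {
    // Convert natural-log probabilities to bits, then normalize by byte count.
    let total_bits: f64 = log_probs.iter().map(|lp| -lp / std::f64::consts::LN_2).sum();
    total_bits / num_bytes as f64
}

fn main() {
    // Hypothetical per-token log-probabilities for a 12-byte string.
    let lp = [-0.5_f64, -1.0, -2.0];
    println!("ppl = {:.4}", perplexity(&lp)); // exp(3.5 / 3) ≈ 3.21
    println!("bpb = {:.4}", bits_per_byte(&lp, 12));
}
```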
## Usage

```toml
[dependencies]
oxibonsai-eval = "0.1.3"
```

The original example's literals were lost in extraction; the import path and values below are illustrative reconstructions.

```rust
use oxibonsai_eval::{PerplexityEvaluator, corpus_bleu};

// Perplexity from token log-probabilities (illustrative values)
let log_probs = vec![-0.5, -1.2, -0.8, -2.1];
let ppl = PerplexityEvaluator::from_log_probs(&log_probs);
println!("perplexity: {ppl:.3}");

// Corpus BLEU (illustrative hypothesis/reference pair)
let hypotheses = vec!["the cat sat on the mat"];
let references = vec![vec!["the cat is on the mat"]];
let bleu = corpus_bleu(&hypotheses, &references);
println!("BLEU: {bleu:.3}");
```
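The bootstrap confidence intervals listed under Features follow the standard percentile-bootstrap recipe: resample per-example scores with replacement, record each resample's mean, and read off the quantiles. A self-contained sketch of that technique (the names, RNG, and API here are illustrative, not the crate's):

```rust
// Tiny deterministic LCG so the sketch needs no external RNG crate.
struct Lcg(u64);
impl Lcg {
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// Percentile bootstrap: resample scores with replacement, compute each
/// resample's mean, and take the alpha/2 and 1 - alpha/2 quantiles.
fn bootstrap_ci(scores: &[f64], resamples: usize, alpha: f64) -> (f64, f64) {
    let mut rng = Lcg(0x5EED);
    let mut means: Vec<f64> = (0..resamples)
        .map(|_| {
            let sum: f64 = (0..scores.len())
                .map(|_| scores[rng.next_index(scores.len())])
                .sum();
            sum / scores.len() as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = means[((alpha / 2.0) * resamples as f64) as usize];
    let hi = means[(((1.0 - alpha / 2.0) * resamples as f64) as usize).min(resamples - 1)];
    (lo, hi)
}

fn main() {
    // Hypothetical per-example metric scores.
    let scores = [0.8, 0.6, 0.9, 0.7, 0.65, 0.85, 0.75, 0.9];
    let (lo, hi) = bootstrap_ci(&scores, 1000, 0.05);
    println!("95% CI: [{lo:.3}, {hi:.3}]");
}
```

Because every resampled mean lies between the minimum and maximum score, the interval is always contained in the observed score range.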
## License

Apache-2.0 — COOLJAPAN OU