§oxibonsai-eval
Model evaluation harness for OxiBonsai.
Provides utilities for:
- Perplexity — measures how well a model predicts held-out text.
- MMLU-style multiple choice — accuracy on four-option questions (both string-parsing McEvaluator and logit-based accuracy::McLogitEvaluator).
- Exact match — token-level accuracy for text-generation tasks.
- BLEU — corpus / sentence BLEU with 1..N orders and smoothing.
- chrF / chrF++ — character n-gram F-score (Popović 2015).
- METEOR (lexical) — exact-match-only METEOR.
- SQuAD F1 + EM — standard SQuAD 1.1 normalisation.
- Calibration — ECE, Brier score, NLL (numerically stable).
- Bootstrap CIs — seed-deterministic percentile intervals.
- Streaming / online — running perplexity and accuracy counters.
- Throughput benchmarking — tokens-per-second and latency statistics.
- Dataset loading — JSONL-based EvalDataset and McDataset.
- Report generation — JSON and Markdown evaluation reports.

Standalone sketches of several of these metrics appear under §Metric sketches below.
§Quick start
use oxibonsai_eval::perplexity::PerplexityEvaluator;
let eval = PerplexityEvaluator::new();
// Perfect predictions → PPL ≈ 1.0
let ppl = eval.compute(&[0.0f32; 10]);
assert!((ppl - 1.0).abs() < 1e-5);
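§Metric sketches
The snippets below are standalone illustrations of the underlying formulas in plain Rust. The function and type names are illustrative only; they are not the crate's implementations, whose signatures and numerics may differ.

Perplexity is the exponential of the mean negative log-likelihood over tokens, which is why the all-zero log-probs in the quick start (probability 1 for every token) give PPL ≈ 1:

/// Perplexity from per-token natural-log probabilities:
/// ppl = exp(-(1/n) · Σ log p_i).
fn perplexity(log_probs: &[f32]) -> f64 {
    assert!(!log_probs.is_empty());
    let mean_nll = -log_probs.iter().map(|&lp| lp as f64).sum::<f64>()
        / log_probs.len() as f64;
    mean_nll.exp()
}

// Uniform guesses over a four-way choice (log p = ln 1/4) give PPL = 4.
assert!((perplexity(&[(0.25f32).ln(); 8]) - 4.0).abs() < 1e-4);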
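For calibration, the Brier score is the mean squared gap between predicted probability and the 0/1 outcome, and ECE buckets predictions by confidence and averages the per-bin |accuracy − confidence| gap weighted by bin size. A sketch assuming binary outcomes and equal-width bins (the crate's calibration module may bin differently):

/// Brier score: mean squared gap between probability p and outcome y ∈ {0, 1}.
fn brier(preds: &[(f64, bool)]) -> f64 {
    assert!(!preds.is_empty());
    preds.iter()
        .map(|&(p, y)| (p - if y { 1.0 } else { 0.0 }).powi(2))
        .sum::<f64>() / preds.len() as f64
}

/// Expected calibration error: Σ_b (n_b / n) · |accuracy_b − confidence_b|
/// over `bins` equal-width confidence bins. Assumes p ∈ [0, 1].
fn ece(preds: &[(f64, bool)], bins: usize) -> f64 {
    assert!(!preds.is_empty() && bins > 0);
    let mut conf = vec![0.0; bins];
    let mut hits = vec![0.0; bins];
    let mut count = vec![0usize; bins];
    for &(p, y) in preds {
        let b = ((p * bins as f64) as usize).min(bins - 1);
        conf[b] += p;
        hits[b] += if y { 1.0 } else { 0.0 };
        count[b] += 1;
    }
    let n = preds.len() as f64;
    (0..bins)
        .filter(|&b| count[b] > 0)
        .map(|b| {
            let c = count[b] as f64;
            (c / n) * (hits[b] / c - conf[b] / c).abs()
        })
        .sum()
}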
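A percentile bootstrap resamples the data with replacement many times, recomputes the statistic on each resample, and reads the interval off the sorted resample statistics. A sketch for the mean, using an inline xorshift64 generator so the example stays dependency-free and seed-deterministic (the crate's bootstrap_ci is generic over the statistic and may use a different RNG):

/// Percentile bootstrap CI for the mean; deterministic for a fixed seed.
fn bootstrap_ci_mean(data: &[f64], resamples: usize, alpha: f64, seed: u64) -> (f64, f64) {
    assert!(!data.is_empty() && resamples > 1);
    let mut state = seed | 1; // xorshift64 must not start from zero
    let mut means = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        let mut sum = 0.0;
        for _ in 0..data.len() {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Modulo bias is negligible for an illustration.
            sum += data[(state % data.len() as u64) as usize];
        }
        means.push(sum / data.len() as f64);
    }
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = ((alpha / 2.0) * (resamples - 1) as f64).round() as usize;
    let hi = ((1.0 - alpha / 2.0) * (resamples - 1) as f64).round() as usize;
    (means[lo], means[hi])
}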
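SQuAD token F1 is the harmonic mean of precision and recall over the multiset overlap of prediction and reference tokens. This sketch skips the SQuAD 1.1 normalisation step (lowercasing, article and punctuation stripping) that the qa module's normalize_answer applies first:

use std::collections::HashMap;

/// Token-level F1 over whitespace tokens (multiset overlap).
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred: Vec<&str> = prediction.split_whitespace().collect();
    let gold: Vec<&str> = reference.split_whitespace().collect();
    if pred.is_empty() || gold.is_empty() {
        // Convention: both empty → perfect match, otherwise zero.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    let mut remaining: HashMap<&str, i64> = HashMap::new();
    for &t in &gold {
        *remaining.entry(t).or_insert(0) += 1;
    }
    let mut overlap = 0.0;
    for &t in &pred {
        let c = remaining.entry(t).or_insert(0);
        if *c > 0 {
            overlap += 1.0;
            *c -= 1;
        }
    }
    if overlap == 0.0 {
        return 0.0;
    }
    let precision = overlap / pred.len() as f64;
    let recall = overlap / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}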
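Streaming perplexity only needs two running scalars: total NLL and token count. A sketch of the state an online evaluator has to carry (streaming::OnlinePerplexity's actual interface may differ):

/// Running perplexity: accumulate total NLL and token count,
/// report exp(mean NLL) on demand.
struct RunningPerplexity {
    nll_sum: f64,
    tokens: u64,
}

impl RunningPerplexity {
    fn new() -> Self {
        Self { nll_sum: 0.0, tokens: 0 }
    }
    /// Accumulate a batch of per-token natural-log probabilities.
    fn update(&mut self, log_probs: &[f32]) {
        self.nll_sum -= log_probs.iter().map(|&lp| lp as f64).sum::<f64>();
        self.tokens += log_probs.len() as u64;
    }
    /// exp(mean NLL) over everything seen so far.
    fn perplexity(&self) -> f64 {
        if self.tokens == 0 {
            f64::NAN
        } else {
            (self.nll_sum / self.tokens as f64).exp()
        }
    }
}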
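Throughput reporting reduces to tokens divided by wall time, plus order statistics over recorded latencies. A nearest-rank percentile sketch (the throughput module's percentile may interpolate instead):

/// Nearest-rank percentile on a pre-sorted sample; p in [0, 100].
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=100.0).contains(&p));
    let rank = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[rank]
}

// Median and tail latency over millisecond timings.
let mut latencies = vec![12.1, 9.8, 15.3, 10.4, 11.0];
latencies.sort_by(|a, b| a.partial_cmp(b).unwrap());
let p50 = percentile(&latencies, 50.0);
let p99 = percentile(&latencies, 99.0);
assert!(p50 <= p99);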
§Re-exports
pub use accuracy::AccuracyResult;
pub use accuracy::ExactMatchEvaluator;
pub use accuracy::LogitMcResult;
pub use accuracy::McEvaluator;
pub use accuracy::McLogitEvaluator;
pub use arc::ArcEvaluator;
pub use arc::ArcResult;
pub use arc::ArcSplit;
pub use bleu::corpus_bleu;
pub use bleu::sentence_bleu;
pub use bleu::BleuConfig;
pub use bleu::BleuScore;
pub use bleu::SmoothingMethod;
pub use boolq::BoolQDataset;
pub use boolq::BoolQEvaluator;
pub use boolq::BoolQItem;
pub use boolq::BoolQResult;
pub use bootstrap::bootstrap_ci;
pub use bootstrap::ConfidenceInterval;
pub use calibration::brier_score;
pub use calibration::calibration_all;
pub use calibration::expected_calibration_error;
pub use calibration::nll_from_logits;
pub use calibration::BinStat;
pub use calibration::CalibrationResult;
pub use chrf::chrf;
pub use chrf::chrf_plus_plus;
pub use chrf::chrf_with;
pub use chrf::ChrfScore;
pub use dataset::EvalDataset;
pub use dataset::EvalExample;
pub use dataset::McDataset;
pub use dataset::MultipleChoiceQuestion;
pub use error::EvalError;
pub use gsm8k::Gsm8kEvaluator;
pub use gsm8k::Gsm8kResult;
pub use hellaswag::HellaSwagDataset;
pub use hellaswag::HellaSwagEvaluator;
pub use hellaswag::HellaSwagItem;
pub use hellaswag::HellaSwagResult;
pub use meteor::align_tokens;
pub use meteor::meteor;
pub use meteor::meteor_multi;
pub use meteor::MeteorConfig;
pub use meteor::MeteorScore;
pub use mmlu::MmluEvaluator;
pub use mmlu::MmluResult;
pub use perplexity::PerplexityEvaluator;
pub use perplexity::PerplexityResult;
pub use qa::corpus_em_f1;
pub use qa::exact_match as qa_exact_match;
pub use qa::f1_score as qa_f1_score;
pub use qa::normalize_answer;
pub use qa::normalize_tokens;
pub use qa::score_multi as qa_score_multi;
pub use qa::QaScore;
pub use report::EvalReport;
pub use report::EvalResultEntry;
pub use rouge::ngram_counts;
pub use rouge::tokenize;
pub use rouge::CorpusRouge;
pub use rouge::RougeLScore;
pub use rouge::RougeNScore;
pub use rouge::RougeSScore;
pub use rouge::TokenSeq;
pub use streaming::OnlineAccuracy;
pub use streaming::OnlinePerplexity;
pub use throughput::percentile;
pub use throughput::time_fn;
pub use throughput::ThroughputBenchmark;
pub use throughput::ThroughputResult;
pub use truthfulqa::TruthfulQaDataset;
pub use truthfulqa::TruthfulQaEvaluator;
pub use truthfulqa::TruthfulQaItem;
pub use truthfulqa::TruthfulQaMode;
pub use truthfulqa::TruthfulQaResult;
pub use winogrande::WinoGrandeDataset;
pub use winogrande::WinoGrandeEvaluator;
pub use winogrande::WinoGrandeItem;
pub use winogrande::WinoGrandeResult;
§Modules
- accuracy — Accuracy evaluation: multiple-choice (MMLU-style) and exact-match scoring.
- arc — ARC (AI2 Reasoning Challenge) evaluation harness.
- bleu — BLEU (Bilingual Evaluation Understudy) implementation.
- boolq — BoolQ yes/no question answering evaluation harness.
- bootstrap — Bootstrap confidence intervals.
- calibration — Calibration metrics: ECE, Brier score, NLL.
- chrf — chrF and chrF++: character n-gram F-score (Popović 2015).
- dataset — Dataset types and loaders for the evaluation harness.
- error — Error types for the evaluation harness.
- gsm8k — GSM8K (Grade School Math 8K) evaluation harness.
- hellaswag — HellaSwag sentence-completion evaluation harness.
- meteor — METEOR (lexical subset), exact-match only.
- mmlu — MMLU (Massive Multitask Language Understanding) evaluation harness.
- perplexity — Perplexity evaluator.
- qa — SQuAD-style QA evaluation: Exact Match (EM) and token F1.
- report — Evaluation report builder.
- rouge — ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.
- streaming — Streaming / online evaluation state machines.
- throughput — Throughput benchmarking for LLM inference.
- truthfulqa — TruthfulQA evaluation harness.
- winogrande — WinoGrande commonsense reasoning evaluation harness.