Crate oxibonsai_eval

§oxibonsai-eval

Model evaluation harness for OxiBonsai.

Provides utilities for:

  • Perplexity — measures how well a model predicts held-out text.
  • MMLU-style multiple choice — accuracy on four-option questions (both the string-parsing McEvaluator and the logit-based McLogitEvaluator, in the accuracy module).
  • Exact match — token-level accuracy for text-generation tasks.
  • BLEU — corpus and sentence BLEU with n-gram orders 1..N and smoothing (sketch below).
  • chrF / chrF++ — character n-gram F-score (Popović 2015).
  • METEOR (lexical) — exact-match-only variant of METEOR (no stemming or synonym stages).
  • SQuAD F1 + EM — standard SQuAD 1.1 normalisation (sketch below).
  • Calibration — ECE, Brier score, NLL (numerically stable; sketch below).
  • Bootstrap CIs — seed-deterministic percentile intervals (sketch below).
  • Streaming / online — running perplexity and accuracy counters (sketch below).
  • Throughput benchmarking — tokens-per-second and latency statistics.
  • Dataset loading — JSONL-based EvalDataset and McDataset.
  • Report generation — JSON and Markdown evaluation reports.

§Quick start

use oxibonsai_eval::perplexity::PerplexityEvaluator;

let eval = PerplexityEvaluator::new();
// Ten per-token log-probabilities of 0.0, i.e. probability 1.0 for every
// token: a perfect prediction, so PPL ≈ 1.0.
let ppl = eval.compute(&[0.0f32; 10]);
assert!((ppl - 1.0).abs() < 1e-5);
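
§Streaming perplexity

The streaming module keeps running counters so metrics can be updated token by token rather than over a full batch. A minimal sketch, assuming OnlinePerplexity exposes a new constructor, a per-token update taking an f32 log-probability, and a perplexity accessor; these method names are assumptions for illustration, not confirmed API:

use oxibonsai_eval::streaming::OnlinePerplexity;

let mut ppl = OnlinePerplexity::new();
// Feed per-token log-probabilities as they arrive (method names assumed).
for logprob in [-0.1f32, -0.2, -0.05] {
    ppl.update(logprob);
}
println!("running ppl = {:?}", ppl.perplexity());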
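
§Sentence BLEU

BLEU scores candidate n-grams against reference n-grams with a brevity penalty. A hedged sketch, assuming sentence_bleu accepts pre-tokenized references and a candidate plus a BleuConfig; the argument shapes, a Default impl on BleuConfig, and Debug on BleuScore are all assumptions:

use oxibonsai_eval::bleu::{sentence_bleu, BleuConfig, BleuScore};

// Tokenization and the exact argument shapes are assumed for illustration.
let references = vec![vec!["the", "cat", "sat", "on", "the", "mat"]];
let candidate = vec!["the", "cat", "is", "on", "the", "mat"];
let score: BleuScore = sentence_bleu(&references, &candidate, &BleuConfig::default());
println!("{score:?}");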
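
§SQuAD-style EM and F1

The qa module applies SQuAD 1.1 normalisation (lowercasing, article and punctuation stripping, whitespace collapsing) before scoring. A sketch using the re-exported aliases, assuming both functions take (prediction, gold) string slices; return types are printed via Debug since they are not confirmed here:

use oxibonsai_eval::{qa_exact_match, qa_f1_score};

// Normalisation is expected to make the first pair match exactly.
let em = qa_exact_match("The Eiffel Tower!", "eiffel tower");
let f1 = qa_f1_score("the tall eiffel tower", "eiffel tower");
println!("EM = {em:?}, F1 = {f1:?}");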
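
§Calibration

ECE bins predictions by confidence and averages the gap between confidence and accuracy in each bin; the Brier score is the mean squared error between confidence and the 0/1 outcome. A sketch, assuming the functions take parallel slices of confidences and correctness labels, with an assumed bin-count parameter on ECE:

use oxibonsai_eval::calibration::{brier_score, expected_calibration_error};

let confidences = vec![0.9f64, 0.7, 0.6, 0.95];
let correct = vec![true, false, true, true];
// 10 equal-width bins is a common default; this parameter is assumed here.
let ece = expected_calibration_error(&confidences, &correct, 10);
let brier = brier_score(&confidences, &correct);
println!("ECE = {ece:?}, Brier = {brier:?}");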
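
§Bootstrap confidence intervals

Bootstrapping resamples per-example scores with replacement and reads percentile bounds off the resampled means; the crate documents its intervals as seed-deterministic. A sketch with assumed parameters (resample count, confidence level, RNG seed) and an assumed argument order:

use oxibonsai_eval::{bootstrap_ci, ConfidenceInterval};

let scores = vec![1.0f64, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0];
// Argument order and types are assumptions for illustration.
let ci: ConfidenceInterval = bootstrap_ci(&scores, 1000, 0.95, 42);
println!("{ci:?}");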

Re-exports§

pub use accuracy::AccuracyResult;
pub use accuracy::ExactMatchEvaluator;
pub use accuracy::LogitMcResult;
pub use accuracy::McEvaluator;
pub use accuracy::McLogitEvaluator;
pub use arc::ArcEvaluator;
pub use arc::ArcResult;
pub use arc::ArcSplit;
pub use bleu::corpus_bleu;
pub use bleu::sentence_bleu;
pub use bleu::BleuConfig;
pub use bleu::BleuScore;
pub use bleu::SmoothingMethod;
pub use boolq::BoolQDataset;
pub use boolq::BoolQEvaluator;
pub use boolq::BoolQItem;
pub use boolq::BoolQResult;
pub use bootstrap::bootstrap_ci;
pub use bootstrap::ConfidenceInterval;
pub use calibration::brier_score;
pub use calibration::calibration_all;
pub use calibration::expected_calibration_error;
pub use calibration::nll_from_logits;
pub use calibration::BinStat;
pub use calibration::CalibrationResult;
pub use chrf::chrf;
pub use chrf::chrf_plus_plus;
pub use chrf::chrf_with;
pub use chrf::ChrfScore;
pub use dataset::EvalDataset;
pub use dataset::EvalExample;
pub use dataset::McDataset;
pub use dataset::MultipleChoiceQuestion;
pub use error::EvalError;
pub use gsm8k::Gsm8kEvaluator;
pub use gsm8k::Gsm8kResult;
pub use hellaswag::HellaSwagDataset;
pub use hellaswag::HellaSwagEvaluator;
pub use hellaswag::HellaSwagItem;
pub use hellaswag::HellaSwagResult;
pub use meteor::align_tokens;
pub use meteor::meteor;
pub use meteor::meteor_multi;
pub use meteor::MeteorConfig;
pub use meteor::MeteorScore;
pub use mmlu::MmluEvaluator;
pub use mmlu::MmluResult;
pub use perplexity::PerplexityEvaluator;
pub use perplexity::PerplexityResult;
pub use qa::corpus_em_f1;
pub use qa::exact_match as qa_exact_match;
pub use qa::f1_score as qa_f1_score;
pub use qa::normalize_answer;
pub use qa::normalize_tokens;
pub use qa::score_multi as qa_score_multi;
pub use qa::QaScore;
pub use report::EvalReport;
pub use report::EvalResultEntry;
pub use rouge::ngram_counts;
pub use rouge::tokenize;
pub use rouge::CorpusRouge;
pub use rouge::RougeLScore;
pub use rouge::RougeNScore;
pub use rouge::RougeSScore;
pub use rouge::TokenSeq;
pub use streaming::OnlineAccuracy;
pub use streaming::OnlinePerplexity;
pub use throughput::percentile;
pub use throughput::time_fn;
pub use throughput::ThroughputBenchmark;
pub use throughput::ThroughputResult;
pub use truthfulqa::TruthfulQaDataset;
pub use truthfulqa::TruthfulQaEvaluator;
pub use truthfulqa::TruthfulQaItem;
pub use truthfulqa::TruthfulQaMode;
pub use truthfulqa::TruthfulQaResult;
pub use winogrande::WinoGrandeDataset;
pub use winogrande::WinoGrandeEvaluator;
pub use winogrande::WinoGrandeItem;
pub use winogrande::WinoGrandeResult;

Modules§

accuracy
Accuracy evaluation: multiple-choice (MMLU-style) and exact-match scoring.
arc
ARC (AI2 Reasoning Challenge) evaluation harness.
bleu
BLEU (Bilingual Evaluation Understudy) implementation.
boolq
BoolQ yes/no question answering evaluation harness.
bootstrap
Bootstrap confidence intervals.
calibration
Calibration metrics — ECE, Brier score, NLL.
chrf
chrF and chrF++ — character n-gram F-score (Popović 2015).
dataset
Dataset types and loaders for the evaluation harness.
error
Error types for the evaluation harness.
gsm8k
GSM8K (Grade School Math 8K) evaluation harness.
hellaswag
HellaSwag sentence-completion evaluation harness.
meteor
METEOR (lexical subset) — exact-match only.
mmlu
MMLU (Massive Multitask Language Understanding) evaluation harness.
perplexity
Perplexity evaluator.
qa
SQuAD-style QA evaluation — Exact Match (EM) and token F1.
report
Evaluation report builder.
rouge
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.
streaming
Streaming / online evaluation state machines.
throughput
Throughput benchmarking for LLM inference.
truthfulqa
TruthfulQA evaluation harness.
winogrande
WinoGrande commonsense reasoning evaluation harness.