§oxibonsai-eval
Model evaluation harness for OxiBonsai.
Provides utilities for:
- Perplexity — measures how well a model predicts held-out text.
- MMLU-style multiple choice — accuracy on four-option questions (both string-parsing McEvaluator and logit-based accuracy::McLogitEvaluator).
- Exact match — token-level accuracy for text-generation tasks.
- BLEU — corpus / sentence BLEU with 1..N orders and smoothing.
- chrF / chrF++ — character n-gram F-score (Popović 2015).
- METEOR (lexical) — exact-match-only METEOR.
- SQuAD F1 + EM — standard SQuAD 1.1 normalisation.
- Calibration — ECE, Brier score, NLL (numerically stable).
- Bootstrap CIs — seed-deterministic percentile intervals.
- Streaming / online — running perplexity and accuracy counters.
- Throughput benchmarking — tokens-per-second and latency statistics.
- Dataset loading — JSONL-based EvalDataset and McDataset.
- Report generation — JSON and Markdown evaluation reports.

Standalone sketches of several of these metrics appear under §Metric sketches below.
§Quick start
use oxibonsai_eval::perplexity::PerplexityEvaluator;
let eval = PerplexityEvaluator::new();
// Perfect predictions → PPL ≈ 1.0
let ppl = eval.compute(&[0.0f32; 10]);
assert!((ppl - 1.0).abs() < 1e-5);
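§Metric sketches
The snippets below are standalone illustrations of the underlying formulas in plain Rust. The function and type names are illustrative only; they are not the crate's implementations, whose signatures and numerics may differ.

Perplexity is the exponential of the mean negative log-likelihood over tokens, which is why the all-zero log-probs in the quick start (probability 1 for every token) give PPL ≈ 1:

/// Perplexity from per-token natural-log probabilities:
/// ppl = exp(-(1/n) · Σ log p_i).
fn perplexity(log_probs: &[f32]) -> f64 {
    assert!(!log_probs.is_empty());
    let mean_nll = -log_probs.iter().map(|&lp| lp as f64).sum::<f64>()
        / log_probs.len() as f64;
    mean_nll.exp()
}

// Uniform guesses over a four-way choice (log p = ln 1/4) give PPL = 4.
assert!((perplexity(&[(0.25f32).ln(); 8]) - 4.0).abs() < 1e-4);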
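For calibration, the Brier score is the mean squared gap between predicted probability and the 0/1 outcome, and ECE buckets predictions by confidence and averages the per-bin |accuracy − confidence| gap weighted by bin size. A sketch assuming binary outcomes and equal-width bins (the crate's calibration module may bin differently):

/// Brier score: mean squared gap between probability p and outcome y ∈ {0, 1}.
fn brier(preds: &[(f64, bool)]) -> f64 {
    assert!(!preds.is_empty());
    preds.iter()
        .map(|&(p, y)| (p - if y { 1.0 } else { 0.0 }).powi(2))
        .sum::<f64>() / preds.len() as f64
}

/// Expected calibration error: Σ_b (n_b / n) · |accuracy_b − confidence_b|
/// over `bins` equal-width confidence bins. Assumes p ∈ [0, 1].
fn ece(preds: &[(f64, bool)], bins: usize) -> f64 {
    assert!(!preds.is_empty() && bins > 0);
    let mut conf = vec![0.0; bins];
    let mut hits = vec![0.0; bins];
    let mut count = vec![0usize; bins];
    for &(p, y) in preds {
        let b = ((p * bins as f64) as usize).min(bins - 1);
        conf[b] += p;
        hits[b] += if y { 1.0 } else { 0.0 };
        count[b] += 1;
    }
    let n = preds.len() as f64;
    (0..bins)
        .filter(|&b| count[b] > 0)
        .map(|b| {
            let c = count[b] as f64;
            (c / n) * (hits[b] / c - conf[b] / c).abs()
        })
        .sum()
}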
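A percentile bootstrap resamples the data with replacement many times, recomputes the statistic on each resample, and reads the interval off the sorted resample statistics. A sketch for the mean, using an inline xorshift64 generator so the example stays dependency-free and seed-deterministic (the crate's bootstrap_ci is generic over the statistic and may use a different RNG):

/// Percentile bootstrap CI for the mean; deterministic for a fixed seed.
fn bootstrap_ci_mean(data: &[f64], resamples: usize, alpha: f64, seed: u64) -> (f64, f64) {
    assert!(!data.is_empty() && resamples > 1);
    let mut state = seed | 1; // xorshift64 must not start from zero
    let mut means = Vec::with_capacity(resamples);
    for _ in 0..resamples {
        let mut sum = 0.0;
        for _ in 0..data.len() {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            // Modulo bias is negligible for an illustration.
            sum += data[(state % data.len() as u64) as usize];
        }
        means.push(sum / data.len() as f64);
    }
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let lo = ((alpha / 2.0) * (resamples - 1) as f64).round() as usize;
    let hi = ((1.0 - alpha / 2.0) * (resamples - 1) as f64).round() as usize;
    (means[lo], means[hi])
}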
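SQuAD token F1 is the harmonic mean of precision and recall over the multiset overlap of prediction and reference tokens. This sketch skips the SQuAD 1.1 normalisation step (lowercasing, article and punctuation stripping) that the qa module's normalize_answer applies first:

use std::collections::HashMap;

/// Token-level F1 over whitespace tokens (multiset overlap).
fn token_f1(prediction: &str, reference: &str) -> f64 {
    let pred: Vec<&str> = prediction.split_whitespace().collect();
    let gold: Vec<&str> = reference.split_whitespace().collect();
    if pred.is_empty() || gold.is_empty() {
        // Convention: both empty → perfect match, otherwise zero.
        return if pred.is_empty() && gold.is_empty() { 1.0 } else { 0.0 };
    }
    let mut remaining: HashMap<&str, i64> = HashMap::new();
    for &t in &gold {
        *remaining.entry(t).or_insert(0) += 1;
    }
    let mut overlap = 0.0;
    for &t in &pred {
        let c = remaining.entry(t).or_insert(0);
        if *c > 0 {
            overlap += 1.0;
            *c -= 1;
        }
    }
    if overlap == 0.0 {
        return 0.0;
    }
    let precision = overlap / pred.len() as f64;
    let recall = overlap / gold.len() as f64;
    2.0 * precision * recall / (precision + recall)
}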
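Streaming perplexity only needs two running scalars: total NLL and token count. A sketch of the state an online evaluator has to carry (streaming::OnlinePerplexity's actual interface may differ):

/// Running perplexity: accumulate total NLL and token count,
/// report exp(mean NLL) on demand.
struct RunningPerplexity {
    nll_sum: f64,
    tokens: u64,
}

impl RunningPerplexity {
    fn new() -> Self {
        Self { nll_sum: 0.0, tokens: 0 }
    }
    /// Accumulate a batch of per-token natural-log probabilities.
    fn update(&mut self, log_probs: &[f32]) {
        self.nll_sum -= log_probs.iter().map(|&lp| lp as f64).sum::<f64>();
        self.tokens += log_probs.len() as u64;
    }
    /// exp(mean NLL) over everything seen so far.
    fn perplexity(&self) -> f64 {
        if self.tokens == 0 {
            f64::NAN
        } else {
            (self.nll_sum / self.tokens as f64).exp()
        }
    }
}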
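Throughput reporting reduces to tokens divided by wall time, plus order statistics over recorded latencies. A nearest-rank percentile sketch (the throughput module's percentile may interpolate instead):

/// Nearest-rank percentile on a pre-sorted sample; p in [0, 100].
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=100.0).contains(&p));
    let rank = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[rank]
}

// Median and tail latency over millisecond timings.
let mut latencies = vec![12.1, 9.8, 15.3, 10.4, 11.0];
latencies.sort_by(|a, b| a.partial_cmp(b).unwrap());
let p50 = percentile(&latencies, 50.0);
let p99 = percentile(&latencies, 99.0);
assert!(p50 <= p99);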
§Re-exports
pub use accuracy::AccuracyResult;
pub use accuracy::ExactMatchEvaluator;
pub use accuracy::LogitMcResult;
pub use accuracy::McEvaluator;
pub use accuracy::McLogitEvaluator;
pub use arc::ArcEvaluator;
pub use arc::ArcResult;
pub use arc::ArcSplit;
pub use bleu::corpus_bleu;
pub use bleu::sentence_bleu;
pub use bleu::BleuConfig;
pub use bleu::BleuScore;
pub use bleu::SmoothingMethod;
pub use boolq::BoolQDataset;
pub use boolq::BoolQEvaluator;
pub use boolq::BoolQItem;
pub use boolq::BoolQResult;
pub use bootstrap::bootstrap_ci;
pub use bootstrap::ConfidenceInterval;
pub use calibration::brier_score;
pub use calibration::calibration_all;
pub use calibration::expected_calibration_error;
pub use calibration::nll_from_logits;
pub use calibration::BinStat;
pub use calibration::CalibrationResult;
pub use chrf::chrf;
pub use chrf::chrf_plus_plus;
pub use chrf::chrf_with;
pub use chrf::ChrfScore;
pub use dataset::EvalDataset;
pub use dataset::EvalExample;
pub use dataset::McDataset;
pub use dataset::MultipleChoiceQuestion;
pub use error::EvalError;
pub use gsm8k::Gsm8kEvaluator;
pub use gsm8k::Gsm8kResult;
pub use hellaswag::HellaSwagDataset;
pub use hellaswag::HellaSwagEvaluator;
pub use hellaswag::HellaSwagItem;
pub use hellaswag::HellaSwagResult;
pub use meteor::align_tokens;
pub use meteor::meteor;
pub use meteor::meteor_multi;
pub use meteor::MeteorConfig;
pub use meteor::MeteorScore;
pub use mmlu::MmluEvaluator;
pub use mmlu::MmluResult;
pub use perplexity::PerplexityEvaluator;
pub use perplexity::PerplexityResult;
pub use qa::corpus_em_f1;
pub use qa::exact_match as qa_exact_match;
pub use qa::f1_score as qa_f1_score;
pub use qa::normalize_answer;
pub use qa::normalize_tokens;
pub use qa::score_multi as qa_score_multi;
pub use qa::QaScore;
pub use report::EvalReport;
pub use report::EvalResultEntry;
pub use rouge::ngram_counts;
pub use rouge::tokenize;
pub use rouge::CorpusRouge;
pub use rouge::RougeLScore;
pub use rouge::RougeNScore;
pub use rouge::RougeSScore;
pub use rouge::TokenSeq;
pub use streaming::OnlineAccuracy;
pub use streaming::OnlinePerplexity;
pub use throughput::percentile;
pub use throughput::time_fn;
pub use throughput::ThroughputBenchmark;
pub use throughput::ThroughputResult;
pub use truthfulqa::TruthfulQaDataset;
pub use truthfulqa::TruthfulQaEvaluator;
pub use truthfulqa::TruthfulQaItem;
pub use truthfulqa::TruthfulQaMode;
pub use truthfulqa::TruthfulQaResult;
pub use winogrande::WinoGrandeDataset;
pub use winogrande::WinoGrandeEvaluator;
pub use winogrande::WinoGrandeItem;
pub use winogrande::WinoGrandeResult;
§Modules
- accuracy — Accuracy evaluation: multiple-choice (MMLU-style) and exact-match scoring.
- arc — ARC (AI2 Reasoning Challenge) evaluation harness.
- bleu — BLEU (Bilingual Evaluation Understudy) implementation.
- boolq — BoolQ yes/no question answering evaluation harness.
- bootstrap — Bootstrap confidence intervals.
- calibration — Calibration metrics: ECE, Brier score, NLL.
- chrf — chrF and chrF++: character n-gram F-score (Popović 2015).
- dataset — Dataset types and loaders for the evaluation harness.
- error — Error types for the evaluation harness.
- gsm8k — GSM8K (Grade School Math 8K) evaluation harness.
- hellaswag — HellaSwag sentence-completion evaluation harness.
- meteor — METEOR (lexical subset), exact-match only.
- mmlu — MMLU (Massive Multitask Language Understanding) evaluation harness.
- perplexity — Perplexity evaluator.
- qa — SQuAD-style QA evaluation: Exact Match (EM) and token F1.
- report — Evaluation report builder.
- rouge — ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics.
- streaming — Streaming / online evaluation state machines.
- throughput — Throughput benchmarking for LLM inference.
- truthfulqa — TruthfulQA evaluation harness.
- winogrande — WinoGrande commonsense reasoning evaluation harness.