# benchgecko
Rust SDK for BenchGecko -- the data platform for comparing AI model benchmarks, estimating inference costs, and exploring performance across providers.
## Overview
benchgecko gives you typed, idiomatic Rust primitives for working with LLM benchmark data. Build comparison tools, cost calculators, model selectors, and leaderboard UIs without scraping or maintaining your own dataset.
The crate provides:
- `Model` struct with builder pattern for constructing models with scores and pricing
- `BenchmarkCategory` enum covering 9 evaluation dimensions (`Reasoning`, `Coding`, `Knowledge`, `Instruction`, `Multilingual`, `Safety`, `LongContext`, `Vision`, `Agentic`)
- `ModelTier` classification (S through D) based on aggregate performance
- `compare_models()` for head-to-head analysis across shared categories
- `estimate_cost()` for calculating inference spend from token counts
- `rank_by_category()` and `filter_by_tier()` for leaderboard and filtering operations
- `best_value()` for finding the most cost-effective model in a set
- Value score computation that balances performance against price
## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
benchgecko = "0.1"
```
## Quick Start

```rust
use benchgecko::{compare_models, estimate_cost, BenchmarkCategory, Model};

// Define models with benchmark scores and pricing.
// Scores, prices, and result field names here are illustrative.
let gpt4 = Model::new("gpt-4")
    .with_context_window(128_000)
    .with_score(BenchmarkCategory::Reasoning, 88.0)
    .with_score(BenchmarkCategory::Coding, 86.0)
    .with_score(BenchmarkCategory::Knowledge, 90.0)
    .with_pricing(30.0, 60.0); // USD per million input / output tokens

let claude = Model::new("claude-3-opus")
    .with_context_window(200_000)
    .with_score(BenchmarkCategory::Reasoning, 87.0)
    .with_score(BenchmarkCategory::Coding, 84.0)
    .with_score(BenchmarkCategory::Knowledge, 89.0)
    .with_pricing(15.0, 75.0);

// Compare across shared categories
let result = compare_models(&gpt4, &claude);
println!("Winner: {}", result.winner);
println!("Per-category breakdown: {:?}", result.categories);

// Estimate cost for a request (input tokens, output tokens)
let cost = estimate_cost(&gpt4, 10_000, 2_000).unwrap();
println!("Estimated cost: ${cost:.4}");
```
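The arithmetic behind `estimate_cost()` is simple to reason about: token counts multiplied by per-million-token prices. A standalone sketch of that calculation (the function name and the assumption that prices are quoted per million tokens are ours, not the crate's):

```rust
// Sketch of per-request cost arithmetic: prices are assumed to be
// quoted in USD per 1,000,000 tokens, as in the Quick Start example.
fn cost_usd(
    input_tokens: u64,
    output_tokens: u64,
    input_price_per_m: f64,
    output_price_per_m: f64,
) -> f64 {
    (input_tokens as f64 / 1_000_000.0) * input_price_per_m
        + (output_tokens as f64 / 1_000_000.0) * output_price_per_m
}
```

For example, 10,000 input tokens at $30/M plus 2,000 output tokens at $60/M comes to $0.30 + $0.12 = $0.42.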
## Tier Classification
Models are classified into tiers based on their average benchmark score:
| Tier | Average Score | Description |
|---|---|---|
| S | 90+ | Elite frontier models |
| A | 80-89 | Strong general-purpose models |
| B | 70-79 | Capable mid-range models |
| C | 60-69 | Budget or older generation |
| D | <60 | Entry-level or legacy |
```rust
use benchgecko::{filter_by_tier, ModelTier};

let models = vec![gpt4, claude]; // models built as in the Quick Start
let elite = filter_by_tier(&models, ModelTier::S);
println!("{} model(s) in tier S", elite.len());
```
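The tier boundaries in the table above can be sketched as a simple threshold match (this is our reading of the table, not the crate's actual implementation; we assume the lower bound of each band is inclusive):

```rust
// Hypothetical mapping from average benchmark score to tier letter,
// following the table: S = 90+, A = 80-89, B = 70-79, C = 60-69, D = <60.
fn tier_for(avg_score: f64) -> char {
    match avg_score {
        s if s >= 90.0 => 'S',
        s if s >= 80.0 => 'A',
        s if s >= 70.0 => 'B',
        s if s >= 60.0 => 'C',
        _ => 'D',
    }
}
```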
## Value Analysis
Find the best performance-per-dollar model using the built-in value score, which divides average benchmark performance by blended token price:
```rust
use benchgecko::best_value;

let models = vec![gpt4, claude]; // models built as in the Quick Start
if let Some(best) = best_value(&models) {
    println!("Best value: {}", best.name); // field name illustrative
}
```
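The value score described above (average benchmark performance divided by blended token price) can be sketched as follows; the exact blending used by the crate is not specified here, so this sketch assumes a simple mean of input and output prices:

```rust
// Hypothetical value score: average benchmark score divided by the
// blended price, assumed here to be the mean of input and output
// prices per million tokens. Higher is better value.
fn value_score(avg_score: f64, input_price_per_m: f64, output_price_per_m: f64) -> f64 {
    let blended = (input_price_per_m + output_price_per_m) / 2.0;
    avg_score / blended
}
```

Under this formula a model scoring 90 on average at $10/M in and $20/M out gets a value score of 90 / 15 = 6.0, so a cheaper model with slightly lower scores can outrank a pricier frontier model.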
## Benchmark Categories
The `BenchmarkCategory` enum covers the major evaluation dimensions tracked by BenchGecko:
| Category | Typical Benchmarks |
|---|---|
| Reasoning | GSM8K, MATH, ARC-Challenge |
| Coding | HumanEval, MBPP, SWE-bench |
| Knowledge | MMLU, HellaSwag, TriviaQA |
| Instruction | MT-Bench, AlpacaEval |
| Multilingual | MGSM, XLSum |
| Safety | TruthfulQA, BBQ |
| LongContext | RULER, Needle-in-a-Haystack |
| Vision | MMMU, MathVista |
| Agentic | WebArena, SWE-bench |
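For reference, the nine dimensions in the table map onto enum variants like the following (a hypothetical mirror of the crate's definition; derives shown are an assumption):

```rust
// Hypothetical mirror of the crate's BenchmarkCategory enum,
// covering the nine evaluation dimensions listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum BenchmarkCategory {
    Reasoning,
    Coding,
    Knowledge,
    Instruction,
    Multilingual,
    Safety,
    LongContext,
    Vision,
    Agentic,
}
```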
## Data Source
Benchmark data, model metadata, and pricing information are maintained by BenchGecko. Visit the platform for live leaderboards, interactive comparisons, and the full model database covering 300+ models across 50+ providers.
## License
MIT