rusty-llm-jury
A Rust-based CLI tool for estimating true success rates when using LLM judges for evaluation.
Table of Contents
- Overview
- Installation
- Quick Start
- How It Works
- CLI Reference
- Examples
- Building from Source
- Testing
- Contributing
- License
Overview
When using Large Language Models (LLMs) as judges to evaluate other models or systems, the judge's own biases and errors can significantly impact the reliability of the evaluation. rusty-llm-jury provides a command-line tool that estimates the true success rate of your system by correcting for LLM judge bias, and reports bootstrap confidence intervals to quantify the uncertainty of that estimate.
Installation
From crates.io (when published)
From source
Quick Start
Basic Estimation
# Estimate true success rate with bias correction
# Output:
# Estimated true pass rate: 0.625
# 95% Confidence interval: [0.234, 0.891]
Using Files
# Load data from CSV files
Synthetic Experiments
# Run TPR/TNR sensitivity analysis
How It Works
The tool implements a bias correction method based on the following steps:
- Judge Accuracy Estimation: Calculate the LLM judge's True Positive Rate (TPR) and True Negative Rate (TNR) using labeled test data
- Correction: Apply the Rogan-Gladen correction formula to account for judge bias:
  θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)
  where p_obs is the observed pass rate from the judge
- Bootstrap Confidence Intervals: Use bootstrap resampling to quantify uncertainty in the estimate
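The three steps above can be sketched in plain Rust as follows. This is an illustrative sketch, not the crate's actual code: the names `JudgeAccuracy`, `judge_accuracy`, `corrected_pass_rate`, `bootstrap_ci`, and `XorShift64` are hypothetical, the small xorshift generator stands in for a proper RNG crate so the example needs no dependencies, and clamping the estimate to [0, 1], using a percentile interval, and skipping undefined bootstrap replicates are all assumptions made here.

```rust
/// Judge accuracy measured on a labeled test set.
#[derive(Debug, Clone, Copy)]
pub struct JudgeAccuracy {
    pub tpr: f64, // P(judge = 1 | human label = 1)
    pub tnr: f64, // P(judge = 0 | human label = 0)
}

/// Estimate TPR and TNR from paired 0/1 human labels and judge predictions.
/// Returns None if the lengths differ or one of the classes is absent.
pub fn judge_accuracy(labels: &[u8], preds: &[u8]) -> Option<JudgeAccuracy> {
    if labels.len() != preds.len() {
        return None;
    }
    let (mut tp, mut pos, mut tn, mut neg) = (0usize, 0usize, 0usize, 0usize);
    for (&y, &p) in labels.iter().zip(preds) {
        if y == 1 {
            pos += 1;
            tp += (p == 1) as usize;
        } else {
            neg += 1;
            tn += (p == 0) as usize;
        }
    }
    if pos == 0 || neg == 0 {
        return None;
    }
    Some(JudgeAccuracy { tpr: tp as f64 / pos as f64, tnr: tn as f64 / neg as f64 })
}

/// Rogan-Gladen correction. Returns None when TPR + TNR <= 1.
/// Clamping to [0, 1] is an assumption of this sketch.
pub fn corrected_pass_rate(p_obs: f64, acc: JudgeAccuracy) -> Option<f64> {
    let denom = acc.tpr + acc.tnr - 1.0;
    if denom <= 0.0 {
        return None;
    }
    Some(((p_obs + acc.tnr - 1.0) / denom).clamp(0.0, 1.0))
}

/// Minimal xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    /// Pseudo-random index in 0..n (n must be > 0; slight modulo bias is ignored).
    fn index(&mut self, n: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        ((self.0 >> 32) % n as u64) as usize
    }
}

/// Percentile bootstrap interval for the corrected pass rate.
pub fn bootstrap_ci(
    labels: &[u8],
    preds: &[u8],
    unlabeled: &[u8],
    iterations: usize,
    level: f64,
    seed: u64,
) -> Option<(f64, f64)> {
    let valid_level = level > 0.0 && level < 1.0;
    if labels.is_empty() || labels.len() != preds.len() || unlabeled.is_empty() || !valid_level {
        return None;
    }
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut estimates = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        // Resample (label, prediction) pairs with replacement.
        let idx: Vec<usize> = (0..labels.len()).map(|_| rng.index(labels.len())).collect();
        let l: Vec<u8> = idx.iter().map(|&i| labels[i]).collect();
        let p: Vec<u8> = idx.iter().map(|&i| preds[i]).collect();
        // Resample the unlabeled judge predictions and recompute p_obs.
        let passes = (0..unlabeled.len())
            .filter(|_| unlabeled[rng.index(unlabeled.len())] == 1)
            .count();
        let p_obs = passes as f64 / unlabeled.len() as f64;
        // Skip replicates where the correction is undefined.
        if let Some(theta) = judge_accuracy(&l, &p).and_then(|a| corrected_pass_rate(p_obs, a)) {
            estimates.push(theta);
        }
    }
    if estimates.is_empty() {
        return None;
    }
    estimates.sort_by(|a, b| a.total_cmp(b));
    let alpha = (1.0 - level) / 2.0;
    let at = |q: f64| estimates[((estimates.len() - 1) as f64 * q).round() as usize];
    Some((at(alpha), at(1.0 - alpha)))
}
```

For example, with test labels `[1,1,1,1,0,0,0,0]` and judge predictions `[1,1,1,0,0,0,0,1]`, TPR and TNR are both 0.75, so an observed pass rate of 0.6 is corrected to (0.6 + 0.75 - 1) / 0.5 = 0.7.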
CLI Reference
llm-jury estimate
Estimate true pass rate with bias correction and confidence intervals.
Options:
- --test-labels <VALUES>: Comma-separated 0/1 values (human labels on test set)
- --test-preds <VALUES>: Comma-separated 0/1 values (judge predictions on test set)
- --unlabeled-preds <VALUES>: Comma-separated 0/1 values (judge predictions on unlabeled data)
- --test-labels-file <FILE>: Load test labels from CSV file
- --test-preds-file <FILE>: Load test predictions from CSV file
- --unlabeled-preds-file <FILE>: Load unlabeled predictions from CSV file
- --bootstrap-iterations <N>: Number of bootstrap iterations (default: 20000)
- --confidence-level <LEVEL>: Confidence level between 0 and 1 (default: 0.95)
- --output <FILE>: Save results to JSON file
- --format <FORMAT>: Output format: text, json, csv (default: text)
llm-jury synth-experiment
Run synthetic sensitivity experiments.
Options:
- --true-failure-rate <RATE>: True failure rate in unlabeled data (default: 0.1)
- --tpr-range <MIN,MAX>: TPR range to test (default: 0.5,1.0)
- --tnr-range <MIN,MAX>: TNR range to test (default: 0.5,1.0)
- --n-points <N>: Number of points in each range (default: 10)
- --n-test-positive <N>: Number of positive test examples (default: 100)
- --n-test-negative <N>: Number of negative test examples (default: 100)
- --n-unlabeled <N>: Number of unlabeled samples (default: 1000)
- --bootstrap-iterations <N>: Bootstrap iterations (default: 2000)
- --seed <SEED>: Random seed for reproducibility
- --output <FILE>: Output file (JSON or CSV based on extension)
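To give a feel for what a sweep over --tpr-range and --tnr-range explores, here is a noise-free sketch of such a grid. It is not the tool's implementation, which presumably simulates random test and unlabeled samples; the names `GridPoint`, `expected_observed_rate`, `linspace`, and `sensitivity_grid` are hypothetical, and the sketch assumes the true pass rate is one minus the true failure rate. For each (TPR, TNR) pair it computes the pass rate the judge would report on average and the factor by which the correction scales any error in that observed rate.

```rust
/// One cell of the (TPR, TNR) grid.
#[derive(Debug, Clone, Copy)]
pub struct GridPoint {
    pub tpr: f64,
    pub tnr: f64,
    /// Pass rate the judge would report on average.
    pub expected_p_obs: f64,
    /// Factor 1 / (TPR + TNR - 1) by which the correction scales errors in
    /// the observed pass rate; None when the judge is no better than chance.
    pub amplification: Option<f64>,
}

/// Expected judge pass rate: true passes are accepted with probability TPR,
/// true failures are wrongly accepted with probability 1 - TNR.
pub fn expected_observed_rate(true_pass_rate: f64, tpr: f64, tnr: f64) -> f64 {
    true_pass_rate * tpr + (1.0 - true_pass_rate) * (1.0 - tnr)
}

/// `n` evenly spaced values from `min` to `max`, inclusive
/// (a single `min` when `n` is less than 2).
pub fn linspace(min: f64, max: f64, n: usize) -> Vec<f64> {
    if n < 2 {
        return vec![min];
    }
    (0..n).map(|i| min + (max - min) * i as f64 / (n - 1) as f64).collect()
}

/// Noise-free sweep over every (TPR, TNR) pair on the grid.
pub fn sensitivity_grid(
    true_failure_rate: f64,
    tpr_range: (f64, f64),
    tnr_range: (f64, f64),
    n_points: usize,
) -> Vec<GridPoint> {
    let true_pass_rate = 1.0 - true_failure_rate;
    let mut grid = Vec::with_capacity(n_points * n_points);
    for &tpr in &linspace(tpr_range.0, tpr_range.1, n_points) {
        for &tnr in &linspace(tnr_range.0, tnr_range.1, n_points) {
            let denom = tpr + tnr - 1.0;
            grid.push(GridPoint {
                tpr,
                tnr,
                expected_p_obs: expected_observed_rate(true_pass_rate, tpr, tnr),
                amplification: if denom > 0.0 { Some(1.0 / denom) } else { None },
            });
        }
    }
    grid
}
```

With TPR = TNR = 0.75 the amplification factor is 2, so a sampling error of 0.01 in the observed pass rate becomes an error of 0.02 in the corrected estimate. At TPR = TNR = 0.5 the correction is undefined.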
Examples
Real-World Usage Pattern
# Step 1: Collect your data
# Step 2: Estimate true success rate
# Step 3: View results
Sensitivity Analysis
# Analyze how estimation varies with judge accuracy
Building from Source
Prerequisites
- Rust 1.70+ (2021 edition)
- Cargo
Build Commands
# Clone repository
# Build release version
# Or using cargo directly
# The binary will be at target/release/llm-jury
Development
# Format code
# Run lints
# Run tests
# All checks
Testing
Run the test suite:
Run with coverage (requires cargo-tarpaulin):
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Inspired by the Python judgy package, which I am learning about in my AI evals course with Shreya and Hamel.
- The Rogan-Gladen correction method for bias correction in diagnostic tests
- Bootstrap methodology for confidence interval estimation
- The Rust ecosystem for excellent tooling and libraries
Note: This tool assumes that your LLM judge performs better than random chance (TPR + TNR > 1). If TPR + TNR ≤ 1, the denominator of the correction formula is zero or negative and the method does not apply.
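The reason for this requirement can be seen from a short derivation. It assumes, as the Rogan-Gladen correction does, that the judge's error rates depend only on the true label, so that the same TPR and TNR hold on the test set and on the unlabeled data. Here θ denotes the true pass rate.

```latex
\begin{align*}
p_{\text{obs}} &= \theta \cdot \mathrm{TPR} + (1 - \theta)(1 - \mathrm{TNR}) \\
               &= \theta\,(\mathrm{TPR} + \mathrm{TNR} - 1) + (1 - \mathrm{TNR}) \\
\hat{\theta}   &= \frac{p_{\text{obs}} + \mathrm{TNR} - 1}{\mathrm{TPR} + \mathrm{TNR} - 1}
\end{align*}
```

When TPR + TNR = 1, the second line reduces to p_obs = 1 - TNR for every θ, so the observed pass rate carries no information about the true pass rate and the final division is undefined.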