rusty-llm-jury

A Rust-based CLI tool for estimating success rates when using LLM judges for evaluation.

Overview
Installation
Quick Start
How It Works
CLI Reference
Examples
Building from Source
Testing
Contributing
License

Overview

When Large Language Models (LLMs) are used as judges to evaluate other models or systems, the judge's own biases and errors can significantly reduce the reliability of the evaluation. rusty-llm-jury is a command-line tool that estimates the true success rate of your system by correcting the judge's observed pass rate for its measured error rates, and that reports bootstrap confidence intervals to quantify the uncertainty in that estimate.

Installation

From crates.io (when published)

cargo install llm-jury

From source

git clone https://github.com/udapy/rusty-llm-jury.git
cd rusty-llm-jury
cargo install --path .

Quick Start

Basic Estimation

# Estimate true success rate with bias correction
llm-jury estimate \
  --test-labels "1,1,0,0,1,0,1,0" \
  --test-preds "1,0,0,1,1,0,1,0" \
  --unlabeled-preds "1,1,0,1,0,1,0,1" \
  --bootstrap-iterations 20000 \
  --confidence-level 0.95

# Example output (illustrative):
# Estimated true pass rate: 0.625
# 95% Confidence interval: [0.234, 0.891]

Using Files

# Load data from CSV files
llm-jury estimate \
  --test-labels-file test_labels.csv \
  --test-preds-file test_preds.csv \
  --unlabeled-preds-file unlabeled_preds.csv
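
The accepted file layout is not spelled out here beyond the single-line, comma-separated files used in the Examples section. As a rough illustration of the validation such input needs, the sketch below parses 0/1 values separated by commas or line breaks. `parse_binary_values` and `read_binary_file` are hypothetical helpers written for this README, not functions taken from the tool's source.

```rust
use std::fs;
use std::path::Path;

/// Parse 0/1 values separated by commas and/or whitespace into booleans.
/// Hypothetical helper for illustration; the tool's own parser may differ.
fn parse_binary_values(input: &str) -> Result<Vec<bool>, String> {
    input
        .split(|c: char| c == ',' || c.is_whitespace())
        .filter(|token| !token.is_empty())
        .map(|token| match token {
            "1" => Ok(true),
            "0" => Ok(false),
            other => Err(format!("expected 0 or 1, found {other:?}")),
        })
        .collect()
}

/// Read a whole file and parse it with `parse_binary_values`.
/// The error string carries the path so a bad file is easy to identify.
fn read_binary_file(path: &Path) -> Result<Vec<bool>, String> {
    let text = fs::read_to_string(path)
        .map_err(|e| format!("{}: {e}", path.display()))?;
    parse_binary_values(&text)
}
```

Rejecting anything other than `0` or `1` up front matters because a silently skipped value would shift the alignment between labels and predictions.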

Synthetic Experiments

# Run TPR/TNR sensitivity analysis
llm-jury synth-experiment \
  --true-failure-rate 0.1 \
  --tpr-range 0.5,0.95 \
  --tnr-range 0.5,0.95 \
  --n-points 10 \
  --output results.json

How It Works

The tool implements a bias correction method based on the following steps:

Judge Accuracy Estimation: Calculate the LLM judge's True Positive Rate (TPR) and True Negative Rate (TNR) using labeled test data
Correction: Apply the Rogan-Gladen correction formula to account for judge bias:
```
θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)
```
where p_obs is the pass rate the judge reports on the unlabeled data
Bootstrap Confidence Intervals: Use bootstrap resampling to quantify uncertainty in the estimate
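
The first two steps above can be sketched in a few lines of Rust. This is a minimal illustration of the formula, not the tool's actual code: the names `JudgeRates`, `judge_rates`, `pass_rate` and `rogan_gladen` are made up for this example, and clamping the result to [0, 1] is an assumption of the sketch.

```rust
/// Judge accuracy measured on the labeled test set.
#[derive(Debug, Clone, Copy, PartialEq)]
struct JudgeRates {
    tpr: f64, // fraction of human "pass" items the judge also passes
    tnr: f64, // fraction of human "fail" items the judge also fails
}

/// Step 1: estimate TPR and TNR from paired human labels and judge predictions.
/// Returns None if the lengths differ or one of the two classes is missing.
fn judge_rates(labels: &[bool], preds: &[bool]) -> Option<JudgeRates> {
    if labels.len() != preds.len() {
        return None;
    }
    let (mut tp, mut pos, mut tn, mut neg) = (0u32, 0u32, 0u32, 0u32);
    for (&label, &pred) in labels.iter().zip(preds) {
        if label {
            pos += 1;
            if pred { tp += 1; }
        } else {
            neg += 1;
            if !pred { tn += 1; }
        }
    }
    if pos == 0 || neg == 0 {
        return None;
    }
    Some(JudgeRates {
        tpr: f64::from(tp) / f64::from(pos),
        tnr: f64::from(tn) / f64::from(neg),
    })
}

/// Observed pass rate p_obs: share of judge predictions that are "pass".
fn pass_rate(preds: &[bool]) -> Option<f64> {
    if preds.is_empty() {
        return None;
    }
    Some(preds.iter().filter(|&&p| p).count() as f64 / preds.len() as f64)
}

/// Step 2: Rogan-Gladen correction of the observed pass rate.
/// Returns None when TPR + TNR <= 1, where the formula is not usable.
fn rogan_gladen(p_obs: f64, rates: JudgeRates) -> Option<f64> {
    let denom = rates.tpr + rates.tnr - 1.0;
    if denom <= 0.0 {
        return None;
    }
    // Assumption of this sketch: keep the estimate inside [0, 1].
    Some(((p_obs + rates.tnr - 1.0) / denom).clamp(0.0, 1.0))
}
```

With the data from the Examples section (labels `1,0,1,1,0,0,1,0`, judge predictions `1,0,0,1,1,0,1,0`, unlabeled predictions `1,1,0,1,0,1,0,1,1,0`), this gives TPR = 0.75, TNR = 0.75 and p_obs = 0.6, so the formula yields (0.6 + 0.75 - 1) / (0.75 + 0.75 - 1) = 0.7. The figure the tool reports may differ if it summarizes the estimate differently.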

CLI Reference

`llm-jury estimate`

Estimate true pass rate with bias correction and confidence intervals.

Options:

--test-labels <VALUES>: Comma-separated 0/1 values (human labels on test set)
--test-preds <VALUES>: Comma-separated 0/1 values (judge predictions on test set)
--unlabeled-preds <VALUES>: Comma-separated 0/1 values (judge predictions on unlabeled data)
--test-labels-file <FILE>: Load test labels from CSV file
--test-preds-file <FILE>: Load test predictions from CSV file
--unlabeled-preds-file <FILE>: Load unlabeled predictions from CSV file
--bootstrap-iterations <N>: Number of bootstrap iterations (default: 20000)
--confidence-level <LEVEL>: Confidence level between 0 and 1 (default: 0.95)
--output <FILE>: Save results to JSON file
--format <FORMAT>: Output format: text, json, csv (default: text)
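
To give a feel for what `--bootstrap-iterations` and `--confidence-level` control, here is a rough sketch of a percentile bootstrap around the corrected estimate. It rests on several assumptions that may not match the tool: both the labeled pairs and the unlabeled predictions are resampled with replacement, resamples where the correction is undefined are skipped, and the interval is read off the sorted estimates. The small `XorShift64` generator only keeps the example dependency-free, and all names are hypothetical.

```rust
/// Tiny xorshift64* generator so the sketch needs no external crates.
/// A real implementation would more likely use a crate such as `rand`.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would stay zero forever, so substitute a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
    /// Index in 0..n; the slight modulo bias is acceptable for a sketch.
    fn index(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

/// Corrected pass rate from (human label, judge prediction) pairs and
/// judge predictions on unlabeled data; None if the correction is undefined.
fn corrected_rate(pairs: &[(bool, bool)], unlabeled: &[bool]) -> Option<f64> {
    let (mut tp, mut pos, mut tn, mut neg) = (0usize, 0usize, 0usize, 0usize);
    for &(label, pred) in pairs {
        if label {
            pos += 1;
            if pred { tp += 1; }
        } else {
            neg += 1;
            if !pred { tn += 1; }
        }
    }
    if pos == 0 || neg == 0 || unlabeled.is_empty() {
        return None;
    }
    let (tpr, tnr) = (tp as f64 / pos as f64, tn as f64 / neg as f64);
    let denom = tpr + tnr - 1.0;
    if denom <= 0.0 {
        return None;
    }
    let p_obs = unlabeled.iter().filter(|&&p| p).count() as f64 / unlabeled.len() as f64;
    Some(((p_obs + tnr - 1.0) / denom).clamp(0.0, 1.0))
}

/// Percentile bootstrap interval for the corrected pass rate.
fn bootstrap_ci(
    pairs: &[(bool, bool)],
    unlabeled: &[bool],
    iterations: usize,
    level: f64,
    seed: u64,
) -> Option<(f64, f64)> {
    if pairs.is_empty() || unlabeled.is_empty() || !(level > 0.0 && level < 1.0) {
        return None;
    }
    let mut rng = XorShift64::new(seed);
    let mut estimates = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        // Resample both data sets with replacement, keeping their sizes.
        let p: Vec<(bool, bool)> =
            (0..pairs.len()).map(|_| pairs[rng.index(pairs.len())]).collect();
        let u: Vec<bool> =
            (0..unlabeled.len()).map(|_| unlabeled[rng.index(unlabeled.len())]).collect();
        if let Some(theta) = corrected_rate(&p, &u) {
            estimates.push(theta);
        }
    }
    if estimates.is_empty() {
        return None;
    }
    estimates.sort_by(|a, b| a.total_cmp(b));
    let alpha = 1.0 - level;
    let last = (estimates.len() - 1) as f64;
    let lower = estimates[(alpha / 2.0 * last).round() as usize];
    let upper = estimates[((1.0 - alpha / 2.0) * last).round() as usize];
    Some((lower, upper))
}
```

More iterations make the interval endpoints more stable at the cost of run time, and a higher confidence level widens the interval. With very small test sets, expect wide intervals, because many resamples give quite different TPR and TNR values.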

`llm-jury synth-experiment`

Run synthetic sensitivity experiments.

Options:

--true-failure-rate <RATE>: True failure rate in unlabeled data (default: 0.1)
--tpr-range <MIN,MAX>: TPR range to test (default: 0.5,1.0)
--tnr-range <MIN,MAX>: TNR range to test (default: 0.5,1.0)
--n-points <N>: Number of points in each range (default: 10)
--n-test-positive <N>: Number of positive test examples (default: 100)
--n-test-negative <N>: Number of negative test examples (default: 100)
--n-unlabeled <N>: Number of unlabeled samples (default: 1000)
--bootstrap-iterations <N>: Bootstrap iterations (default: 2000)
--seed <SEED>: Random seed for reproducibility
--output <FILE>: Output file (JSON or CSV based on extension)
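
One plausible reading of `--tpr-range`, `--tnr-range` and `--n-points` is a grid of evenly spaced values with both endpoints included, with one experiment per (TPR, TNR) combination. That reading is an assumption, and the sketch below only illustrates it; `linspace` and `accuracy_grid` are hypothetical names.

```rust
/// Evenly spaced values from `min` to `max`, both endpoints included.
fn linspace(min: f64, max: f64, n: usize) -> Vec<f64> {
    match n {
        0 => Vec::new(),
        1 => vec![min],
        _ => {
            let step = (max - min) / (n - 1) as f64;
            (0..n).map(|i| min + step * i as f64).collect()
        }
    }
}

/// Every (tpr, tnr) combination of the two ranges: n_points * n_points pairs.
fn accuracy_grid(
    tpr_range: (f64, f64),
    tnr_range: (f64, f64),
    n_points: usize,
) -> Vec<(f64, f64)> {
    let tprs = linspace(tpr_range.0, tpr_range.1, n_points);
    let tnrs = linspace(tnr_range.0, tnr_range.1, n_points);
    tprs.iter()
        .flat_map(|&tpr| tnrs.iter().map(move |&tnr| (tpr, tnr)))
        .collect()
}
```

Under this reading, `--n-points 10` would produce 100 judge configurations, so run time grows with the square of the number of points.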

Examples

Real-World Usage Pattern

# Step 1: Collect your data
echo "1,0,1,1,0,0,1,0" > test_labels.csv      # Human evaluation
echo "1,0,0,1,1,0,1,0" > test_preds.csv       # LLM judge on same data
echo "1,1,0,1,0,1,0,1,1,0" > unlabeled.csv    # LLM judge on target data

# Step 2: Estimate true success rate
llm-jury estimate \
  --test-labels-file test_labels.csv \
  --test-preds-file test_preds.csv \
  --unlabeled-preds-file unlabeled.csv \
  --format json \
  --output results.json

# Step 3: View results
cat results.json

Sensitivity Analysis

# Analyze how estimation varies with judge accuracy
llm-jury synth-experiment \
  --true-failure-rate 0.2 \
  --tpr-range 0.6,0.95 \
  --tnr-range 0.6,0.95 \
  --n-points 15 \
  --seed 42 \
  --output sensitivity_analysis.json
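
To see why judge accuracy matters in such an analysis, it helps to look at the relationship the correction inverts. The sketch below assumes that "pass" is the positive class and that the true pass rate is one minus the true failure rate; both function names are hypothetical. It computes the pass rate an imperfect judge is expected to report and then applies the Rogan-Gladen formula from the How It Works section to recover the true rate.

```rust
/// Pass rate an imperfect judge is expected to report:
/// true passes it recognizes plus true failures it wrongly passes.
fn expected_observed_rate(true_pass_rate: f64, tpr: f64, tnr: f64) -> f64 {
    true_pass_rate * tpr + (1.0 - true_pass_rate) * (1.0 - tnr)
}

/// Rogan-Gladen correction; only meaningful when tpr + tnr > 1.
fn rogan_gladen(p_obs: f64, tpr: f64, tnr: f64) -> f64 {
    (p_obs + tnr - 1.0) / (tpr + tnr - 1.0)
}
```

For a true failure rate of 0.2 (true pass rate 0.8), a judge with TPR = TNR = 0.6 is expected to report 0.8 * 0.6 + 0.2 * 0.4 = 0.56, and the correction maps 0.56 back to 0.8. This holds in expectation only. With finite samples the measured TPR, TNR and p_obs are noisy, and the closer TPR + TNR is to 1, the more that noise is amplified.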

Building from Source

Prerequisites

Rust 1.70+ (2021 edition)
Cargo

Build Commands

# Clone repository
git clone https://github.com/ai-evals-course/rusty-llm-jury.git
cd rusty-llm-jury

# Build release version
make build

# Or using cargo directly
cargo build --release

# The binary will be at target/release/llm-jury

Development

# Format code
make fmt

# Run lints
make clippy

# Run tests
make test

# All checks
make check

Testing

Run the test suite:

cargo test

Run with coverage (requires cargo-tarpaulin):

cargo install cargo-tarpaulin
cargo tarpaulin --out html

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

The Python judgy package, which inspired this tool and which I am learning about in the AI evals course with Shreya and Hamel
The Rogan-Gladen correction method for bias correction in diagnostic tests
Bootstrap methodology for confidence interval estimation
The Rust ecosystem for excellent tooling and libraries

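Note: This tool assumes that your LLM judge performs better than random chance (TPR + TNR > 1). When TPR + TNR is at or below 1, the denominator of the correction formula is zero or negative and the correction does not apply. When it is only slightly above 1, small errors in the measured rates are strongly amplified, so the estimate becomes unreliable.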

rusty-llm-jury 0.1.0