Codec Comparison Guide
A practical guide to comparing image codecs fairly, with metrics accuracy data, viewing condition considerations, and scientific methodology.
For Codec Developers
Integrating your codec? See INTEGRATION.md for:
- Wiring up encode/decode callbacks
- MozJPEG, Jpegli, AVIF examples
- CI quality regression testing
- Interpreting DSSIM thresholds
Want to improve this tool? See CONTRIBUTING.md. We actively want input from codec developers—you know your domain better than we do.
Quick start: use the `codec-eval` library API (below), or the CLI.
The codec-eval Library
API-first design: You provide encode/decode callbacks, the library handles everything else.
```rust
use codec_eval::{EvalConfig, Session, ViewingConditions};

// Builder arguments are illustrative; see INTEGRATION.md for full signatures.
let config = EvalConfig::builder()
    .report_dir("reports/")
    .viewing(ViewingConditions::desktop())
    .quality_levels(&[30, 50, 70, 85, 95])
    .build();

let mut session = Session::new(config);
session.add_codec(my_codec);
let report = session.evaluate_corpus()?;
```
Features:
- DSSIM, Butteraugli, and SSIMULACRA2 metrics (PSNR for legacy comparisons)
- Viewing condition modeling (desktop, mobile, retina)
- Automatic corpus download and caching via codec-corpus
- CSV import for third-party benchmark results
- Pareto front analysis and BD-Rate calculation
- JSON/CSV report generation
- Optional: SVG charts, polynomial interpolation
Quick Quality Checks (New in 0.3)
For simple tests without full corpus evaluation:
```rust
use codec_eval::quick::{assert_quality, assert_perception_level, PerceptionLevel};

// Load images (`LoadedImage` and the paths are illustrative;
// see INTEGRATION.md for the exact types)
let reference = LoadedImage::new("reference.png")?;
let encoded = LoadedImage::new("encoded.avif")?;

// Assert quality thresholds
assert_quality(&reference, &encoded, 70.0)?;

// Or use semantic levels
assert_perception_level(&reference, &encoded, PerceptionLevel::Good)?;
```

Unified imports via `metrics::prelude`:

```rust
use codec_eval::metrics::prelude::*;
// Now in scope: Dssim, butteraugli, compute_ssimulacra2,
// ImgRef, ImgVec, RGB8, RGBA8, etc.
```
See INTEGRATION.md for detailed examples.
Quick Reference
| Metric | Correlation with Human Perception | Best For |
|---|---|---|
| PSNR | ~67% | Legacy benchmarks only |
| SSIM/DSSIM | ~82% | Quick approximation |
| Butteraugli | 80-91% | High-quality threshold (score < 1.0) |
| SSIMULACRA2 | 87-98% | Recommended — best overall accuracy |
| VMAF | ~90% | Video, large datasets |
Fair Comparison Principles
Based on Kornel Lesiński's guide:
1. Never Convert Between Lossy Formats
❌ JPEG → WebP → AVIF (each conversion adds artifacts)
✓ PNG/TIFF → WebP
✓ PNG/TIFF → AVIF
Always start from a lossless source. Converting lossy→lossy compounds artifacts and skews results.
2. Standardize Encoder Settings
Don't compare mozjpeg -quality 80 against cjxl -quality 80 — quality scales differ between encoders.
Instead, match by:
- File size — encode to the same byte budget, compare quality
- Quality metric — encode to the same SSIMULACRA2 score, compare file size
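Matching by file size can be automated with a bisection over the encoder's quality parameter, assuming output size grows monotonically with quality. A minimal sketch; the `encode` callback and the 1-100 quality range are placeholders for whatever your encoder exposes.

```rust
/// Find the highest quality setting whose output fits `target_bytes`,
/// assuming file size grows monotonically with quality (1..=100).
fn match_byte_budget<F>(mut encode: F, target_bytes: usize) -> u8
where
    F: FnMut(u8) -> Vec<u8>,
{
    // Search over 1..=100; `hi` is one past the last valid quality.
    let (mut lo, mut hi) = (1u8, 101u8);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if encode(mid).len() <= target_bytes {
            lo = mid + 1; // fits the budget: try a higher quality
        } else {
            hi = mid; // over budget: back off
        }
    }
    // `lo` is now the first over-budget quality; step back one (min 1).
    lo.saturating_sub(1).max(1)
}

fn main() {
    // Stand-in encoder: 1000 bytes per quality step.
    let q = match_byte_budget(|q| vec![0u8; q as usize * 1000], 42_000);
    println!("quality {q} fits the 42 kB budget"); // quality 42
}
```

Run each codec through the same budget, then compare the resulting metric scores at equal bytes.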
3. Use Multiple Images
A single test image can favor certain codecs. Use diverse datasets:
- Kodak — 24 classic benchmark images
- CLIC 2025 — 62 high-resolution images
- CID22 — 250 perceptual quality research images
These datasets are automatically downloaded and cached via the codec-corpus crate:
```rust
use codec_corpus::Corpus;

// Dataset identifiers are illustrative.
let kodak = Corpus::get_dataset("kodak")?;
let clic = Corpus::get_dataset("clic2025")?;
```
4. Test at Multiple Quality Levels
Codec rankings change across the quality spectrum:
- High quality (SSIMULACRA2 > 80): Differences minimal
- Medium quality (60-80): Most visible differences
- Low quality (< 50): Edge cases, artifacts become dominant
5. Consider Encode/Decode Speed
A codec that's 5% smaller but 100x slower may not be practical. Report:
- Encode time (CPU seconds)
- Decode time (critical for web)
- Memory usage
Quality Metrics Deep Dive
SSIMULACRA2 (Recommended)
The current best metric for perceptual quality assessment.
| Score | Quality Level | Typical Use Case |
|---|---|---|
| < 30 | Poor | Thumbnails, previews |
| 40-50 | Low | Aggressive compression |
| 50-70 | Medium | General web images |
| 70-80 | Good | Photography sites |
| 80-85 | Very High | Professional/archival |
| > 85 | Excellent | Near-lossless |
Accuracy: 87% overall, up to 98% on high-confidence comparisons.
Tool: ssimulacra2_rs
DSSIM
Structural similarity, derived from SSIM but outputs distance (lower = better).
Accuracy: Validated against TID2013 database:
- Spearman correlation: -0.84 to -0.95 (varies by distortion type)
- Best on: Noise, compression artifacts, blur
- Weaker on: Exotic distortions, color shifts
Tool: dssim
| DSSIM Score | Approximate Quality |
|---|---|
| < 0.001 | Visually identical |
| 0.001-0.01 | Excellent |
| 0.01-0.05 | Good |
| 0.05-0.10 | Acceptable |
| > 0.10 | Noticeable artifacts |
Note: Values are not directly comparable between DSSIM versions. Always report version.
Butteraugli
Google's perceptual metric, good for high-quality comparisons.
Accuracy: 80-91% (varies by image type).
Best for: Determining if compression is "transparent" (score < 1.0).
Limitation: Less reliable for heavily compressed images.
VMAF
Netflix's Video Multi-Method Assessment Fusion.
Accuracy: ~90% for video, slightly less for still images.
Best for: Large-scale automated testing, video frames.
PSNR (Avoid)
Peak Signal-to-Noise Ratio — purely mathematical, ignores perception.
Accuracy: ~67% — only slightly better than chance.
Use only: For backwards compatibility with legacy benchmarks.
R-D Angle: Fixed-Frame Rate-Distortion Parameterization
A coordinate system for describing where an encode sits on the rate-distortion tradeoff. Every encode gets an angle measured from the worst corner of a fixed frame. The reference codec's knee (balanced tradeoff point) lands at exactly 45 degrees.
The formula
theta = atan2(quality_norm * aspect, 1.0 - bpp_norm)
Where:
- `bpp_norm = bpp / bpp_max` (how much of the budget you're using)
- `quality_norm = metric_value / metric_max` (how much quality you're getting)
- `aspect` = quality-axis stretch factor, calibrated so the reference knee = 45 deg
For SSIMULACRA2 (higher is better), quality_norm = s2 / s2_max.
For Butteraugli (lower is better), quality_norm = 1.0 - ba / ba_max.
The fixed frame (web targeting)
| Parameter | Value | Rationale |
|---|---|---|
| `bpp_max` | 4.0 | Practical web ceiling. Few images exceed this. |
| `s2_max` | 100.0 | SSIMULACRA2 scale maximum. |
| `ba_max` | 15.0 | Butteraugli score where quality is essentially destroyed. |
| `aspect` | 1.2568 | Calibrated from the CID22-training mozjpeg s2 knee. |
The aspect ratio is derived from the reference knee at (0.7274 bpp, s2=65.10):
aspect = (1 - bpp_knee / bpp_max) / (s2_knee / s2_max)
= (1 - 0.7274 / 4.0) / (65.10 / 100.0)
= 0.81815 / 0.651
= 1.2568
This stretches the quality axis so that the reference knee's quality displacement equals its bpp displacement in the angle computation, producing exactly 45 degrees.
What the angles mean
| Angle | Meaning |
|---|---|
| < 0 deg | Worse than the worst corner. Negative quality (s2 below 0). |
| 0 deg | Worst corner: max bpp, zero quality. |
| ~10-25 deg | Aggressive compression. Thumbnails, heavy lossy. |
| 45 deg | Reference knee. Balanced tradeoff for mozjpeg/CID22. |
| ~52 deg | Ideal diagonal (0 bpp, perfect quality). Theoretical limit. |
| 60-80 deg | Quality-dominated. Large files, diminishing quality returns. |
| 90 deg | No compression: max bpp, max quality. |
| > 90 deg | Over-budget. bpp exceeds the frame ceiling. |
The knee is where the R-D curve transitions from "every extra bit buys meaningful quality" to "you're spending bits for marginal gains." It's the point where the normalized slope of the curve equals 1.0 — equal return on both axes.
Angles below 45 deg are compression-efficient. Angles above 45 deg are quality-focused. The ideal diagonal at ~52 deg represents "perfect quality at zero cost" — unachievable, but it's the geometric ceiling for good encodes.
Calibrated reference numbers
Computed from full corpus evaluation with mozjpeg 4:2:0 progressive, quality sweep 10-98, step 2.
CID22-training (209 images, 512x512):
| Metric | Knee bpp | Knee value | Angle |
|---|---|---|---|
| SSIMULACRA2 | 0.7274 | 65.10 | 45.0 deg |
| Butteraugli | 0.7048 | 4.378 | 47.2 deg |
Disagreement range: 0.70-0.73 bpp (the two metrics nearly agree on where the knee is).
CLIC2025-training (32 images, ~2048px):
| Metric | Knee bpp | Knee value | Angle |
|---|---|---|---|
| SSIMULACRA2 | 0.4623 | 58.95 | 40.0 deg |
| Butteraugli | 0.3948 | 5.192 | 42.4 deg |
Disagreement range: 0.39-0.46 bpp.
CLIC2025 knees are at lower angles because the larger images (~2048px vs 512px) have more pixels per bit — the curve shifts left, and the balanced point is cheaper.
How to determine the angle for a JPEG
Method 1: From corpus averages (fast, approximate)
If you know the bpp and SSIMULACRA2 score for an encode, just compute:
```rust
use codec_eval::stats::rd_knee::FixedFrame; // module path per "Source files" below

let frame = FixedFrame::WEB;
let angle = frame.s2_angle(bpp, s2); // the encode's bpp and SSIMULACRA2 score
```
This tells you where the encode sits relative to the reference knee. If angle < 45, you're compressing harder than the balanced point. If angle > 45, you're spending more bits than necessary for the quality gain.
For Butteraugli:
```rust
let angle = frame.ba_angle(bpp, ba);
```
Method 2: From a per-image Pareto set (precise)
If you have a quality sweep for the specific image (multiple quality settings, each with bpp and metric scores), you can:
1. Build the Pareto front from the sweep data.
2. Find the knee of the per-image R-D curve using `CorpusAggregate` with a single image.
3. Compare the per-image knee angle to the corpus reference (45 deg).
```rust
use codec_eval::stats::rd_knee::{CorpusAggregate, FixedFrame};

// curve: Vec<(bpp, ssimulacra2, butteraugli)> sorted by bpp
let agg = CorpusAggregate::from_single_image(&curve); // constructor name illustrative
let frame = FixedFrame::WEB;

// Find this image's knee
if let Some(knee) = agg.ssimulacra2_knee(&frame) {
    println!("knee at {:.1} deg", knee.fixed_angle);
}
```
A per-image knee at 35 deg means the image reaches diminishing returns earlier (compresses well). A knee at 55 deg means the image needs more bits to look good.
Method 3: Angle of a specific encode on the per-image curve
Given a specific encode (one quality setting), compute its angle and compare to the image's own knee:
```rust
let frame = FixedFrame::WEB;

// The specific encode
let encode_angle = frame.s2_angle(bpp, s2);

// The image's knee (from a previous sweep)
let image_knee_angle = image_knee.fixed_angle;

if encode_angle < image_knee_angle {
    // compressing harder than this image's balanced point
} else {
    // spending more bits than the balanced point needs
}
```
Dual-metric comparison
Every encode has two angles: one from SSIMULACRA2 and one from Butteraugli. Comparing them reveals what kind of artifacts the codec produces:
- `theta_s2 > theta_ba`: The encode looks better structurally (SSIMULACRA2 is happy) than it does perceptually (Butteraugli sees local contrast issues). Common with aggressive chroma subsampling.
- `theta_s2 < theta_ba`: Butteraugli is more forgiving than SSIMULACRA2 at this operating point. The artifacts present are local-contrast-friendly but structurally visible.
- `theta_s2 ≈ theta_ba`: Both metrics agree on the quality level. The encode is well-balanced.
Knee detection algorithm
The knee is found by:
1. Normalize the corpus-aggregate R-D curve to [0, 1] on both axes (per-curve normalization, using observed min/max bpp and quality values).
2. Compute the normalized slope between adjacent points.
3. Find the point closest to where the normalized slope crosses 1.0 (equal quality gain per bit spent).
4. Interpolate linearly between the two bracketing points for a smooth result.
5. Map the raw (bpp, quality) result back to the fixed-frame angle using the aspect ratio.
The per-curve normalization in step 1 is independent of the fixed frame — it uses the actual observed range of the data. The fixed frame and aspect ratio only enter in step 5, when converting the raw knee coordinates to a comparable angle.
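The detection steps can be sketched as follows. This is a simplification: it returns the vertex where the normalized slope drops through 1.0 rather than interpolating between the bracketing points, and it assumes a concave curve with quality increasing in bpp; the real implementation lives in `src/stats/rd_knee.rs`.

```rust
/// Simplified knee finder. `curve` is (bpp, quality) pairs sorted by
/// ascending bpp, with quality increasing. Returns the first vertex where
/// the normalized slope crosses 1.0, or None for degenerate input.
fn find_knee(curve: &[(f64, f64)]) -> Option<(f64, f64)> {
    if curve.len() < 3 {
        return None;
    }
    let (b_min, q_min) = curve[0];
    let (b_max, q_max) = curve[curve.len() - 1];
    // Step 1: per-curve normalization to [0, 1] on both axes.
    let n: Vec<(f64, f64)> = curve
        .iter()
        .map(|&(b, q)| ((b - b_min) / (b_max - b_min), (q - q_min) / (q_max - q_min)))
        .collect();
    // Step 2: normalized slope of the segment starting at point i.
    let slope = |i: usize| (n[i + 1].1 - n[i].1) / (n[i + 1].0 - n[i].0);
    // Step 3: the vertex where the slope falls through 1.0 (equal
    // quality gain per bit spent on either side).
    (1..curve.len() - 1)
        .find(|&i| slope(i - 1) >= 1.0 && slope(i) < 1.0)
        .map(|i| curve[i])
}

fn main() {
    // Toy R-D sweep: steep early gains, then diminishing returns.
    let curve = [(0.0, 0.0), (1.0, 60.0), (2.0, 90.0), (3.0, 100.0)];
    println!("{:?}", find_knee(&curve)); // Some((1.0, 60.0))
}
```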
Source files
- `src/stats/rd_knee.rs` — All types, angle computation, knee detection, SVG plotting
- `crates/codec-compare/src/rd_calibrate.rs` — Calibration binary (corpus sweep)
Viewing Conditions
Pixels Per Degree (PPD)
The number of pixels that fit in one degree of visual field. Critical for assessing when compression artifacts become visible.
| PPD | Context | Notes |
|---|---|---|
| 30 | 1080p at arm's length | Casual viewing |
| 60 | 20/20 vision threshold | Most artifacts visible |
| 80 | Average human acuity limit | Diminishing returns above this |
| 120 | 4K at close range | Overkill for most content |
| 159 | iPhone 15 Pro | "Retina" display density |
Formula:
PPD = (viewing_distance_inches × resolution_ppi × π) / 180
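The formula in code, with an illustrative example: a 24-inch 1080p monitor (roughly 92 ppi) viewed at arm's length (about 18 inches), which lands near the ~30 PPD row in the table above. The ppi and distance values are assumptions for the example, not data from this guide.

```rust
use std::f64::consts::PI;

/// Pixels per degree: pixels spanned by one degree of visual angle
/// at the given viewing distance.
fn ppd(viewing_distance_inches: f64, resolution_ppi: f64) -> f64 {
    // One degree subtends distance * (pi / 180) inches at the eye.
    viewing_distance_inches * resolution_ppi * PI / 180.0
}

fn main() {
    // ~92 ppi panel at ~18 inches:
    println!("{:.0} PPD", ppd(18.0, 92.0)); // ~29
}
```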
Device Categories
| Device Type | Typical PPD | Compression Tolerance |
|---|---|---|
| Desktop monitor | 40-80 | Medium quality acceptable |
| Laptop | 80-120 | Higher quality needed |
| Smartphone | 120-160 | Very high quality or artifacts visible |
| 4K TV at 3m | 30-40 | More compression acceptable |
Practical Implications
- Mobile-first sites need higher quality settings (SSIMULACRA2 > 70)
- Desktop sites can use more aggressive compression (SSIMULACRA2 50-70)
- Thumbnails can be heavily compressed regardless of device
- Hero images on retina displays need minimal compression
Scientific Methodology
ITU-R BT.500
The international standard for subjective video/image quality assessment.
Key elements:
- Controlled viewing conditions (luminance, distance, display calibration)
- Non-expert viewers (15-30 recommended)
- 5-grade Mean Opinion Score (MOS):
- 5: Excellent
- 4: Good
- 3: Fair
- 2: Poor
- 1: Bad
- Statistical analysis with confidence intervals
When to use: Final validation of codec choices, publishing research.
Presentation Methods
| Method | Description | Best For |
|---|---|---|
| DSIS | Show reference, then test image | Impairment detection |
| DSCQS | Side-by-side, both unlabeled | Quality comparison |
| 2AFC | "Which is better?" forced choice | Fine discrimination |
| Flicker test | Rapid A/B alternation | Detecting subtle differences |
Human A/B Testing
When metrics aren't enough, subjective testing provides ground truth. But poorly designed studies produce unreliable data.
Study Design
Randomization:
- Randomize presentation order (left/right, first/second)
- Randomize image order across participants
- Balance codec appearances to avoid order effects
Blinding:
- Participants must not know which codec produced which image
- Use neutral labels ("Image A" / "Image B")
- Don't reveal hypothesis until after data collection
Controls:
- Include known quality differences as sanity checks
- Add duplicate pairs to measure participant consistency
- Include "same image" pairs to detect bias
Sample Size
| Comparison Type | Minimum N | Recommended N |
|---|---|---|
| Large quality difference (obvious) | 15 | 20-30 |
| Medium difference (noticeable) | 30 | 50-80 |
| Small difference (subtle) | 80 | 150+ |
Power analysis: For 80% power to detect a 0.5 MOS difference with SD=1.0, you need ~64 participants per condition.
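The ~64 figure follows from Lehr's rule of thumb for a two-sample comparison (80% power, two-sided alpha = 0.05): n per group is about 16 divided by the squared standardized effect size.

```rust
/// Lehr's rule of thumb: participants per condition for 80% power at
/// two-sided alpha = 0.05, given the effect in MOS points and its SD.
fn n_per_condition(mos_difference: f64, sd: f64) -> f64 {
    let d = mos_difference / sd; // standardized effect size
    16.0 / (d * d)
}

fn main() {
    // A 0.5-MOS difference with SD = 1.0:
    println!("{:.0} per condition", n_per_condition(0.5, 1.0)); // 64
}
```

Halving the detectable difference quadruples the required sample, which is why subtle-difference studies need 150+ participants.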
Participant Screening
Pre-study:
- Visual acuity test (corrected 20/40 or better)
- Color vision screening (Ishihara plates)
- Display calibration verification
Exclusion criteria (define before data collection):
- Failed attention checks (> 20% incorrect on known pairs)
- Inconsistent responses (< 60% agreement on duplicate pairs)
- Response time outliers (< 200ms suggests random clicking)
- Incomplete sessions (< 80% of trials)
Attention Checks
Embed these throughout the study:
Types of attention checks:
1. Obvious pairs - Original vs heavily compressed (SSIMULACRA2 < 30)
2. Identical pairs - Same image twice (should report "same" or 50/50 split)
3. Reversed pairs - Same comparison shown twice, order flipped
4. Instructed response - "For this pair, select the LEFT image"
Threshold: Exclude participants who fail > 2 attention checks or > 20% of obvious pairs.
Bias Detection & Correction
Position bias: Tendency to favor left/right or first/second.
- Detect: Chi-square test on position choices across all trials
- Correct: Counter-balance positions; exclude participants with > 70% same-side choices
Fatigue effects: Quality judgments degrade over time.
- Detect: Compare accuracy on attention checks early vs late in session
- Correct: Limit sessions to 15-20 minutes; analyze by time block
Anchoring: First few images bias subsequent judgments.
- Detect: Compare ratings for same image shown early vs late
- Correct: Use practice trials (discard data); randomize order
Central tendency: Avoiding extreme ratings.
- Detect: Histogram of ratings (should use full scale)
- Correct: Use forced choice (2AFC) instead of rating scales
Statistical Analysis
For rating data (MOS):
1. Calculate mean and 95% CI per condition
2. Check normality (Shapiro-Wilk) - often violated
3. Use robust methods:
- Trimmed means (10-20% trim)
- Bootstrap confidence intervals
- Non-parametric tests (Wilcoxon, Kruskal-Wallis)
4. Report effect sizes (Cohen's d, or MOS difference)
For forced choice (2AFC):
1. Calculate preference percentage per pair
2. Binomial test for significance (H0: 50%)
3. Apply multiple comparison correction:
- Bonferroni (conservative)
- Holm-Bonferroni (less conservative)
- Benjamini-Hochberg FDR (for many comparisons)
4. Report: "Codec A preferred 67% of time (p < 0.01, N=100)"
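The significance check in that example can be sketched with the normal approximation to the binomial; for small N, prefer an exact binomial test.

```rust
/// Two-sided z statistic for a 2AFC preference count against H0: p = 0.5,
/// using the normal approximation (fine for N in the hundreds).
fn preference_z(preferred: u32, trials: u32) -> f64 {
    let n = trials as f64;
    let p_hat = preferred as f64 / n;
    // Under H0, the standard error of p_hat is sqrt(0.25 / n).
    (p_hat - 0.5) / (0.25 / n).sqrt()
}

fn main() {
    // Codec A preferred in 67 of 100 comparisons:
    let z = preference_z(67, 100);
    // |z| > 2.576 is the two-sided 1% critical value, so p < 0.01.
    println!("z = {z:.2}, p < 0.01: {}", z.abs() > 2.576);
}
```

Apply the multiple-comparison correction from step 3 to the resulting p-values before reporting.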
Outlier handling:
1. Define criteria BEFORE analysis (pre-registration)
2. Report both with and without outlier exclusion
3. Use robust statistics that down-weight outliers
4. Never exclude based on "inconvenient" results
Reporting Results
Always include:
- Sample size (N) and exclusion count with reasons
- Confidence intervals, not just p-values
- Effect sizes in meaningful units (MOS points, % preference)
- Individual data points or distributions (not just means)
- Attention check pass rates
- Participant demographics (if relevant to display/vision)
Example:
"N=87 participants completed the study (12 excluded: 8 failed attention checks, 4 incomplete). Codec A was preferred over Codec B in 62% of comparisons (95% CI: 55-69%, p=0.003, binomial test). This corresponds to a mean quality difference of 0.4 MOS points (95% CI: 0.2-0.6)."
Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Small N | Underpowered, unreliable | Power analysis before study |
| No attention checks | Can't detect random responders | Embed 10-15% check trials |
| Post-hoc exclusion | Cherry-picking results | Pre-register exclusion criteria |
| Only reporting means | Hides variability | Show distributions + CI |
| Multiple comparisons | Inflated false positives | Apply correction (Bonferroni, FDR) |
| Unbalanced design | Confounds codec with position/order | Full counterbalancing |
| Lab-only testing | May not generalize | Include diverse participants/displays |
Real-World Studies
Cloudinary 2021 (1.4 million opinions):
- JPEG XL: 10-15% better than AVIF at web quality levels
- AVIF: Best for low-bandwidth scenarios
- WebP: Solid middle ground
- All modern codecs beat JPEG by 25-35%
Recommended Workflow
For Quick Comparisons
1. Encode with each codec to the same file size
2. Measure with SSIMULACRA2 (higher score wins)
For Thorough Evaluation
- Gather diverse test images from codec-corpus
- Create quality ladder (10 quality levels per codec)
- Compute metrics for each combination
- Plot rate-distortion curves (file size vs quality)
- Consider encode/decode speed
- Validate with subjective testing if publishing results
Tools & Implementations
SSIMULACRA2
| Implementation | Type | Install | Notes |
|---|---|---|---|
| ssimulacra2_rs | CLI (Rust) | `cargo install ssimulacra2_rs` | Recommended |
| ssimulacra2 | Library (Rust) | `cargo add ssimulacra2` | For integration |
| ssimulacra2-cuda | GPU (CUDA) | `cargo install ssimulacra2-cuda` | Fast batch processing |
| libjxl | CLI (C++) | Build from source | Original implementation |
```sh
# Install CLI
cargo install ssimulacra2_rs

# Usage
ssimulacra2_rs image reference.png distorted.png
# Output: 76.543210 (higher = better, scale 0-100)
```
DSSIM
| Implementation | Type | Install | Notes |
|---|---|---|---|
| dssim | CLI (Rust) | `cargo install dssim` | Recommended |
```sh
# Install
cargo install dssim

# Basic comparison (lower = better)
dssim reference.png encoded.png
# Output: 0.02341

# Generate difference visualization
dssim -o diff.png reference.png encoded.png
```
Accuracy: Validated against TID2013 database. Spearman correlation -0.84 to -0.95 depending on distortion type.
Butteraugli
| Implementation | Type | Install | Notes |
|---|---|---|---|
| butteraugli | CLI (C++) | Build from source | Original |
| libjxl | CLI (C++) | Build from source | Includes butteraugli |
VMAF
| Implementation | Type | Install | Notes |
|---|---|---|---|
| libvmaf | CLI + Library | Package manager or build | Official Netflix implementation |
```sh
# Ubuntu/Debian
sudo apt install libvmaf-dev

# Usage (via ffmpeg)
ffmpeg -i distorted.mp4 -i reference.mp4 -lavfi libvmaf -f null -
```
Image Processing
| Tool | Purpose | Notes |
|---|---|---|
| imageflow | High-performance image processing with quality calibration | Rust + C ABI |
| libvips | Fast image processing library | C + bindings |
| sharp | Node.js image processing (uses libvips) | npm |
References
Methodology
- How to Compare Images Fairly — Kornel Lesiński
- ITU-R BT.500-15 — Subjective quality assessment
- The Netflix Tech Blog: VMAF
Studies
- Image Codec Comparison — Google
- Cloudinary Image Format Study — 1.4M opinions
- Are We Compressed Yet? — Video codec comparison
Test Images
- codec-corpus — Reference images for calibration
Contributing
See CONTRIBUTING.md for the full guide.
We especially want contributions from:
- Codec developers (mozjpeg, jpegli, libavif, webp, etc.) — integration examples, quality scale docs, edge cases
- Metrics researchers — new metrics, calibration data, perception thresholds
- Anyone — docs, tests, bug reports, benchmark results
This project is designed to be community-driven. Fork it, experiment, share what you learn.
License
This guide is released under CC0 — use freely without attribution.