# Codec Comparison Guide
A practical guide to comparing image codecs fairly, with metrics accuracy data, viewing condition considerations, and scientific methodology.
## For Codec Developers
**Integrating your codec?** See **[INTEGRATION.md](INTEGRATION.md)** for:
- Wiring up encode/decode callbacks
- MozJPEG, Jpegli, AVIF examples
- CI quality regression testing
- Interpreting DSSIM thresholds
**Want to improve this tool?** See **[CONTRIBUTING.md](CONTRIBUTING.md)**. We actively want input from codec developers—you know your domain better than we do.
```bash
# Quick start
cargo add codec-eval --git https://github.com/imazen/codec-eval
# Or use the CLI
cargo install --git https://github.com/imazen/codec-eval codec-eval-cli
```
### The codec-eval Library
**API-first design**: You provide encode/decode callbacks, the library handles everything else.
```rust
use codec_eval::{EvalSession, EvalConfig, ViewingCondition};
let config = EvalConfig::builder()
.report_dir("./reports")
.viewing(ViewingCondition::desktop())
.quality_levels(vec![60.0, 80.0, 95.0])
.build();
let mut session = EvalSession::new(config);
session.add_codec("my-codec", "1.0", Box::new(|image, request| {
my_codec::encode(image, request.quality)
}));
let report = session.evaluate_corpus("./test_images")?;
```
**Features:**
- DSSIM and PSNR metrics (SSIMULACRA2 planned)
- Viewing condition modeling (desktop, mobile, retina)
- Corpus management with sparse checkout for large repos
- CSV import for third-party benchmark results
- Pareto front analysis and BD-Rate calculation
- JSON/CSV report generation
---
## Quick Reference
| PSNR | ~67% | Legacy benchmarks only |
| SSIM/DSSIM | ~82% | Quick approximation |
| Butteraugli | 80-91% | High-quality threshold (score < 1.0) |
| SSIMULACRA2 | 87-98% | **Recommended** — best overall accuracy |
| VMAF | ~90% | Video, large datasets |
---
## Fair Comparison Principles
Based on [Kornel Lesiński's guide](https://kornel.ski/en/faircomparison):
### 1. Never Convert Between Lossy Formats
```
❌ JPEG → WebP → AVIF (each conversion adds artifacts)
✓ PNG/TIFF → WebP
✓ PNG/TIFF → AVIF
```
Always start from a lossless source. Converting lossy→lossy compounds artifacts and skews results.
### 2. Standardize Encoder Settings
Don't compare `mozjpeg -quality 80` against `cjxl -quality 80` — quality scales differ between encoders.
Instead, match by:
- **File size** — encode to the same byte budget, compare quality
- **Quality metric** — encode to the same SSIMULACRA2 score, compare file size
### 3. Use Multiple Images
A single test image can favor certain codecs. Use diverse datasets:
- [Kodak](https://github.com/imazen/codec-corpus) — 24 classic benchmark images
- [CLIC 2025](https://github.com/imazen/codec-corpus) — 62 high-resolution images
- [CID22](https://github.com/imazen/codec-corpus) — 250 perceptual quality research images
### 4. Test at Multiple Quality Levels
Codec rankings change across the quality spectrum:
- **High quality** (SSIMULACRA2 > 80): Differences minimal
- **Medium quality** (60-80): Most visible differences
- **Low quality** (< 50): Edge cases, artifacts become dominant
### 5. Consider Encode/Decode Speed
A codec that's 5% smaller but 100x slower may not be practical. Report:
- Encode time (CPU seconds)
- Decode time (critical for web)
- Memory usage
---
## Quality Metrics Deep Dive
### SSIMULACRA2 (Recommended)
The current best metric for perceptual quality assessment.
| < 30 | Poor | Thumbnails, previews |
| 40-50 | Low | Aggressive compression |
| 50-70 | Medium | General web images |
| 70-80 | Good | Photography sites |
| 80-85 | Very High | Professional/archival |
| > 85 | Excellent | Near-lossless |
**Accuracy**: 87% overall, up to 98% on high-confidence comparisons.
**Tool**: [ssimulacra2_rs](https://github.com/rust-av/ssimulacra2)
```bash
ssimulacra2 original.png compressed.jpg
```
### DSSIM
Structural similarity, derived from SSIM but outputs distance (lower = better).
**Accuracy**: Validated against [TID2013](http://www.ponomarenko.info/tid2013.htm) database:
- Spearman correlation: -0.84 to -0.95 (varies by distortion type)
- Best on: Noise, compression artifacts, blur
- Weaker on: Exotic distortions, color shifts
**Tool**: [dssim](https://github.com/kornelski/dssim)
```bash
dssim original.png compressed.jpg
```
| < 0.001 | Visually identical |
| 0.001-0.01 | Excellent |
| 0.01-0.05 | Good |
| 0.05-0.10 | Acceptable |
| > 0.10 | Noticeable artifacts |
**Note**: Values are not directly comparable between DSSIM versions. Always report version.
### Butteraugli
Google's perceptual metric, good for high-quality comparisons.
**Accuracy**: 80-91% (varies by image type).
**Best for**: Determining if compression is "transparent" (score < 1.0).
**Limitation**: Less reliable for heavily compressed images.
### VMAF
Netflix's Video Multi-Method Assessment Fusion.
**Accuracy**: ~90% for video, slightly less for still images.
**Best for**: Large-scale automated testing, video frames.
### PSNR (Avoid)
Peak Signal-to-Noise Ratio — purely mathematical, ignores perception.
**Accuracy**: ~67% — only slightly better than chance.
**Use only**: For backwards compatibility with legacy benchmarks.
---
## Viewing Conditions
### Pixels Per Degree (PPD)
The number of pixels that fit in one degree of visual field. Critical for assessing when compression artifacts become visible.
| 30 | 1080p at arm's length | Casual viewing |
| 60 | 20/20 vision threshold | Most artifacts visible |
| 80 | Average human acuity limit | Diminishing returns above this |
| 120 | 4K at close range | Overkill for most content |
| 159 | iPhone 15 Pro | "Retina" display density |
**Formula**:
```
PPD = (distance_inches × resolution_ppi × π) / (180 × viewing_distance_inches)
```
### Device Categories
| Desktop monitor | 40-80 | Medium quality acceptable |
| Laptop | 80-120 | Higher quality needed |
| Smartphone | 120-160 | Very high quality or artifacts visible |
| 4K TV at 3m | 30-40 | More compression acceptable |
### Practical Implications
1. **Mobile-first sites** need higher quality settings (SSIMULACRA2 > 70)
2. **Desktop sites** can use more aggressive compression (SSIMULACRA2 50-70)
3. **Thumbnails** can be heavily compressed regardless of device
4. **Hero images** on retina displays need minimal compression
---
## Scientific Methodology
### ITU-R BT.500
The international standard for subjective video/image quality assessment.
**Key elements**:
- Controlled viewing conditions (luminance, distance, display calibration)
- Non-expert viewers (15-30 recommended)
- 5-grade Mean Opinion Score (MOS):
- 5: Excellent
- 4: Good
- 3: Fair
- 2: Poor
- 1: Bad
- Statistical analysis with confidence intervals
**When to use**: Final validation of codec choices, publishing research.
### Presentation Methods
| **DSIS** | Show reference, then test image | Impairment detection |
| **DSCQS** | Side-by-side, both unlabeled | Quality comparison |
| **2AFC** | "Which is better?" forced choice | Fine discrimination |
| **Flicker test** | Rapid A/B alternation | Detecting subtle differences |
---
## Human A/B Testing
When metrics aren't enough, subjective testing provides ground truth. But poorly designed studies produce unreliable data.
### Study Design
**Randomization**:
- Randomize presentation order (left/right, first/second)
- Randomize image order across participants
- Balance codec appearances to avoid order effects
**Blinding**:
- Participants must not know which codec produced which image
- Use neutral labels ("Image A" / "Image B")
- Don't reveal hypothesis until after data collection
**Controls**:
- Include known quality differences as sanity checks
- Add duplicate pairs to measure participant consistency
- Include "same image" pairs to detect bias
### Sample Size
| Large quality difference (obvious) | 15 | 20-30 |
| Medium difference (noticeable) | 30 | 50-80 |
| Small difference (subtle) | 80 | 150+ |
**Power analysis**: For 80% power to detect a 0.5 MOS difference with SD=1.0, you need ~64 participants per condition.
### Participant Screening
**Pre-study**:
- Visual acuity test (corrected 20/40 or better)
- Color vision screening (Ishihara plates)
- Display calibration verification
**Exclusion criteria** (define before data collection):
- Failed attention checks (> 20% incorrect on known pairs)
- Inconsistent responses (< 60% agreement on duplicate pairs)
- Response time outliers (< 200ms suggests random clicking)
- Incomplete sessions (< 80% of trials)
### Attention Checks
Embed these throughout the study:
```
Types of attention checks:
1. Obvious pairs - Original vs heavily compressed (SSIMULACRA2 < 30)
2. Identical pairs - Same image twice (should report "same" or 50/50 split)
3. Reversed pairs - Same comparison shown twice, order flipped
4. Instructed response - "For this pair, select the LEFT image"
```
**Threshold**: Exclude participants who fail > 2 attention checks or > 20% of obvious pairs.
### Bias Detection & Correction
**Position bias**: Tendency to favor left/right or first/second.
- Detect: Chi-square test on position choices across all trials
- Correct: Counter-balance positions; exclude participants with > 70% same-side choices
**Fatigue effects**: Quality judgments degrade over time.
- Detect: Compare accuracy on attention checks early vs late in session
- Correct: Limit sessions to 15-20 minutes; analyze by time block
**Anchoring**: First few images bias subsequent judgments.
- Detect: Compare ratings for same image shown early vs late
- Correct: Use practice trials (discard data); randomize order
**Central tendency**: Avoiding extreme ratings.
- Detect: Histogram of ratings (should use full scale)
- Correct: Use forced choice (2AFC) instead of rating scales
### Statistical Analysis
**For rating data (MOS)**:
```
1. Calculate mean and 95% CI per condition
2. Check normality (Shapiro-Wilk) - often violated
3. Use robust methods:
- Trimmed means (10-20% trim)
- Bootstrap confidence intervals
- Non-parametric tests (Wilcoxon, Kruskal-Wallis)
4. Report effect sizes (Cohen's d, or MOS difference)
```
**For forced choice (2AFC)**:
```
1. Calculate preference percentage per pair
2. Binomial test for significance (H0: 50%)
3. Apply multiple comparison correction:
- Bonferroni (conservative)
- Holm-Bonferroni (less conservative)
- Benjamini-Hochberg FDR (for many comparisons)
4. Report: "Codec A preferred 67% of time (p < 0.01, N=100)"
```
**Outlier handling**:
```
1. Define criteria BEFORE analysis (pre-registration)
2. Report both with and without outlier exclusion
3. Use robust statistics that down-weight outliers
4. Never exclude based on "inconvenient" results
```
### Reporting Results
**Always include**:
- Sample size (N) and exclusion count with reasons
- Confidence intervals, not just p-values
- Effect sizes in meaningful units (MOS points, % preference)
- Individual data points or distributions (not just means)
- Attention check pass rates
- Participant demographics (if relevant to display/vision)
**Example**:
> "N=87 participants completed the study (12 excluded: 8 failed attention checks, 4 incomplete).
> Codec A was preferred over Codec B in 62% of comparisons (95% CI: 55-69%, p=0.003, binomial test).
> This corresponds to a mean quality difference of 0.4 MOS points (95% CI: 0.2-0.6)."
### Common Pitfalls
| Pitfall | Problem | Solution |
|---------|---------|----------|
| Small N | Underpowered, unreliable | Power analysis before study |
| No attention checks | Can't detect random responders | Embed 10-15% check trials |
| Post-hoc exclusion | Cherry-picking results | Pre-register exclusion criteria |
| Only reporting means | Hides variability | Show distributions + CI |
| Multiple comparisons | Inflated false positives | Apply correction (Bonferroni, FDR) |
| Unbalanced design | Confounds codec with position/order | Full counterbalancing |
| Lab-only testing | May not generalize | Include diverse participants/displays |
### Real-World Studies
**Cloudinary 2021** (1.4 million opinions):
- JPEG XL: 10-15% better than AVIF at web quality levels
- AVIF: Best for low-bandwidth scenarios
- WebP: Solid middle ground
- All modern codecs beat JPEG by 25-35%
---
## Recommended Workflow
### For Quick Comparisons
```bash
# 1. Encode to same file size
convert source.png -define webp:target-size=50000 output.webp
cjxl source.png output.jxl --target_size 50000
# 2. Measure with SSIMULACRA2
ssimulacra2 source.png output.webp
ssimulacra2 source.png output.jxl
```
### For Thorough Evaluation
1. **Gather diverse test images** from [codec-corpus](https://github.com/imazen/codec-corpus)
2. **Create quality ladder** (10 quality levels per codec)
3. **Compute metrics** for each combination
4. **Plot rate-distortion curves** (file size vs quality)
5. **Consider encode/decode speed**
6. **Validate with subjective testing** if publishing results
---
## Tools & Implementations
### SSIMULACRA2
| [ssimulacra2_rs](https://crates.io/crates/ssimulacra2_rs) | CLI (Rust) | `cargo install ssimulacra2_rs` | **Recommended** |
| [ssimulacra2](https://crates.io/crates/ssimulacra2) | Library (Rust) | `cargo add ssimulacra2` | For integration |
| [ssimulacra2-cuda](https://crates.io/crates/ssimulacra2-cuda) | GPU (CUDA) | `cargo install ssimulacra2-cuda` | Fast batch processing |
| [libjxl](https://github.com/libjxl/libjxl) | CLI (C++) | Build from source | Original implementation |
```bash
# Install CLI
cargo install ssimulacra2_rs
# Usage
ssimulacra2_rs original.png compressed.jpg
# Output: 76.543210 (higher = better, scale 0-100)
```
### DSSIM
| [dssim](https://github.com/kornelski/dssim) | CLI (Rust) | `cargo install dssim` | **Recommended** |
```bash
# Install
cargo install dssim
# Basic comparison (lower = better)
dssim original.png compressed.jpg
# Output: 0.02341
# Generate difference visualization
dssim -o difference.png original.png compressed.jpg
```
**Accuracy**: Validated against TID2013 database. Spearman correlation -0.84 to -0.95 depending on distortion type.
### Butteraugli
| [butteraugli](https://github.com/google/butteraugli) | CLI (C++) | Build from source | Original |
| [libjxl](https://github.com/libjxl/libjxl) | CLI (C++) | Build from source | Includes butteraugli |
### VMAF
| [libvmaf](https://github.com/Netflix/vmaf) | CLI + Library | Package manager or build | Official Netflix implementation |
```bash
# Ubuntu/Debian
apt install libvmaf-dev
# Usage (via ffmpeg)
ffmpeg -i original.mp4 -i compressed.mp4 -lavfi libvmaf -f null -
```
### Image Processing
| [imageflow](https://github.com/imazen/imageflow) | High-performance image processing with quality calibration | Rust + C ABI |
| [libvips](https://github.com/libvips/libvips) | Fast image processing library | C + bindings |
| [sharp](https://github.com/lovell/sharp) | Node.js image processing (uses libvips) | npm |
---
## References
### Methodology
- [How to Compare Images Fairly](https://kornel.ski/en/faircomparison) — Kornel Lesiński
- [ITU-R BT.500-15](https://www.itu.int/rec/R-REC-BT.500) — Subjective quality assessment
- [The Netflix Tech Blog: VMAF](https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652)
### Studies
- [Image Codec Comparison](https://storage.googleapis.com/avif-comparison/index.html) — Google
- [Cloudinary Image Format Study](https://cloudinary.com/blog/image-format-study) — 1.4M opinions
- [Are We Compressed Yet?](https://arewecompressedyet.com/) — Video codec comparison
### Test Images
- [codec-corpus](https://github.com/imazen/codec-corpus) — Reference images for calibration
---
## Contributing
See **[CONTRIBUTING.md](CONTRIBUTING.md)** for the full guide.
**We especially want contributions from:**
- **Codec developers** (mozjpeg, jpegli, libavif, webp, etc.) — integration examples, quality scale docs, edge cases
- **Metrics researchers** — new metrics, calibration data, perception thresholds
- **Anyone** — docs, tests, bug reports, benchmark results
This project is designed to be community-driven. Fork it, experiment, share what you learn.
---
## License
This guide is released under [CC0](https://creativecommons.org/publicdomain/zero/1.0/) — use freely without attribution.