codec-corpus
Runtime access to the imazen/codec-corpus test image collection. No data ships with the crate — datasets download on first use and cache locally.
let corpus = codec_corpus::Corpus::new()?;
let valid = corpus.get("some-folder/optional-subpath")?;
for entry in std::fs::read_dir(&valid)? { /* ... */ }
What it does
- You call corpus.get("some-folder/optional-subpath").
- If the folder isn't cached (or the cache is stale), it downloads via git sparse checkout or HTTP tarball.
- Returns the local path. Done.
Downloads use shell git (preferred) with fallback to curl/wget/powershell. No heavy HTTP crate dependencies.
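The tool fallback described above can be sketched roughly as follows. This is an illustration only, not the crate's actual code: the `Tool` enum, `pick_tool`, and `on_path` are hypothetical names, and the order among curl, wget, and PowerShell is assumed from the order in which they are listed.

```rust
/// Download tools in the preference order the text lists (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tool {
    Git,
    Curl,
    Wget,
    PowerShell,
}

impl Tool {
    /// Executable name to look for on PATH (assumed names).
    pub fn program(self) -> &'static str {
        match self {
            Tool::Git => "git",
            Tool::Curl => "curl",
            Tool::Wget => "wget",
            Tool::PowerShell => "powershell",
        }
    }
}

/// Return the first tool for which `is_available` reports true.
/// The probe is injected so the selection logic can be tested offline.
pub fn pick_tool(is_available: impl Fn(&str) -> bool) -> Option<Tool> {
    [Tool::Git, Tool::Curl, Tool::Wget, Tool::PowerShell]
        .into_iter()
        .find(|tool| is_available(tool.program()))
}

/// One possible probe: try to spawn `<program> --version` and check the
/// exit status. A real probe may need tool-specific flags.
pub fn on_path(program: &str) -> bool {
    std::process::Command::new(program)
        .arg("--version")
        .stdout(std::process::Stdio::null())
        .stderr(std::process::Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}
```

Calling `pick_tool(on_path)` would select git whenever it is installed and only fall through to the HTTP tools otherwise. Shelling out this way is what keeps HTTP client crates out of the dependency tree.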
Install
[dependencies]
codec-corpus = "1"
Usage
use codec_corpus::Corpus;
// network access required
Custom cache location
let corpus = Corpus::with_cache_root("/mnt/fast-storage")?;
Or via environment variable:
CODEC_CORPUS_CACHE=/mnt/fast-storage
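How an environment override might combine with the platform default can be sketched as below. Both function names are hypothetical. The sketch assumes the variable replaces the whole cache root rather than its parent directory, and that an empty value counts as unset; neither detail is stated above.

```rust
use std::path::{Path, PathBuf};

/// Pick the cache root: an explicit override wins, otherwise fall back to
/// `<platform cache dir>/codec-corpus`. Hypothetical helper for illustration.
pub fn resolve_cache_root(
    env_override: Option<&str>,
    platform_cache_dir: Option<&Path>,
) -> Option<PathBuf> {
    match env_override {
        // Treat an empty variable as unset (an assumption, not documented).
        Some(dir) if !dir.is_empty() => Some(PathBuf::from(dir)),
        _ => platform_cache_dir.map(|base| base.join("codec-corpus")),
    }
}

/// Read the real environment. The platform directory is passed in so the
/// sketch stays dependency-free.
pub fn cache_root_from_env(platform_cache_dir: Option<&Path>) -> Option<PathBuf> {
    let from_env = std::env::var("CODEC_CORPUS_CACHE").ok();
    resolve_cache_root(from_env.as_deref(), platform_cache_dir)
}
```

Keeping the decision in a pure function means the precedence rules can be unit-tested without touching the process environment.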
Check what's cached
if corpus.is_cached("pngsuite") { /* ... */ }
for name in corpus.list_cached() { /* ... */ }
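A listing like this could, in principle, be produced by scanning the versioned cache directory for dataset folders. The sketch below is an assumption about how that might work, not the crate's implementation; `list_cached_dirs` is a hypothetical name, and it skips dot-files such as the `.version` and `.lock` entries described under Cache layout.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// List dataset folders directly under a versioned cache directory.
/// Hidden entries such as `.version` and `.lock` are skipped.
pub fn list_cached_dirs(versioned_root: &Path) -> io::Result<Vec<String>> {
    let mut names = Vec::new();
    for entry in fs::read_dir(versioned_root)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if entry.file_type()?.is_dir() && !name.starts_with('.') {
            names.push(name);
        }
    }
    names.sort(); // read_dir order is platform-dependent
    Ok(names)
}
```

Sorting the result makes the output stable across platforms, which is convenient when the list is printed in test logs.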
Datasets
Any top-level folder in the codec-corpus repo is a valid path. Pass any path into get() — the first component determines the download unit.
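To make "the first component determines the download unit" concrete, here is a minimal sketch of extracting that component. `download_unit` is a hypothetical helper, and rejecting paths that start with `/`, `.` or `..` is an assumption about sensible behavior rather than documented behavior.

```rust
use std::path::{Component, Path};

/// Return the top-level folder of a corpus path, i.e. the download unit.
/// Paths that do not start with a plain folder name yield `None`.
pub fn download_unit(path: &str) -> Option<&str> {
    match Path::new(path).components().next()? {
        Component::Normal(name) => name.to_str(),
        _ => None, // leading `/`, `.`, `..`, or a Windows prefix
    }
}
```

Under this sketch, `download_unit("some-folder/optional-subpath")` yields `Some("some-folder")`, so requesting a subpath would still fetch the whole top-level folder.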
Quality calibration
| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| clic2025 | 218 MB | 64 | High-res photos for codec evaluation (~2048px) | Unsplash |
| CID22 | 86 MB | 251 | Diverse 512×512 images (209 training + 41 validation) | CC BY-SA 4.0 |
| kadid10k | 25 MB | 82 | Pristine IQA reference images, 512×384 | Pixabay |
| gb82 | 9.6 MB | 26 | Challenging CC0 photos, 576×576 | CC0 |
| gb82-sc | 3.0 MB | 11 | Screenshots and screen content | CC0 |
| qoi-benchmark | 39 MB | 18 | Full-page web screenshots | CC0 |
Format conformance
| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| jpeg-conformance | 76 MB | 277 | 41 valid, 116 invalid, 40 non-conformant, 78 crash repros | MIT/IJG+BSD |
| jxl | 88 MB | 188 | Conformance, features, edge cases, photographic | BSD-3-Clause |
| png-conformance | 1.7 MB | 11 | Real-world PNG edge cases (decompressor crashes, filter bugs) | Various |
| pngsuite | 720 KB | 178 | Official PNG conformance tests (all color types, depths) | Freeware |
| webp-conformance | 1.3 MB | 230 | 225 valid WebP files + sources | Various |
Decoder robustness
| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| zune | 33 MB | 3,434 | Fuzz corpus (1,836 JPEG + 837 PNG) and test images | MIT/Apache-2.0/Zlib |
| image-rs | 4.5 MB | 127 | Multi-format edge cases (BMP, GIF, JPEG, PNG, TIFF, WebP) | MIT |
| imageflow | 7.8 MB | 51 | Orientation, format conversion edge cases | Various |
| mozjpeg | 1.2 MB | 16 | MozJPEG encoder reference files | IJG + BSD |
Full dataset descriptions and per-file attribution: codec-corpus README.
Cache layout
~/.cache/codec-corpus/v1/            # Linux
~/Library/Caches/codec-corpus/v1/    # macOS
%LOCALAPPDATA%\codec-corpus\v1\      # Windows
  .version                           # "1.0.0"; triggers re-download on version change
  .lock                              # fd-lock for concurrent access
  pngsuite/
  jpeg-conformance/
  ...
Different major versions coexist (v1/, v2/). Any crate version change within a major version triggers a re-download to ensure correctness.
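The versioning rule can be sketched as two small checks. Both helpers are hypothetical, and the exact string comparison against the contents of `.version` is an assumption about how "any crate version change" is detected.

```rust
/// Directory name for a crate version: major "1" maps to "v1".
pub fn major_dir(crate_version: &str) -> Option<String> {
    let major = crate_version.split('.').next()?;
    // Reject anything that is not a plain number, such as "" or "x".
    major.parse::<u64>().ok().map(|m| format!("v{m}"))
}

/// A cache is stale when `.version` is missing or differs from the
/// running crate version in any way (exact string comparison, assumed).
pub fn is_stale(cached_version: Option<&str>, crate_version: &str) -> bool {
    match cached_version {
        Some(v) => v.trim() != crate_version,
        None => true,
    }
}
```

With these rules, 1.0.0 and 1.0.1 would share the `v1/` directory but the second would refresh its contents, while a 2.x release would use a separate `v2/` directory and leave `v1/` untouched.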
CI integration
- uses: actions/cache@v4
  with:
    path: ~/.cache/codec-corpus
    key: corpus-v1
- run: cargo test --release -- --ignored
No special setup. The crate handles downloading; the CI cache avoids re-downloading across runs.
Dependencies
Two Rust crates, both small:
- dirs: cross-platform cache directory
- fd-lock: file locking for concurrent safety
Archive extraction uses the system tar command. No reqwest, ureq, gix, serde, or toml.
License
The crate itself is Apache-2.0. Each dataset in the corpus has its own license; see the tables above and the full license summary.