# codec-corpus
Runtime access to the imazen/codec-corpus test image collection and third-party fuzz/conformance corpora. No data ships with the crate — datasets download on first use and cache locally.
```rust
use codec_corpus::Corpus;

let corpus = Corpus::new()?;
let valid = corpus.get("jpeg-conformance/valid")?;
for entry in std::fs::read_dir(&valid)? {
    // ...
}
```
## What it does

- You call `corpus.get("some-folder/optional-subpath")`
- If the folder isn't cached (or the cache is stale), it downloads via git sparse checkout or HTTP tarball
- Returns the local path. Done.
Downloads use shell git (preferred) with fallback to curl/wget/powershell. No heavy HTTP crate dependencies.
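The tool selection can be sketched as a probe down the fallback chain. This is an illustrative sketch, not the crate's actual code; probing with `--version` and the exact ordering are assumptions:

```rust
use std::process::Command;

// Returns true if `tool` is on PATH and runs. Probing via
// `--version` is an assumption made for this sketch.
fn tool_available(tool: &str) -> bool {
    Command::new(tool)
        .arg("--version")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false)
}

// Walk the fallback chain: git first, then the HTTP fallbacks.
fn pick_downloader() -> Option<&'static str> {
    ["git", "curl", "wget", "powershell"]
        .into_iter()
        .find(|t| tool_available(t))
}

fn main() {
    match pick_downloader() {
        Some(tool) => println!("would download with {tool}"),
        None => println!("no downloader available; cannot fetch datasets"),
    }
}
```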
## Install

```toml
[dependencies]
codec-corpus = "1"
```
## Usage
### imazen/codec-corpus (default)

Any path without a recognized third-party prefix downloads from imazen/codec-corpus:

```rust
use codec_corpus::Corpus;

// network access required
let corpus = Corpus::new()?;
let pngsuite = corpus.get("pngsuite")?;
```
### Third-party sources (built-in registry)

Prefixed paths resolve to known external sources automatically:

```rust
let corpus = Corpus::new().unwrap();

// OSS-Fuzz backup corpora (downloaded as ZIP from GCS)
let ossfuzz_jpeg = corpus.get("oss-fuzz/libjpeg-turbo").unwrap();
let ossfuzz_png = corpus.get("oss-fuzz/libpng").unwrap();
let ossfuzz_jxl = corpus.get("oss-fuzz/libjxl").unwrap();

// dvyukov/go-fuzz-corpus subfolders (git sparse checkout)
let gofuzz_gif = corpus.get("go-fuzz-corpus/gif").unwrap();
let gofuzz_png = corpus.get("go-fuzz-corpus/png").unwrap();
let gofuzz_jpeg = corpus.get("go-fuzz-corpus/jpeg").unwrap();

// libjpeg-turbo fuzz seed corpus (git sparse checkout)
let ljt_fuzz = corpus.get("libjpeg-turbo-fuzz").unwrap();

// image-rs/image test images (git sparse checkout; any {subpath} works)
let imagers = corpus.get("image-rs/tests/images").unwrap();
```
Third-party sources are cached under `third-party/` in the cache root and re-fetched if the `.fetched` marker is older than 7 days (configurable via `with_max_age()`).
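The staleness rule can be sketched with a hypothetical helper that treats the `.fetched` file's mtime as the fetch time (the crate may store a timestamp in the file contents instead):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

// Stale when the marker is missing or older than `max_age`.
// Using the file mtime is an assumption made for this sketch.
fn is_stale(marker: &Path, max_age: Duration) -> bool {
    match fs::metadata(marker).and_then(|m| m.modified()) {
        Ok(fetched) => SystemTime::now()
            .duration_since(fetched)
            .map(|age| age > max_age)
            .unwrap_or(false), // marker dated in the future counts as fresh
        Err(_) => true, // no marker yet: must fetch
    }
}

fn main() {
    let week = Duration::from_secs(7 * 24 * 60 * 60);
    // A missing marker always reads as stale.
    println!("{}", is_stale(Path::new("/no/such/.fetched"), week));
}
```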
### Ad-hoc sources

Fetch from arbitrary locations without needing them in the built-in registry (argument shapes below are illustrative; check the crate's API docs for the exact signatures):

```rust
let corpus = Corpus::new().unwrap();

// Arbitrary GitHub repo subfolder
let path = corpus.github_repo("owner", "repo", "path/in/repo").unwrap();

// Download and cache a ZIP URL
let path = corpus.zip_url("https://example.com/corpus.zip").unwrap();

// Download and cache a tarball URL
let path = corpus.tar_url("https://example.com/corpus.tar.gz").unwrap();

// Point at a local directory (validates existence, no download)
let path = corpus.local_path("/data/my-images").unwrap();
```
### Custom cache location

```rust
let corpus = Corpus::with_cache_root("/mnt/fast-storage")?;
```

Or via environment variable:

```sh
CODEC_CORPUS_CACHE=/mnt/fast-storage
```
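The resolution order implied here (explicit override wins, then the platform default) can be sketched as follows. `cache_root` is a hypothetical helper, and the Linux-style default path is hardcoded where the crate uses the `dirs` crate:

```rust
use std::path::PathBuf;

// Hypothetical resolution: explicit override beats the platform default.
// In real use you would pass `std::env::var("CODEC_CORPUS_CACHE").ok().as_deref()`.
fn cache_root(override_dir: Option<&str>, home: &str) -> PathBuf {
    match override_dir {
        Some(dir) => PathBuf::from(dir),
        // the crate resolves this per-platform via the `dirs` crate
        None => PathBuf::from(home).join(".cache/codec-corpus/v1"),
    }
}

fn main() {
    println!("{}", cache_root(Some("/mnt/fast-storage"), "/home/me").display());
    println!("{}", cache_root(None, "/home/me").display());
}
```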
### Cache staleness

Third-party sources use a `.fetched` timestamp marker. By default, sources older than 7 days are re-fetched:

```rust
use std::time::Duration;

let corpus = Corpus::new()?
    .with_max_age(Duration::from_secs(30 * 24 * 60 * 60)); // 30 days
```
### Check what's cached

```rust
if corpus.is_cached("pngsuite") {
    // ...
}
for name in corpus.list_cached() {
    println!("{name}");
}
```
## Built-in third-party registry

| Prefix | Source | Download method |
|---|---|---|
| `oss-fuzz/libjpeg-turbo` | OSS-Fuzz backup corpus (cjpeg_fuzzer) | ZIP from GCS |
| `oss-fuzz/libpng` | OSS-Fuzz backup corpus (read_fuzzer) | ZIP from GCS |
| `oss-fuzz/libjxl` | OSS-Fuzz backup corpus (djxl_fuzzer) | ZIP from GCS |
| `go-fuzz-corpus/gif` | dvyukov/go-fuzz-corpus gif/corpus | git sparse checkout |
| `go-fuzz-corpus/png` | dvyukov/go-fuzz-corpus png/corpus | git sparse checkout |
| `go-fuzz-corpus/jpeg` | dvyukov/go-fuzz-corpus jpeg/corpus | git sparse checkout |
| `libjpeg-turbo-fuzz` | libjpeg-turbo/fuzz seed_corpus | git sparse checkout |
| `image-rs/{subpath}` | image-rs/image {subpath} | git sparse checkout |
## Datasets (imazen/codec-corpus)

Any top-level folder in the codec-corpus repo is a valid path. Pass any path into `get()` — the first component determines the download unit.
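The "first component determines the download unit" rule can be sketched as a simple split. `split_unit` is a hypothetical helper, and this sketch covers only imazen datasets, since third-party prefixes like `oss-fuzz/libjpeg-turbo` span two components:

```rust
// Split "folder/optional/subpath" into (download unit, remaining subpath).
fn split_unit(path: &str) -> (&str, Option<&str>) {
    match path.split_once('/') {
        Some((unit, rest)) => (unit, Some(rest)),
        None => (path, None),
    }
}

fn main() {
    // A subpath narrows what you get back, but the whole
    // top-level folder is still the unit that downloads.
    assert_eq!(
        split_unit("jpeg-conformance/valid"),
        ("jpeg-conformance", Some("valid"))
    );
    assert_eq!(split_unit("pngsuite"), ("pngsuite", None));
    println!("ok");
}
```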
### Quality calibration

| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| `clic2025` | 218 MB | 64 | High-res photos for codec evaluation (~2048px) | Unsplash |
| `CID22` | 86 MB | 251 | Diverse 512x512 images (209 training + 41 validation) | CC BY-SA 4.0 |
| `kadid10k` | 25 MB | 82 | Pristine IQA reference images, 512x384 | Pixabay |
| `gb82` | 9.6 MB | 26 | Challenging CC0 photos, 576x576 | CC0 |
| `gb82-sc` | 3.0 MB | 11 | Screenshots and screen content | CC0 |
| `qoi-benchmark` | 39 MB | 18 | Full-page web screenshots | CC0 |
### Format conformance

| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| `bmp-conformance` | 2.1 MB | 126 | BMP spec conformance (valid, invalid, non-conformant, crash repros) | MIT |
| `jpeg-conformance` | 76 MB | 277 | 41 valid, 116 invalid, 40 non-conformant, 78 crash repros | MIT/IJG+BSD |
| `jxl` | 88 MB | 188 | Conformance, features, edge cases, photographic | BSD-3-Clause |
| `png-conformance` | 1.7 MB | 11 | Real-world PNG edge cases (decompressor crashes, filter bugs) | Various |
| `pngsuite` | 720 KB | 178 | Official PNG conformance tests (all color types, depths) | Freeware |
| `webp-conformance` | 1.3 MB | 230 | 225 valid WebP files + sources | Various |
### Decoder robustness

| Dataset | Download | Files | Description | License |
|---|---|---|---|---|
| `zune` | 33 MB | 3,434 | Fuzz corpus (1,836 JPEG + 837 PNG) and test images | MIT/Apache-2.0/Zlib |
| `image-rs` | 4.5 MB | 127 | Multi-format edge cases (BMP, GIF, JPEG, PNG, TIFF, WebP) | MIT |
| `imageflow` | 7.8 MB | 51 | Orientation, format conversion edge cases | Various |
| `mozjpeg` | 1.2 MB | 16 | MozJPEG encoder reference files | IJG + BSD |
Full dataset descriptions and per-file attribution: codec-corpus README.
## Cache layout

```text
~/.cache/codec-corpus/v1/              # Linux
~/Library/Caches/codec-corpus/v1/      # macOS
%LOCALAPPDATA%\codec-corpus\v1\        # Windows

.version                    # "1.0.3" — triggers re-download on version change
.lock                       # fd-lock for concurrent access
pngsuite/
jpeg-conformance/
third-party/                # Third-party sources (not version-gated)
  oss-fuzz__libpng/
    .fetched                # Unix timestamp — re-fetch if >7 days old
    ...files...
  go-fuzz-corpus__gif/
  github__owner__repo__path/
  ...
```
Different major versions coexist (`v1/`, `v2/`). imazen/codec-corpus datasets re-download on any crate version change. Third-party sources use time-based staleness (default 7 days) and survive version changes.
## CI integration

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.cache/codec-corpus
    key: corpus-v1
- run: cargo test --release -- --ignored
```
No special setup. The crate handles downloading; the CI cache avoids re-downloading across runs.
## Using codec-corpus from WASM

codec-corpus compiles for `wasm32-wasip1` (and `wasm32-unknown-unknown`). On WASM, downloads are not available since the crate relies on host subprocesses (git, curl). Instead, prefetch on the host and preopen the cache for the WASM runtime.
### 1. Prefetch on the host (native)

```sh
# Download the datasets you need into a local cache directory
CODEC_CORPUS_CACHE=/corpus
```

Or just run your native tests once — `Corpus::new()` populates `~/.cache/codec-corpus/` automatically.
### 2. Run WASM tests with a preopened cache

```sh
# Map the host cache directory into the WASM sandbox
CODEC_CORPUS_CACHE=/corpus \
CARGO_TARGET_WASM32_WASIP1_RUNNER="wasmtime --dir /corpus::/corpus" \
cargo test --target wasm32-wasip1
```

The `--dir host_path::guest_path` flag gives the WASM process read access to the host directory at the guest mount point.
### 3. In your WASM code

```rust
// Point at the preopened cache root
let corpus = Corpus::with_cache_root("/corpus").unwrap();

// These work — they only read the filesystem
assert!(corpus.is_cached("pngsuite"));
let datasets = corpus.list_cached();
let path = corpus.get("pngsuite").unwrap();

// This returns Error::DownloadUnsupported — by design
let err = corpus.get("not-yet-cached").unwrap_err();
// "downloads are not supported on this platform; dataset 'not-yet-cached' must be pre-cached on the host"
```
Calling `get()` for an uncached dataset on WASM returns `Error::DownloadUnsupported` (not `NetworkUnavailable`). This is intentional — downloads must happen host-side.
## Dependencies

Four Rust crates, all small:

- `dirs` — cross-platform cache directory
- `fd-lock` — file locking for concurrent safety (native only, not on WASM)
- `serde` + `serde_json` — manifest parsing for R2/JSONL corpus sources
Archive extraction uses system commands (`tar`, `unzip`, `powershell`). No `reqwest`, `ureq`, `gix`, or `toml`.
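Shelling out for extraction might look like this sketch. `extract` is a hypothetical helper and the exact flags the crate passes may differ:

```rust
use std::path::Path;
use std::process::Command;

// Dispatch on extension: `unzip` for ZIPs, `tar` for everything else.
fn extract(archive: &Path, dest: &Path) -> std::io::Result<bool> {
    let status = match archive.extension().and_then(|e| e.to_str()) {
        Some("zip") => Command::new("unzip")
            .args(["-q", "-o"]) // quiet, overwrite existing files
            .arg(archive)
            .arg("-d")
            .arg(dest)
            .status()?,
        _ => Command::new("tar")
            .arg("-xf") // -f auto-detects gzip/xz compression
            .arg(archive)
            .arg("-C")
            .arg(dest)
            .status()?,
    };
    Ok(status.success())
}

fn main() {
    // Usage would be e.g. extract(Path::new("corpus.tar.gz"), Path::new("out")).
    let _ = extract;
    println!("sketch only");
}
```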
## License
The crate itself is Apache-2.0. Each dataset in the corpus has its own license — see the table above and the full license summary.