## Install

### From source

### Pre-built binaries

Grab the latest release for your platform from the releases page.
## Usage

### 1. Download the path manifest for a crawl

Supported subsets: `segment`, `warc`, `wat`, `wet`, `robotstxt`, `non200responses`, `cc-index`, `cc-index-table`.

Crawl identifier format: `CC-MAIN-YYYY-WW` or `CC-NEWS-YYYY-MM`.
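The crawl identifier format above can be checked with a small regex. This is an illustrative helper, not part of the tool:

```python
import re

# Accepts main crawls (CC-MAIN-YYYY-WW) and news crawls (CC-NEWS-YYYY-MM).
CRAWL_ID = re.compile(r"CC-(MAIN-\d{4}-\d{2}|NEWS-\d{4}-\d{2})")

def is_valid_crawl_id(s: str) -> bool:
    """Return True if s looks like a supported crawl identifier."""
    return CRAWL_ID.fullmatch(s) is not None
```

For example, `is_valid_crawl_id("CC-MAIN-2023-50")` is `True`, while a truncated identifier like `"CC-MAIN-2023"` is rejected.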
### 2. Download the actual data
### Options
| Flag | Description | Default |
|---|---|---|
| `-t` | Number of concurrent downloads | 10 |
| `-r` | Max retries per file | 1000 |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for the Ungoliant pipeline) | off |
| `-s` | Abort on unrecoverable errors (401, 403, 404) | off |
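Each flag has a counterpart in the Python client API documented below. The correspondence, as inferred from this README, can be written out explicitly:

```python
# Correspondence between CLI flags and the Python client API,
# inferred from this README (illustrative, not part of the tool).
FLAG_TO_API = {
    "-t": "Client(threads=...)",    # number of concurrent downloads
    "-r": "Client(retries=...)",    # max retries per file
    "-p": "Client(progress=True)",  # show progress bars
    "-f": ".files_only()",          # flat file output
    "-n": ".numbered()",            # numbered output
    "-s": ".strict()",              # abort on 401/403/404
}
```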
### Example
Note: Keep threads at 10 or below. Too many concurrent requests will get you
403'd by the server, and those errors are unrecoverable.
## Python

### Install

### Usage
Typical usage covers four operations: downloading the path manifest for a crawl, downloading the actual data, flat file output (no directory structure), and numbered output plus strict mode (abort on 401/403/404).
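Using the client API from the API section below, those operations look roughly like the following. The import path `cc_tool`, the crawl identifier, and the file names are placeholders, not the package's real names:

```python
# Placeholder import path; substitute the actual package name.
from cc_tool import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2023-50", "wet").to("wet.paths.gz")

# Download the actual data
client.download("wet.paths.gz").to("data/")

# Flat file output (no directory structure)
client.download("wet.paths.gz").files_only().to("data/")

# Numbered output + strict mode (abort on 401/403/404)
client.download("wet.paths.gz").numbered().strict().to("data/")
```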
### API
- `Client(threads=10, retries=1000, progress=False)` — create a client with shared config.
- `client.paths(snapshot, data_type)` — returns a builder; call `.to(dst)` to download the path manifest.
- `client.download(path_file)` — returns a builder with chainable options:
  - `.files_only()` — flatten the directory structure
  - `.numbered()` — enumerate output files (for Ungoliant)
  - `.strict()` — abort on unrecoverable HTTP errors
  - `.to(dst)` — execute the download
## License
MIT OR Apache-2.0