ccdown 0.6.1

A polite and user-friendly downloader for Common Crawl data.

ccdown

A polite downloader for Common Crawl data, written in Rust.

Install

From crates.io

cargo install ccdown

From source

git clone https://github.com/4thel00z/ccdown.git
cd ccdown
cargo install --path .

Pre-built binaries

Grab the latest release for your platform from the releases page.

Usage

1. Download the path manifest for a crawl

ccdown download-paths CC-MAIN-2025-08 warc ./paths

Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table

Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
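The naming rules above can be checked before invoking the tool. The following Python sketch is not part of ccdown: the regular expression is an assumption derived from the format stated above (four-digit year, two-digit week or month), and check_args is a hypothetical helper name. ccdown's own validation may be stricter or looser.

```python
import re

# Subsets listed in this README.
SUBSETS = {
    "segment", "warc", "wat", "wet", "robotstxt",
    "non200responses", "cc-index", "cc-index-table",
}

# Assumption: four-digit year, then a two-digit week (CC-MAIN) or month (CC-NEWS).
CRAWL_ID = re.compile(r"CC-(MAIN|NEWS)-\d{4}-\d{2}")


def check_args(crawl: str, subset: str) -> bool:
    """Return True if both arguments match the documented formats."""
    return CRAWL_ID.fullmatch(crawl) is not None and subset in SUBSETS
```

With this sketch, check_args("CC-MAIN-2025-08", "warc") is accepted, while check_args("CC-MAIN-2025-8", "warc") is rejected because the week is not zero-padded.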

2. Download the actual data

ccdown download ./paths/warc.paths.gz ./data
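Before starting a large download it can help to look at what the manifest contains. A minimal Python sketch, independent of ccdown, assuming the manifest is a gzip-compressed text file with one relative path per line and that the files are served under https://data.commoncrawl.org/. Both points are assumptions about Common Crawl's layout rather than something this README states, and read_paths and to_url are hypothetical helper names.

```python
import gzip

# Assumption: base URL of Common Crawl's public HTTP endpoint.
BASE_URL = "https://data.commoncrawl.org/"


def read_paths(manifest: str) -> list[str]:
    """Return the non-empty lines of a gzip-compressed path manifest."""
    with gzip.open(manifest, "rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def to_url(path: str) -> str:
    """Join a relative manifest path onto the base URL."""
    return BASE_URL + path.lstrip("/")
```

Calling len(read_paths("./paths/warc.paths.gz")) gives the number of files a download run would fetch, which is a quick way to estimate its size beforehand.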

Options

Flag  Description                                  Default
-t    Number of concurrent downloads               10
-r    Max retries per file                         1000
-p    Show progress bars                           off
-f    Flat file output (no directory structure)    off
-n    Numbered output (for Ungoliant Pipeline)     off

Example with progress and 5 threads

ccdown download -p -t 5 ./paths/warc.paths.gz ./data

Note: Keep -t at 10 or below. Too many concurrent requests make the server respond with HTTP 403, and those errors are unrecoverable.
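When ccdown is driven from a script, the limit in the note above can be enforced before the process starts. A hedged Python sketch: build_command and run_download are hypothetical wrappers, they use only the -t and -p flags documented in the options table, and the cap of 10 mirrors the note rather than any limit read from ccdown itself.

```python
import subprocess

MAX_THREADS = 10  # mirrors the note above; not a value queried from ccdown


def build_command(paths, out_dir, threads=10, progress=False):
    """Build a `ccdown download` argument list, rejecting more than 10 threads."""
    if not 1 <= threads <= MAX_THREADS:
        raise ValueError(f"threads must be between 1 and {MAX_THREADS}")
    cmd = ["ccdown", "download", "-t", str(threads)]
    if progress:
        cmd.append("-p")
    cmd += [paths, out_dir]
    return cmd


def run_download(paths, out_dir, **kwargs):
    """Run ccdown and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(paths, out_dir, **kwargs), check=True)
```

For example, run_download("./paths/warc.paths.gz", "./data", threads=5, progress=True) runs the same command as the example above, and asking for threads=11 fails early with a ValueError instead of risking 403 responses.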

License

MIT OR Apache-2.0