## Install

### From source

### Pre-built binaries

Grab the latest release for your platform from the releases page.
## Usage

### 1. Download the path manifest for a crawl

Supported subsets: `segment`, `warc`, `wat`, `wet`, `robotstxt`, `non200responses`, `cc-index`, `cc-index-table`.

Crawl identifier format: `CC-MAIN-YYYY-WW` or `CC-NEWS-YYYY-MM`.
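The crawl identifier format above can be checked with a small regex. This is an illustrative helper, not part of the tool:

```python
import re

# Accepts main crawls (CC-MAIN-YYYY-WW) and news crawls (CC-NEWS-YYYY-MM).
CRAWL_ID = re.compile(r"CC-(MAIN-\d{4}-\d{2}|NEWS-\d{4}-\d{2})")

def is_valid_crawl_id(s: str) -> bool:
    """Return True if s looks like a supported crawl identifier."""
    return CRAWL_ID.fullmatch(s) is not None
```

For example, `is_valid_crawl_id("CC-MAIN-2023-50")` is `True`, while a truncated identifier like `"CC-MAIN-2023"` is rejected.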
### 2. Download the actual data
### Options
| Flag | Description | Default |
|---|---|---|
| `-t` | Number of concurrent downloads | 10 |
| `-r` | Max retries per file | 1000 |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for the Ungoliant pipeline) | off |
| `-s` | Abort on unrecoverable errors (401, 403, 404) | off |
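Each flag has a counterpart in the Python client API documented below. The correspondence, as inferred from this README, can be written out explicitly:

```python
# Correspondence between CLI flags and the Python client API,
# inferred from this README (illustrative, not part of the tool).
FLAG_TO_API = {
    "-t": "Client(threads=...)",    # number of concurrent downloads
    "-r": "Client(retries=...)",    # max retries per file
    "-p": "Client(progress=True)",  # show progress bars
    "-f": ".files_only()",          # flat file output
    "-n": ".numbered()",            # numbered output
    "-s": ".strict()",              # abort on 401/403/404
}
```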
### Example
Note: Keep threads at 10 or below. Too many concurrent requests will get you
403'd by the server, and those errors are unrecoverable.
## Python

### Install

### Usage
Typical usage covers four operations: downloading the path manifest for a crawl, downloading the actual data, flat file output (no directory structure), and numbered output plus strict mode (abort on 401/403/404).
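Using the client API from the API section below, those operations look roughly like the following. The import path `cc_tool`, the crawl identifier, and the file names are placeholders, not the package's real names:

```python
# Placeholder import path; substitute the actual package name.
from cc_tool import Client

client = Client(threads=10, retries=1000, progress=True)

# Download the path manifest for a crawl
client.paths("CC-MAIN-2023-50", "wet").to("wet.paths.gz")

# Download the actual data
client.download("wet.paths.gz").to("data/")

# Flat file output (no directory structure)
client.download("wet.paths.gz").files_only().to("data/")

# Numbered output + strict mode (abort on 401/403/404)
client.download("wet.paths.gz").numbered().strict().to("data/")
```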
### API
- `Client(threads=10, retries=1000, progress=False)` — create a client with shared config.
- `client.paths(snapshot, data_type)` — returns a builder; call `.to(dst)` to download the path manifest.
- `client.download(path_file)` — returns a builder with chainable options:
  - `.files_only()` — flatten the directory structure
  - `.numbered()` — enumerate output files (for Ungoliant)
  - `.strict()` — abort on unrecoverable HTTP errors
  - `.to(dst)` — execute the download
## License
MIT OR Apache-2.0