ccdown
A polite downloader for Common Crawl data, written in Rust.
Install
From crates.io
From source
Pre-built binaries
Grab the latest release for your platform from the releases page.
Usage
1. Download the path manifest for a crawl
Supported subsets: segment, warc, wat, wet, robotstxt, non200responses, cc-index, cc-index-table
Crawl format: CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM
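The subset names and crawl format above can be checked before any request is made. The following is a minimal sketch, not ccdown's own code: `is_valid_crawl_id` and `manifest_url` are hypothetical helper names, and the URL pattern is an assumption based on the public Common Crawl layout for `CC-MAIN` crawls. `CC-NEWS` identifiers are validated but deliberately left out of the URL helper.

```rust
/// Subset names as listed in this README.
const SUBSETS: [&str; 8] = [
    "segment", "warc", "wat", "wet",
    "robotstxt", "non200responses", "cc-index", "cc-index-table",
];

/// Returns true if `id` has the shape CC-MAIN-YYYY-WW or CC-NEWS-YYYY-MM.
/// Only the shape is checked (4 digits, dash, 2 digits), not the value range.
fn is_valid_crawl_id(id: &str) -> bool {
    let rest = match id
        .strip_prefix("CC-MAIN-")
        .or_else(|| id.strip_prefix("CC-NEWS-"))
    {
        Some(r) => r,
        None => return false,
    };
    let mut parts = rest.split('-');
    let (year, num) = match (parts.next(), parts.next(), parts.next()) {
        (Some(y), Some(n), None) => (y, n),
        _ => return false,
    };
    year.len() == 4
        && num.len() == 2
        && year.bytes().all(|b| b.is_ascii_digit())
        && num.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical helper: builds the path manifest URL for a CC-MAIN crawl.
/// Assumes the layout crawl-data/<crawl>/<subset>.paths.gz; returns None
/// for CC-NEWS crawls, malformed identifiers, or unknown subsets.
fn manifest_url(crawl: &str, subset: &str) -> Option<String> {
    if !crawl.starts_with("CC-MAIN-")
        || !is_valid_crawl_id(crawl)
        || !SUBSETS.contains(&subset)
    {
        return None;
    }
    Some(format!(
        "https://data.commoncrawl.org/crawl-data/{}/{}.paths.gz",
        crawl, subset
    ))
}

fn main() {
    match manifest_url("CC-MAIN-2024-10", "wet") {
        Some(url) => println!("{}", url),
        None => eprintln!("invalid crawl or subset"),
    }
}
```

Validating up front gives a clear error for a mistyped crawl name instead of a failed download later.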
2. Download the actual data
Options
| Flag | Description | Default |
|---|---|---|
| `-t` | Number of concurrent downloads | 10 |
| `-r` | Max retries per file | 1000 |
| `-p` | Show progress bars | off |
| `-f` | Flat file output (no directory structure) | off |
| `-n` | Numbered output (for Ungoliant Pipeline) | off |
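To illustrate the difference between the default layout and `-f`, here is a rough sketch of how a local output path could be derived from a remote path. It assumes the default mode mirrors the remote directory structure under the output directory, which this README does not state explicitly. `output_path` and the example path are hypothetical, and numbered output (`-n`) is not covered.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: maps a remote path from the manifest to a local path.
/// With `flat` set, only the file name is kept; otherwise the remote
/// directory structure is reproduced under `out_dir` (an assumption).
fn output_path(out_dir: &Path, remote: &str, flat: bool) -> PathBuf {
    if flat {
        // rsplit always yields at least one item, so the fallback is never hit.
        let name = remote.rsplit('/').next().unwrap_or(remote);
        out_dir.join(name)
    } else {
        // Strip any leading slash so `join` does not treat it as absolute.
        out_dir.join(remote.trim_start_matches('/'))
    }
}

fn main() {
    // Made-up remote path, for illustration only.
    let remote = "crawl-data/CC-MAIN-2024-10/segments/0000/wet/example.warc.wet.gz";
    let out = Path::new("out");
    println!("nested: {}", output_path(out, remote, false).display());
    println!("flat:   {}", output_path(out, remote, true).display());
}
```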
Example with progress and 5 threads
Note: Keep threads at 10 or below. Too many concurrent requests will cause the
server to respond with 403 errors, and those errors are unrecoverable.
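The interaction between the retry limit and 403 responses can be sketched as a small decision function. This is an illustration under assumptions, not ccdown's actual retry logic: `Action`, `next_action`, and the capped exponential backoff are all hypothetical. The only behaviour taken from this README is that 403 is treated as fatal and that retries are bounded per file.

```rust
use std::time::Duration;

/// What to do after a response for a single file.
#[derive(Debug, PartialEq)]
enum Action {
    Done,
    Retry(Duration),
    GiveUp,
}

/// Hypothetical policy: succeed on 2xx, give up immediately on 403,
/// otherwise retry with exponential backoff capped at 64 seconds
/// until `max_retries` attempts have been used.
fn next_action(status: u16, attempt: u32, max_retries: u32) -> Action {
    match status {
        200..=299 => Action::Done,
        // 403 means the server has started refusing requests; retrying does not help.
        403 => Action::GiveUp,
        _ if attempt >= max_retries => Action::GiveUp,
        _ => {
            let exp = attempt.min(6); // cap the delay at 2^6 = 64 seconds
            Action::Retry(Duration::from_secs(1u64 << exp))
        }
    }
}

fn main() {
    for (status, attempt) in [(200u16, 0u32), (503, 3), (403, 0)] {
        println!("{} (attempt {}): {:?}", status, attempt, next_action(status, attempt, 1000));
    }
}
```

Treating 403 separately from transient errors stops a blocked client from spending its whole retry budget on requests that cannot succeed.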
License
MIT OR Apache-2.0