url-crawler
A configurable parallel web crawler, designed to crawl a website for content.
Example
extern crate url_crawler;
use std::sync::Arc;
use url_crawler::*;
/// Function for filtering content in the crawler before a HEAD request.
///
/// Only allow directory entries, and files that have the `deb` extension.
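The filter described by the comment above could be sketched roughly as follows. This is a minimal sketch, not the crate's confirmed code: the name `apt_filter` and the plain `&str` parameter are assumptions made for illustration, and the crawler's actual callback may receive a parsed URL type instead.

```rust
/// Returns true for URLs worth fetching: directory entries and `.deb` files.
///
/// Hypothetical helper: the name and the `&str` parameter are assumptions,
/// not the crate's confirmed callback signature.
fn apt_filter(url: &str) -> bool {
    // Directory entries end with a slash; Debian packages end with `.deb`.
    url.ends_with('/') || url.ends_with(".deb")
}
```

Checking only the URL text keeps the filter cheap, so links that fail it can be skipped before any HEAD request is made.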
Output
The following shows two snippets from the combined output.
...
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/s/system76-cudnn-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cuda-9.2/"
}
Html {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/source/t/tensorflow-1.9-cpu/"
}
...
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.30.0_amd64.deb",
    content_type: "application/octet-stream",
    length: 87689398,
    modified: Some(
        2018-09-25T17:54:39+00:00
    )
}
File {
    url: "http://apt.pop-os.org/proprietary/pool/bionic/main/binary-amd64/a/atom/atom_1.31.1_amd64.deb",
    content_type: "application/octet-stream",
    length: 90108020,
    modified: Some(
        2018-10-03T22:29:15+00:00
    )
}
...
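Entries shaped like the output above can be processed with an ordinary `match`. The sketch below is only an illustration: `Entry` is a hypothetical stand-in type modeled on the fields visible in the debug output, not the crate's own definition, and `total_file_bytes` is an invented helper name.

```rust
/// Hypothetical stand-in for the crawler's entry type, modeled only on the
/// fields visible in the debug output. Not the crate's actual definition.
#[allow(dead_code)]
#[derive(Debug)]
enum Entry {
    Html {
        url: String,
    },
    File {
        url: String,
        content_type: String,
        length: u64,
        // Kept as a plain string to avoid assuming a date-time type.
        modified: Option<String>,
    },
}

/// Sum the reported sizes of all file entries, ignoring HTML pages.
fn total_file_bytes(entries: &[Entry]) -> u64 {
    entries
        .iter()
        .map(|entry| match entry {
            Entry::File { length, .. } => *length,
            Entry::Html { .. } => 0,
        })
        .sum()
}
```

The same pattern extends to other per-entry work, such as collecting the URLs of `File` entries for a later download step.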