Crate spider

Website crawling library that rapidly crawls all pages to gather links via isolated contexts.

Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather tens of thousands of pages within seconds.

How to use Spider

There are a few ways to use Spider:

  • Concurrent is the fastest way to start crawling a website and is typically the most efficient.
    • crawl is used to crawl concurrently.
  • Sequential lets you crawl pages one after another, respecting the configured delay (a delay sketch follows this list).
  • Scrape fetches each page and holds onto the raw HTML string for parsing (see the scrape example under Examples below).
    • scrape is used to gather the HTML.
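
A minimal sketch of tuning the crawl delay before starting, assuming Configuration exposes a public delay field in milliseconds (verify against the version you use):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: delay is the pause between requests in milliseconds.
    website.configuration.delay = 250;

    website.crawl().await;
}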

Examples

A simple crawl to index a website:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
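
The scrape variant keeps the raw HTML of each page instead of only the links. A minimal sketch, assuming scraped pages are retrievable via get_pages and each Page exposes get_url and get_html (verify against the version you use):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Crawl and retain each page body instead of only the links.
    website.scrape().await;

    // Assumption: scraped pages are exposed through get_pages.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}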

Feature flags

  • ua_generator: Enables auto generating a random real User-Agent. Enabled by default.
  • regex: Enables blacklisting paths with regex.
  • jemalloc: Enables the jemalloc memory backend.
  • decentralized: Enables decentralized processing of IO; requires starting spider_worker before crawls.
  • control: Enables the ability to pause, start, and shutdown crawls on demand.
  • full_resources: Enables gathering all content that relates to the domain.
  • serde: Enables serde serialization support.

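
Non-default features are opted into through Cargo. A sketch of a Cargo.toml entry (the version number is illustrative):

[dependencies]
spider = { version = "1", features = ["regex", "control"] }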