Crate spider

Website crawling library that rapidly crawls all pages to gather links via isolated contexts.

Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather tens of thousands of pages within seconds.

How to use Spider

There are a few ways to use Spider:

  • Concurrent is the fastest way to start crawling a website and is typically the most efficient.
    • crawl is used to crawl concurrently.
  • Sequential lets you crawl pages one after another, respecting the configured delay (a delay sketch follows this list).
  • Scrape fetches each page and holds onto the raw HTML string for parsing (see the scrape example under Examples below).
    • scrape is used to gather the HTML.
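
A minimal sketch of tuning the crawl delay before starting, assuming Configuration exposes a public delay field in milliseconds (verify against the version you use):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: delay is the pause between requests in milliseconds.
    website.configuration.delay = 250;

    website.crawl().await;
}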

Examples

A simple crawl to index a website:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
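
The scrape variant keeps the raw HTML of each page instead of only the links. A minimal sketch, assuming scraped pages are retrievable via get_pages and each Page exposes get_url and get_html (verify against the version you use):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Crawl and retain each page body instead of only the links.
    website.scrape().await;

    // Assumption: scraped pages are exposed through get_pages.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}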

Feature flags

  • ua_generator: Enables auto generating a random real User-Agent. Enabled by default.
  • regex: Enables blacklisting paths with regex.
  • jemalloc: Enables the jemalloc memory backend.
  • decentralized: Enables decentralized processing of IO; requires starting spider_worker before crawls.
  • control: Enables the ability to pause, start, and shutdown crawls on demand.
  • full_resources: Enables gathering all content that relates to the domain.
  • serde: Enables serde serialization support.

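
Non-default features are opted into through Cargo. A sketch of a Cargo.toml entry (the version number is illustrative):

[dependencies]
spider = { version = "1", features = ["regex", "control"] }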