Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It has the ability to gather millions of pages within seconds.
How to use Spider
There are a couple of ways to use Spider:
- Crawl: starts crawling a web page and performs most of the work in isolated contexts; crawl is used to crawl concurrently.
- Scrape: scrapes the page and holds onto the raw HTML string for parsing; scrape is used to gather the HTML (see the scrape sketch under Examples).
Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    website.crawl().await;
    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
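A scrape keeps the raw HTML of each page instead of only the links. The sketch below is a hedged example rather than the documented API: it assumes the scraped pages are exposed through a get_pages accessor and that each Page offers get_url and get_html methods; verify the exact names in the page and website modules.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // scrape holds onto the raw HTML of every page visited.
    website.scrape().await;

    // Assumption: scraped pages are available via get_pages(),
    // and each Page exposes its raw HTML via get_html().
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("{} -> {} bytes", page.get_url(), page.get_html().len());
        }
    }
}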
Feature flags
- ua_generator: Enables auto generating a random real User-Agent.
- regex: Enables blacklisting paths with regex.
- jemalloc: Enables the jemalloc memory backend.
- decentralized: Enables decentralized processing of IO; requires the spider_worker startup before crawls.
- sync: Subscribe to changes for Page data processing async (see the sketch after this list).
- budget: Allows setting a crawl budget per path with depth.
- control: Enables the ability to pause, start, and shutdown crawls on demand.
- full_resources: Enables gathering all content that relates to the domain, like css, js, etc.
- serde: Enables serde serialization support.
- socks: Enables socks5 proxy support.
- glob: Enables url glob support.
- fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
- sitemap: Include sitemap pages in results.
- js: Enables parsing links created with javascript using the alpha jsdom crate.
- time: Enables duration tracking per page.
- cache: Enables HTTP request caching to disk.
- cache_mem: Enables HTTP request caching persisted in memory.
- chrome: Enables chrome headless rendering; use the env var CHROME_URL to connect remotely [experimental].
- chrome_headed: Enables chrome headful rendering [experimental].
- chrome_cpu: Disables gpu usage for the chrome browser.
- chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
- chrome_store_page: Stores the page object to perform other actions, like taking screenshots conditionally.
- chrome_screenshot: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory; use the env variable SCREENSHOT_DIRECTORY to adjust the directory.
- chrome_intercept: Allows intercepting network requests to speed up processing.
- cookies: Enables storing and setting cookies to use for requests.
- cron: Enables the ability to start cron jobs for the website.
- http3: Enables the experimental HTTP/3 client.
- smart: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by re-using the content.
- encoding: Enables handling content with different encodings, like Shift_JIS.
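As an example of one feature-gated API, the sketch below subscribes to pages as they are crawled with the sync feature enabled. It assumes Website::subscribe takes a channel capacity and returns an Option wrapping a tokio broadcast receiver of Page; treat the signature as an assumption and check the website module for the exact form.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: with the `sync` feature, subscribe(capacity) hands back
    // Option<tokio::sync::broadcast::Receiver<Page>>.
    let mut rx = website.subscribe(16).unwrap();

    // Handle pages as they stream in, concurrently with the crawl.
    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("received {}", page.get_url());
        }
    });

    website.crawl().await;
}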
Re-exports
- pub extern crate bytes;
- pub extern crate case_insensitive_string;
- pub extern crate compact_str;
- pub extern crate fast_html5ever;
- pub extern crate hashbrown;
- pub extern crate lazy_static;
- pub extern crate percent_encoding;
- pub extern crate reqwest;
- pub extern crate smallvec;
- pub extern crate string_concat;
- pub extern crate tokio;
- pub extern crate tokio_stream;
- pub extern crate url;
Modules
- Black list checking whether a url exists.
- Configuration structure for Website (see the configuration sketch after this list).
- Optional features to use.
- Internal packages customized.
- A page scraped.
- Application utils.
- A website to crawl.
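The configuration module above controls how a Website crawls. A minimal sketch, assuming Website exposes a public configuration field with boolean options such as respect_robots_txt and subdomains (the field names are assumptions to verify against the configuration module):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumed public configuration fields; check the configuration module
    // for the authoritative names and types.
    website.configuration.respect_robots_txt = true;
    website.configuration.subdomains = false;

    website.crawl().await;
    println!("visited {} links", website.get_links().len());
}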
Structs
- Case-insensitive string handling.
Type Aliases
- The asynchronous Client to make requests with.
- The asynchronous Client Builder.