Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather tens of thousands of pages within seconds.
How to use Spider
There are a couple of ways to use Spider:
- Concurrent is the fastest way to start crawling a web page and typically the most efficient. `crawl` is used to crawl concurrently.
- Sequential lets you crawl the web pages one after another, respecting delay sequences. `crawl_sync` is used to crawl in sync.
- Scrape gathers the page and holds onto the raw HTML string for parsing. `scrape` is used to gather the HTML; a sketch appears after the first example below.
Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    website.crawl().await;
    let links = website.get_links();
    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
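The sequential and scrape flows described above follow the same shape. The sketch below is illustrative rather than definitive: it uses the `crawl_sync` and `scrape` methods named earlier, and assumes `get_pages`, `get_url`, and `get_html` accessors for reading the scraped HTML, which may differ between versions of the crate.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Sequential crawl: pages are visited one after another, respecting delays.
    let mut website: Website = Website::new("https://rsseau.fr");
    website.crawl_sync().await;

    // Scrape: keep the raw HTML of each page for later parsing.
    let mut scraper: Website = Website::new("https://rsseau.fr");
    scraper.scrape().await;

    // `get_pages`, `get_url`, and `get_html` are assumed accessors;
    // check the page module for the exact API in your version.
    if let Some(pages) = scraper.get_pages() {
        for page in pages.iter() {
            println!("{} - {} bytes", page.get_url(), page.get_html().len());
        }
    }
}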
Feature flags
- ua_generator: Enables auto-generating a random real User-Agent. Enabled by default.
- regex: Enables blacklisting paths with regex.
- jemalloc: Enables the jemalloc memory backend.
- decentralized: Enables decentralized processing of IO; requires starting [spider_worker] before crawls.
- control: Enables the ability to pause, start, and shut down crawls on demand.
- full_resources: Enables gathering all content that relates to the domain.
- serde: Enables serde serialization support.
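For illustration, these flags are toggled on the dependency in Cargo.toml; the version below is a placeholder and the selected features are only an example:

[dependencies]
# Placeholder version; use the release you actually depend on.
# `ua_generator` is on by default, so it only needs attention if you
# choose to disable default features.
spider = { version = "1", features = ["regex", "control", "serde"] }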
Re-exports
pub extern crate compact_str;
pub extern crate hashbrown;
pub extern crate reqwest;
pub extern crate string_concat;
pub extern crate tokio;
pub extern crate url;
Modules
- black_list: Black list checking whether a URL exists.
- configuration: Configuration structure for Website (a configuration sketch follows this list).
- packages: Internal packages customized.
- page: A page scraped.
- utils: Application utils.
- website: A website to crawl.
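A minimal sketch of adjusting the Website configuration before a crawl, assuming the Configuration struct exposes public fields such as respect_robots_txt and delay; the exact fields live in the configuration module and may differ between versions.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // Assumed Configuration fields; verify against the configuration
    // module for the version in use.
    website.configuration.respect_robots_txt = true;
    website.configuration.delay = 250;
    website.crawl().await;
    println!("visited {} links", website.get_links().len());
}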
Structs
- CaseInsensitiveString: case-insensitive string handling.