Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It has the ability to gather millions of pages within seconds.
How to use Spider
There are a couple of ways to use Spider:
- Crawl: starts crawling a web page and performs most of the work in isolated contexts; crawl is used to crawl concurrently.
- Scrape: scrapes the page and holds onto the raw HTML string for parsing; scrape is used to gather the HTML (see the scrape sketch under Examples).
Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    website.crawl().await;
    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
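A scrape keeps the raw HTML of each page instead of only the links. The sketch below is a hedged example rather than the documented API: it assumes the scraped pages are exposed through a get_pages accessor and that each Page offers get_url and get_html methods; verify the exact names in the page and website modules.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");
    // scrape holds onto the raw HTML of every page visited.
    website.scrape().await;

    // Assumption: scraped pages are available via get_pages(),
    // and each Page exposes its raw HTML via get_html().
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("{} -> {} bytes", page.get_url(), page.get_html().len());
        }
    }
}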
Feature flags
- ua_generator: Enables auto generating a random real User-Agent.
- regex: Enables blacklisting paths with regex.
- jemalloc: Enables the jemalloc memory backend.
- decentralized: Enables decentralized processing of IO; requires the spider_worker startup before crawls.
- sync: Subscribe to changes for Page data processing async (see the sketch after this list).
- budget: Allows setting a crawl budget per path with depth.
- control: Enables the ability to pause, start, and shutdown crawls on demand.
- full_resources: Enables gathering all content that relates to the domain, like css, js, etc.
- serde: Enables serde serialization support.
- socks: Enables socks5 proxy support.
- glob: Enables url glob support.
- fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
- sitemap: Include sitemap pages in results.
- js: Enables parsing links created with javascript using the alpha jsdom crate.
- time: Enables duration tracking per page.
- cache: Enables HTTP request caching to disk.
- cache_mem: Enables HTTP request caching persisted in memory.
- chrome: Enables chrome headless rendering; use the env var CHROME_URL to connect remotely [experimental].
- chrome_headed: Enables chrome headful rendering [experimental].
- chrome_cpu: Disables gpu usage for the chrome browser.
- chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
- chrome_store_page: Stores the page object to perform other actions, like taking screenshots conditionally.
- chrome_screenshot: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory; use the env variable SCREENSHOT_DIRECTORY to adjust the directory.
- chrome_intercept: Allows intercepting network requests to speed up processing.
- cookies: Enables storing and setting cookies to use for requests.
- cron: Enables the ability to start cron jobs for the website.
- http3: Enables the experimental HTTP/3 client.
- smart: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by re-using the content.
- encoding: Enables handling content with different encodings, like Shift_JIS.
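As an example of one feature-gated API, the sketch below subscribes to pages as they are crawled with the sync feature enabled. It assumes Website::subscribe takes a channel capacity and returns an Option wrapping a tokio broadcast receiver of Page; treat the signature as an assumption and check the website module for the exact form.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumption: with the `sync` feature, subscribe(capacity) hands back
    // Option<tokio::sync::broadcast::Receiver<Page>>.
    let mut rx = website.subscribe(16).unwrap();

    // Handle pages as they stream in, concurrently with the crawl.
    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("received {}", page.get_url());
        }
    });

    website.crawl().await;
}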
Re-exports
- pub extern crate bytes;
- pub extern crate case_insensitive_string;
- pub extern crate compact_str;
- pub extern crate fast_html5ever;
- pub extern crate hashbrown;
- pub extern crate lazy_static;
- pub extern crate percent_encoding;
- pub extern crate reqwest;
- pub extern crate smallvec;
- pub extern crate string_concat;
- pub extern crate tokio;
- pub extern crate tokio_stream;
- pub extern crate url;
Modules
- Black list checking whether a url exists.
- Configuration structure for Website (see the configuration sketch after this list).
- Optional features to use.
- Internal packages customized.
- A page scraped.
- Application utils.
- A website to crawl.
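The configuration module above controls how a Website crawls. A minimal sketch, assuming Website exposes a public configuration field with boolean options such as respect_robots_txt and subdomains (the field names are assumptions to verify against the configuration module):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://rsseau.fr");

    // Assumed public configuration fields; check the configuration module
    // for the authoritative names and types.
    website.configuration.respect_robots_txt = true;
    website.configuration.subdomains = false;

    website.crawl().await;
    println!("visited {} links", website.get_links().len());
}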
Structs
- Case-insensitive string handling.
Type Aliases
- The asynchronous Client to make requests with.
- The asynchronous Client Builder.