Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather millions of pages within seconds.
§How to use Spider
There are a couple of ways to use Spider:
- crawl: start concurrently crawling a site. Can be used to send each page (including URL and HTML) to a subscriber for processing, or just to gather links.
- scrape: like crawl, but saves the raw HTML strings to parse after scraping is complete (a minimal sketch follows this list).
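A scrape, unlike the crawl examples below, keeps the raw HTML of each page. The following is a minimal sketch, assuming a get_pages accessor on Website and get_html on Page; treat those method names as assumptions to verify against the current API.
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    // Scrape stores the raw HTML of each page instead of only the links.
    website.scrape().await;
    // get_pages is assumed to expose the stored pages once the scrape completes.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}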
§Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    website.crawl().await;
    let links = website.get_links();
    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
Subscribe to crawl events:
use spider::tokio;
use spider::website::Website;
use tokio::io::AsyncWriteExt;
#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    let mut rx2 = website.subscribe(16).unwrap();
    tokio::spawn(async move {
        let mut stdout = tokio::io::stdout();
        while let Ok(res) = rx2.recv().await {
            let _ = stdout
                .write_all(format!("- {}\n", res.get_url()).as_bytes())
                .await;
        }
    });
    website.crawl().await;
}
§Feature flags
- ua_generator: Enables auto generating a random real User-Agent.
- disk: Enables SQLite hybrid disk storage to balance memory usage with no tls.
- disk_native_tls: Enables SQLite hybrid disk storage to balance memory usage with native tls.
- disk_aws: Enables SQLite hybrid disk storage to balance memory usage with aws_tls.
- balance: Enables balancing the CPU and memory to scale more efficiently.
- regex: Enables blacklisting paths with regex.
- firewall: Enables the spider_firewall crate to prevent crawling bad websites.
- decentralized: Enables decentralized processing of IO; requires spider_worker to be started before crawls.
- sync: Enables subscribing to Page data changes for async processing.
- control: Enables the ability to pause, start, and shutdown crawls on demand.
- full_resources: Enables gathering all content that relates to the domain, like CSS, JS, etc.
- serde: Enables serde serialization support.
- socks: Enables socks5 proxy support.
- glob: Enables url glob support.
- fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
- sitemap: Include sitemap pages in results.
- time: Enables duration tracking per page.
- cache: Enables caching HTTP requests to disk.
- cache_mem: Enables caching HTTP requests in memory.
- cache_chrome_hybrid: Enables hybrid request caching between Chrome and HTTP.
- cache_openai: Enables caching the OpenAI request. This can drastically save costs when developing AI workflows.
- chrome: Enables chrome headless rendering; use the env var CHROME_URL to connect remotely.
- chrome_headed: Enables chrome headful rendering.
- chrome_cpu: Disables GPU usage for the chrome browser.
- chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
- chrome_store_page: Store the page object to perform other actions like taking screenshots conditionally.
- chrome_screenshot: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory. Use the env variable SCREENSHOT_DIRECTORY to adjust the directory.
- chrome_intercept: Allows intercepting network requests to speed up processing.
- chrome_headless_new: Use headless=new to launch the chrome instance.
- cookies: Enables storing and setting cookies for requests.
- real_browser: Enables the ability to bypass protected pages.
- cron: Enables the ability to start cron jobs for the website.
- openai: Enables OpenAI to generate dynamic browser executable scripts. Make sure to use the env var OPENAI_API_KEY.
- smart: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding extra network requests by re-using the content.
- encoding: Enables handling content with different encodings, like Shift_JIS.
- spoof: Spoofs HTTP headers for requests.
- headers: Enables the extraction of header information on each retrieved page. Adds a headers field to the page struct.
- decentralized_headers: Enables the extraction of suppressed header information from the decentralized processing of IO. This is needed if headers is set in both spider and spider_worker.
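These flags are enabled at compile time as Cargo features of the spider crate, while most of the behavior they touch is tuned at run time through the Website builder. The following is a minimal sketch of run-time configuration, assuming builder methods such as with_respect_robots_txt, with_user_agent, and with_delay with the signatures shown; check the configuration module for the exact API.
use spider::tokio;
use spider::website::Website;
#[tokio::main]
async fn main() {
    // Builder-style configuration; method names and signatures are assumptions
    // to verify against the configuration module.
    let mut website: Website = Website::new("https://spider.cloud")
        .with_respect_robots_txt(true)
        .with_user_agent(Some("myapp/0.1"))
        .with_delay(250)
        .build()
        .unwrap();
    website.crawl().await;
    println!("visited {} pages", website.get_links().len());
}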
Re-exports§
- pub extern crate auto_encoder;
- pub extern crate bytes;
- pub extern crate case_insensitive_string;
- pub extern crate hashbrown;
- pub extern crate lazy_static;
- pub extern crate percent_encoding;
- pub extern crate quick_xml;
- pub extern crate reqwest;
- pub extern crate smallvec;
- pub extern crate spider_fingerprint;
- pub extern crate string_concat;
- pub extern crate strum;
- pub extern crate tokio;
- pub extern crate tokio_stream;
- pub extern crate ua_generator;
- pub extern crate url;
- pub use client::Client;
- pub use client::ClientBuilder;
- pub use case_insensitive_string::compact_str;
Modules§
- black_list: Blacklist checking whether a URL exists.
- client: Client interface.
- configuration: Configuration structure for Website.
- features: Optional features to use.
- packages: Internal packages customized.
- page: A page scraped.
- utils: Application utils.
- website: A website to crawl.
Structs§
- CaseInsensitiveString: case-insensitive string handling
Type Aliases§
- RelativeSelectors: The selectors type. The values are held to make sure the relative domain can be crawled upon base redirects.