Crate spider

Website crawling library that rapidly crawls all pages to gather links via isolated contexts.

Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather millions of pages within seconds.

§How to use Spider

There are a couple of ways to use Spider:

  • crawl: start concurrently crawling a site. Can be used to send each page (including URL and HTML) to a subscriber for processing, or just to gather links.

  • scrape: like crawl, but saves the raw HTML strings so they can be parsed after the crawl completes (see the scraping example below).

§Examples

A simple crawl to index a website:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
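
A scrape keeps the raw HTML for later parsing instead of only gathering links. A minimal sketch of the same site using scrape and get_pages (assuming get_pages returns the pages collected during the scrape):

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");

    // scrape stores the raw HTML for every visited page
    website.scrape().await;

    // the collected pages are available once the scrape completes
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}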

Subscribe to crawl events:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    let mut rx2 = website.subscribe(16);

    tokio::spawn(async move {
        while let Ok(res) = rx2.recv().await {
            println!("- {}", res.get_url());
        }
    });

    website.crawl().await;
}

§Spider Cloud Integration

Use Spider Cloud for anti-bot bypass, proxy rotation, and high-throughput data collection. Enable the spider_cloud feature and set your API key. Set the return format to Markdown for clean, LLM-ready output:

use spider::configuration::{SpiderCloudConfig, SpiderCloudMode, SpiderCloudReturnFormat};
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let config = SpiderCloudConfig::new("YOUR_API_KEY")
        .with_mode(SpiderCloudMode::Smart)
        .with_return_format(SpiderCloudReturnFormat::Markdown);

    let mut website: Website = Website::new("https://example.com")
        .with_limit(10)
        .with_spider_cloud_config(config)
        .build()
        .unwrap();

    let mut rx = website.subscribe(16);

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            let url = page.get_url();
            let markdown = page.get_content();
            let status = page.status_code;

            println!("[{status}] {url}\n---\n{markdown}\n");
        }
    });

    website.crawl().await;
    website.unsubscribe();
}

§Chrome Rendering

Enable the chrome feature to render JavaScript-heavy pages. Use the env var CHROME_URL to connect to a remote instance:

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud")
        .with_limit(10)
        .with_chrome_intercept(Default::default())
        .build()
        .unwrap();

    let mut rx = website.subscribe(16);

    tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{} - {}", page.get_url(), page.get_html_bytes_u8().len());
        }
    });

    website.crawl().await;
}

§Feature flags

§Core

  • ua_generator: Enables auto-generating a random, real User-Agent.
  • regex: Enables blacklisting paths with regex.
  • glob: Enables URL glob support.
  • fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
  • sitemap: Include sitemap pages in results.
  • time: Enables duration tracking per page.
  • encoding: Enables handling the content with different encodings like Shift_JIS.
  • serde: Enables serde serialization support.
  • sync: Enables subscribing to Page data for asynchronous processing during the crawl.
  • control: Enables the ability to pause, start, and shutdown crawls on demand.
  • full_resources: Enables gathering all content related to the domain, such as CSS and JS.
  • cookies: Enables storing and setting cookies to use for requests.
  • spoof: Spoofs HTTP headers for requests.
  • headers: Enables the extraction of header information on each retrieved page. Adds a headers field to the page struct.
  • balance: Enables balancing the CPU and memory to scale more efficiently.
  • cron: Enables the ability to start cron jobs for the website.
  • tracing: Enables tokio tracing support for diagnostics.
  • cowboy: Enables full concurrency mode with no throttle.
  • llm_json: Enables LLM-friendly JSON parsing.
  • page_error_status_details: Enables storing detailed error status information on pages.
  • extra_information: Enables extra page metadata collection.
  • cmd: Enables tokio process support.
  • io_uring: Enables Linux io_uring support for async I/O (default on Linux).
  • simd: Enables SIMD-accelerated JSON parsing.
  • inline-more: More aggressive function inlining for performance (may increase compile times).

§Storage

  • disk: Enables SQLite hybrid disk storage to balance memory usage with no TLS.
  • disk_native_tls: Enables SQLite hybrid disk storage to balance memory usage with native TLS.
  • disk_aws: Enables SQLite hybrid disk storage to balance memory usage with AWS TLS.

§Caching

  • cache: Enables caching HTTP requests to disk.
  • cache_mem: Enables caching HTTP requests in memory.
  • cache_openai: Enables caching OpenAI requests. This can drastically reduce costs when developing AI workflows.
  • cache_gemini: Enables caching Gemini AI requests.
  • cache_chrome_hybrid: Enables hybrid Chrome + HTTP caching to disk.
  • cache_chrome_hybrid_mem: Enables hybrid Chrome + HTTP caching in memory.

§Chrome / Browser

  • chrome: Enables Chrome headless rendering; use the env var CHROME_URL to connect to a remote instance.
  • chrome_headed: Enables Chrome headful rendering.
  • chrome_cpu: Disables GPU usage for the Chrome browser.
  • chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
  • chrome_store_page: Stores the page object so further actions, such as conditionally taking screenshots, can be performed.
  • chrome_screenshot: Enables storing a screenshot of each page on crawl. Screenshots default to the ./storage/ directory; use the env variable SCREENSHOT_DIRECTORY to adjust it.
  • chrome_intercept: Allows intercepting network requests to speed up processing.
  • chrome_headless_new: Use headless=new to launch the Chrome instance.
  • chrome_simd: Enables SIMD optimizations for Chrome message parsing.
  • chrome_tls_connection: Enables TLS connection support for Chrome.
  • chrome_serde_stacker: Enables serde stacker for deeply nested Chrome messages.
  • chrome_remote_cache: Enables remote Chrome caching in memory.
  • chrome_remote_cache_disk: Enables remote Chrome caching to disk.
  • chrome_remote_cache_mem: Enables remote Chrome caching in memory only.
  • adblock: Enables adblock support for Chrome to block ads during rendering.
  • real_browser: Enables the ability to bypass protected pages.
  • smart: Enables smart mode. Requests run as plain HTTP until JavaScript rendering is needed, and the fetched content is re-used to avoid sending multiple network requests.

§WebDriver

  • webdriver: Enables WebDriver support via thirtyfour. Use with chromedriver, geckodriver, or Selenium.
  • webdriver_headed: Enables WebDriver headful mode.
  • webdriver_stealth: Enables stealth mode for WebDriver.
  • webdriver_chrome: WebDriver with Chrome browser.
  • webdriver_firefox: WebDriver with Firefox browser.
  • webdriver_edge: WebDriver with Edge browser.
  • webdriver_screenshot: Enables screenshots via WebDriver.

§AI / LLM

  • openai: Enables OpenAI to generate dynamic browser executable scripts. Make sure to set the env var OPENAI_API_KEY.
  • gemini: Enables Gemini AI to generate dynamic browser executable scripts. Make sure to set the env var GEMINI_API_KEY.

§Spider Cloud

  • spider_cloud: Enables Spider Cloud integration for anti-bot bypass, proxy rotation, and API-based crawling.

§Agent

  • agent: Enables the spider_agent multimodal autonomous agent.
  • agent_openai: Agent with OpenAI provider.
  • agent_chrome: Agent with Chrome browser context.
  • agent_webdriver: Agent with WebDriver context.
  • agent_skills: Agent with dynamic skill system for web automation challenges.
  • agent_skills_s3: Agent skills with S3 storage.
  • agent_fs: Agent with filesystem support for temp storage.
  • agent_search_serper: Agent with Serper search integration.
  • agent_search_brave: Agent with Brave Search integration.
  • agent_search_bing: Agent with Bing Search integration.
  • agent_search_tavily: Agent with Tavily search integration.
  • agent_full: Full agent with all features enabled.

§Search

  • search: Enables search provider base.
  • search_serper: Enables Serper search integration.
  • search_brave: Enables Brave Search integration.
  • search_bing: Enables Bing Search integration.
  • search_tavily: Enables Tavily search integration.

§Networking

  • socks: Enables SOCKS5 proxy support.
  • wreq: Enables the wreq HTTP client alternative with built-in impersonation.

§Distributed

  • decentralized: Enables decentralized processing of IO; requires spider_worker to be started before crawls.
  • decentralized_headers: Enables extraction of suppressed header information during decentralized IO processing. This is needed if headers is enabled in both spider and spider_worker.
  • firewall: Enables the spider_firewall crate to block bad websites from being crawled.
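
All of these features are additive and enabled through Cargo. A minimal sketch of opting into a few of them in Cargo.toml (the version number is illustrative; use the latest published release):

[dependencies]
spider = { version = "2", features = ["sync", "regex", "chrome"] }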


Re-exports§

pub extern crate async_job;
pub extern crate async_openai;
pub extern crate auto_encoder;
pub extern crate bytes;
pub extern crate case_insensitive_string;
pub extern crate flexbuffers;
pub extern crate gemini_rust;
pub extern crate hashbrown;
pub extern crate http_cache_reqwest;
pub extern crate lazy_static;
pub extern crate moka;
pub extern crate percent_encoding;
pub extern crate quick_xml;
pub extern crate reqwest;
pub extern crate reqwest_middleware;
pub extern crate serde;
pub extern crate smallvec;
pub extern crate spider_agent;
pub extern crate spider_fingerprint;
pub extern crate string_concat;
pub extern crate strum;
pub extern crate tokio;
pub extern crate tokio_stream;
pub extern crate ua_generator;
pub extern crate url;
pub use client::Client;
pub use client::ClientBuilder;
pub use traits::Crawler;
pub use traits::PageData;
pub use features::search; (feature search)
pub use features::search_providers; (feature search)
pub use case_insensitive_string::compact_str;
pub use chromiumoxide; (feature chrome)

Modules§

agent (feature agent)
Re-exports agent types from the spider_agent crate.
black_list (feature regex)
Blacklist checking for whether a URL exists, using regex.
client
Client interface.
configuration
Configuration structure for Website.
features
Optional features to use.
packages
Internal packages customized.
page
A page scraped.
retry_strategy
Configurable retry strategy for advanced retry logic.
traits
Trait abstractions for spider’s core types.
utils
Application utils.
website
A website to crawl.

Structs§

CaseInsensitiveString
Case-insensitive string handling.

Type Aliases§

RelativeSelectors
The selectors type. The values are retained so the relative domain can still be crawled after base redirects.