spider-lib

A modular Rust web scraping framework inspired by Scrapy (Python).

spider-lib is the facade crate for the workspace. It re-exports core crawling, downloader, middleware, pipeline, utility, and macro APIs so you can start with one dependency and enable only the features you need.

Workspace Crates

  • spider-core: crawler runtime, spider trait, scheduler, builder, state, and stats.
  • spider-downloader: downloader traits and reqwest-based downloader implementation.
  • spider-macro: procedural macros such as #[scraped_item].
  • spider-middleware: retry, rate limiting, robots, cookies, proxy, cache, and user-agent middleware.
  • spider-pipeline: item processing and output pipelines (JSON, JSONL, CSV, SQLite, stream JSON).
  • spider-util: shared request/response/item/error types and helper utilities.

Installation

[dependencies]
spider-lib = "2.0.4"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

serde and serde_json are required when using #[scraped_item].

Quick Start

use spider_lib::prelude::*;

// Item type emitted by the spider; #[scraped_item] wires up serde support.
#[scraped_item]
struct QuoteItem {
    text: String,
}

// Shared per-crawl state (unused in this minimal example).
#[derive(Clone, Default)]
struct QuoteState;

struct QuoteSpider;

#[async_trait]
impl Spider for QuoteSpider {
    type Item = QuoteItem;
    type State = QuoteState;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://quotes.toscrape.com/"]
    }

    // Called for each downloaded response; scraped items are returned
    // via ParseOutput. This minimal example yields nothing.
    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        Ok(ParseOutput::new())
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuoteSpider).build().await?;
    crawler.start_crawl().await
}
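A fuller `parse` implementation depends on spider-lib's response and selector APIs, which this README does not document. Purely as an illustrative sketch (the `css`, `text`, and `push_item` calls below are hypothetical names, not confirmed spider-lib API), a quote-extracting `parse` might look like:

```rust
// Hypothetical sketch only: `Response::css`, selector `.text()`, and
// `ParseOutput::push_item` are illustrative, not confirmed API.
async fn parse(
    &self,
    response: Response,
    _state: &Self::State,
) -> Result<ParseOutput<Self::Item>, SpiderError> {
    let mut out = ParseOutput::new();
    for quote in response.css(".quote .text") {
        out.push_item(QuoteItem { text: quote.text() });
    }
    Ok(out)
}
```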

Try the maintained examples:

cargo run --example books
cargo run --example books_live --features live-stats

Feature Flags

Default feature: core.

Middleware features:

  • middleware-cache
  • middleware-autothrottle
  • middleware-proxy
  • middleware-user-agent
  • middleware-robots
  • middleware-cookies

Pipeline features:

  • pipeline-csv
  • pipeline-json
  • pipeline-jsonl
  • pipeline-sqlite
  • pipeline-stream-json

Core features:

  • live-stats: enables in-place terminal stat updates.
  • checkpoint: enables checkpoint/resume support.
  • cookie-store: enables cookie store integration (also enables middleware-cookies).

Example:

[dependencies]
spider-lib = { version = "2.0.4", features = ["middleware-robots", "pipeline-jsonl"] }

Using Workspace Crates Directly

Use spider-lib when you want the integrated API surface.

Use sub-crates directly if you need tighter dependency control or only one subsystem (for example, custom downloader integration with spider-downloader, or utility types from spider-util).
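For example, a project that only needs the shared types and the downloader subsystem might depend on the sub-crates directly, skipping the facade (assuming the sub-crates are published at the same version as spider-lib; check crates.io for the actual versions):

```toml
[dependencies]
spider-util = "2.0.4"
spider-downloader = "2.0.4"
```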

Development

cargo check --workspace --all-targets
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
make check-all-features

License

MIT. See LICENSE.