spider-lib

A modular Rust web scraping framework inspired by Scrapy.

spider-lib is the facade crate for this workspace. It re-exports the crawler runtime, downloader, middleware, pipelines, utility types, and macros, so you can start with one dependency and enable only the features you need.

Table of Contents

  • Workspace Crates
  • Architecture at a Glance
  • Installation
  • Quick Start
  • Downloader Usage
  • Middleware Usage
  • Pipeline Usage
  • Feature Flag Cookbook
  • Development
  • License

Workspace Crates

  • spider-core: crawler runtime, spider trait, scheduler, builder, state, and stats.
  • spider-downloader: downloader traits and reqwest-based downloader implementation.
  • spider-macro: procedural macros such as #[scraped_item].
  • spider-middleware: retry, rate limiting, robots, cookies, proxy, cache, and user-agent middleware.
  • spider-pipeline: item processing and output pipelines (JSON, JSONL, CSV, SQLite, stream JSON).
  • spider-util: shared request/response/item/error types and helper utilities.

Architecture at a Glance

A Spider implementation supplies the start URLs and parses responses, while the crawler orchestrates request execution and data flow.

Spider::start_urls
  -> Scheduler
  -> Downloader (default: reqwest)
  -> Middleware chain (before/after download)
  -> Spider::parse(Response) -> ParseOutput { requests, items }
  -> Pipeline chain (transform/validate/dedup/export)

Installation

[dependencies]
spider-lib = "3.0.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

serde and serde_json are required when using #[scraped_item].
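
For context, a hand-written equivalent of what the macro is assumed to provide (an assumption; #[scraped_item] may derive more than this):

use serde::{Deserialize, Serialize};

// Assumption: #[scraped_item] derives at least Serialize/Deserialize
// so output pipelines can encode the item.
#[derive(Serialize, Deserialize, Debug, Clone)]
struct QuoteItem {
    text: String,
    author: String,
}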

Quick Start

use spider_lib::prelude::*;

#[scraped_item]
struct QuoteItem {
    text: String,
    author: String,
}

#[derive(Clone, Default)]
struct QuoteState;

struct QuoteSpider;

#[async_trait]
impl Spider for QuoteSpider {
    type Item = QuoteItem;
    type State = QuoteState;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://quotes.toscrape.com/"]
    }

    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        Ok(ParseOutput::new())
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuoteSpider)
        .add_middleware(RateLimitMiddleware::default())
        .add_middleware(RetryMiddleware::new())
        .add_pipeline(ConsolePipeline::new())
        .build()
        .await?;

    crawler.start_crawl().await
}
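
The parse stub above returns an empty ParseOutput. A fuller sketch of the method, inside the same impl block, assuming the requests and items fields shown in the architecture diagram are public collections (see spider-core if the real API differs):

    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let mut output = ParseOutput::new();
        // Assumption: `items` is a public Vec<Self::Item>, matching the
        // `ParseOutput { requests, items }` shape in the diagram.
        output.items.push(QuoteItem {
            text: "a quote extracted from the response body".into(),
            author: "an author name".into(),
        });
        // Follow-up requests would be pushed onto `output.requests` here.
        Ok(output)
    }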

Try the maintained examples:

cargo run --example books
cargo run --example books_live --features live-stats

Downloader Usage

Default Downloader

CrawlerBuilder uses a reqwest-based downloader by default, so no extra setup is needed for most projects.
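
With the default downloader, the builder needs no downloader call at all:

let crawler = CrawlerBuilder::new(MySpider)
    .build()
    .await?;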

Build a Custom Downloader

Implement a custom downloader when you need control over transport behavior (special auth, an alternate HTTP stack, tracing, etc.).

Trait contract (Downloader):

#[async_trait]
pub trait Downloader: Send + Sync + 'static {
    type Client: Send + Sync;
    async fn download(&self, request: Request) -> Result<Response, SpiderError>;
    fn client(&self) -> &Self::Client;
}

Example implementation:

use async_trait::async_trait;
use spider_lib::prelude::*;

struct SignedDownloader {
    client: reqwest::Client,
}

#[async_trait]
impl Downloader for SignedDownloader {
    type Client = reqwest::Client;

    async fn download(&self, request: Request) -> Result<Response, SpiderError> {
        let _req = request;
        // Sign request, execute HTTP call, map response.
        todo!()
    }

    fn client(&self) -> &Self::Client {
        &self.client
    }
}

Runtime integration:

let crawler = CrawlerBuilder::new(MySpider)
    .downloader(SignedDownloader {
        client: reqwest::Client::new(),
    })
    .build()
    .await?;

For trait details and lower-level integration, see spider-downloader.

Middleware Usage

Core middleware (always available):

  • RateLimitMiddleware: controls request throughput.
  • RetryMiddleware: retries failed/transient requests.
  • RefererMiddleware: populates the Referer header on follow-up requests.

Optional middleware (feature-gated):

  • HttpCacheMiddleware (middleware-cache)
  • AutoThrottleMiddleware (middleware-autothrottle)
  • ProxyMiddleware (middleware-proxy)
  • UserAgentMiddleware (middleware-user-agent)
  • RobotsTxtMiddleware (middleware-robots)
  • CookieMiddleware (middleware-cookies)

Registering middleware:

let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .add_middleware(RefererMiddleware::new())
    .build()
    .await?;

Build a Custom Middleware

Trait contract (Middleware<C>):

#[async_trait]
pub trait Middleware<C: Send + Sync>: Any + Send + Sync + 'static {
    fn name(&self) -> &str;
    async fn process_request(
        &mut self,
        client: &C,
        request: Request,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        Ok(MiddlewareAction::Continue(request))
    }
    async fn process_response(
        &mut self,
        response: Response,
    ) -> Result<MiddlewareAction<Response>, SpiderError> {
        Ok(MiddlewareAction::Continue(response))
    }
    async fn handle_error(
        &mut self,
        request: &Request,
        error: &SpiderError,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        Err(error.clone())
    }
}

Minimal implementation (override only what you need):

use spider_lib::prelude::*;

struct HeaderMiddleware;

#[async_trait]
impl<C: Send + Sync> Middleware<C> for HeaderMiddleware {
    fn name(&self) -> &str {
        "header_middleware"
    }

    async fn process_request(
        &mut self,
        _client: &C,
        request: Request,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        // Mutate request headers here before passing the request on.
        Ok(MiddlewareAction::Continue(request))
    }
}

Runtime integration:

let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(HeaderMiddleware)
    .build()
    .await?;

See full per-feature middleware examples in spider-middleware.
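
A middleware can also hook the response side. A minimal sketch that only observes responses and passes them through unchanged (the trait methods match the contract above; no client access is needed, so the impl stays generic over C):

use spider_lib::prelude::*;

struct LoggingMiddleware;

#[async_trait]
impl<C: Send + Sync> Middleware<C> for LoggingMiddleware {
    fn name(&self) -> &str {
        "logging_middleware"
    }

    async fn process_response(
        &mut self,
        response: Response,
    ) -> Result<MiddlewareAction<Response>, SpiderError> {
        // Observe the response (log, count, time) and pass it through.
        println!("logging_middleware: response received");
        Ok(MiddlewareAction::Continue(response))
    }
}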

Pipeline Usage

Core pipelines (always available):

  • TransformPipeline
  • ValidationPipeline
  • DeduplicationPipeline
  • ConsolePipeline

Optional output pipelines (feature-gated):

  • JsonPipeline (pipeline-json)
  • JsonlPipeline (pipeline-jsonl)
  • CsvPipeline (pipeline-csv)
  • SqlitePipeline (pipeline-sqlite)
  • StreamJsonPipeline (pipeline-stream-json)

Registering pipelines:

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(
        TransformPipeline::new()
            .with_operation(TransformOperation::Trim { field: "title".into() }),
    )
    .add_pipeline(
        ValidationPipeline::new()
            .with_rule("title", ValidationRule::Required)
            .with_rule("title", ValidationRule::NonEmptyString),
    )
    .add_pipeline(DeduplicationPipeline::new(&["url"]))
    .add_pipeline(ConsolePipeline::new())
    .build()
    .await?;

Build a Custom Pipeline

Trait contract (Pipeline<I: ScrapedItem>):

#[async_trait]
pub trait Pipeline<I: ScrapedItem>: Send + Sync + 'static {
    fn name(&self) -> &str;
    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError>;
    async fn close(&self) -> Result<(), PipelineError> { Ok(()) }
    async fn get_state(&self) -> Result<Option<serde_json::Value>, PipelineError> { Ok(None) }
    async fn restore_state(&self, state: serde_json::Value) -> Result<(), PipelineError> { Ok(()) }
}

Minimal implementation:

use spider_lib::prelude::*;

struct MetricsPipeline;

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for MetricsPipeline {
    fn name(&self) -> &str {
        "metrics_pipeline"
    }

    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError> {
        // Record metrics, enrich, or filter items.
        Ok(Some(item))
    }
}

Runtime integration:

let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(MetricsPipeline)
    .build()
    .await?;

See full per-feature pipeline examples in spider-pipeline.
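
Because process_item takes &self, a stateful pipeline needs interior mutability. A minimal sketch of a counting pipeline that reports its total in close (the pipeline name and output format are illustrative):

use std::sync::atomic::{AtomicUsize, Ordering};

use spider_lib::prelude::*;

struct CountingPipeline {
    seen: AtomicUsize,
}

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for CountingPipeline {
    fn name(&self) -> &str {
        "counting_pipeline"
    }

    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError> {
        // Count the item and pass it through; returning Ok(None) would
        // filter it out of later pipelines instead.
        self.seen.fetch_add(1, Ordering::Relaxed);
        Ok(Some(item))
    }

    async fn close(&self) -> Result<(), PipelineError> {
        // close runs once when the crawl finishes.
        println!("counting_pipeline: {} items", self.seen.load(Ordering::Relaxed));
        Ok(())
    }
}

Register it like any core pipeline: .add_pipeline(CountingPipeline { seen: AtomicUsize::new(0) }).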

Feature Flag Cookbook

Minimal core crawler

[dependencies]
spider-lib = "3.0.0"

Robots + JSONL export

[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-robots", "pipeline-jsonl"] }

Proxy + user-agent rotation + CSV output

[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-proxy", "middleware-user-agent", "pipeline-csv"] }

Cache + autothrottle + SQLite output

[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-cache", "middleware-autothrottle", "pipeline-sqlite"] }

Live stats and resume support

[dependencies]
spider-lib = { version = "3.0.0", features = ["live-stats", "checkpoint"] }

Cookie-aware crawling

[dependencies]
spider-lib = { version = "3.0.0", features = ["cookie-store"] }

cookie-store enables middleware-cookies transitively.
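
Feature-gated components register through the same builder calls as core ones. A sketch for the Robots + JSONL recipe above; both constructors are assumptions, so check spider-middleware and spider-pipeline for the exact signatures:

let crawler = CrawlerBuilder::new(MySpider)
    // Assumed constructors; consult the sub-crate docs before use.
    .add_middleware(RobotsTxtMiddleware::new())
    .add_pipeline(JsonlPipeline::new("items.jsonl"))
    .build()
    .await?;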

Development

cargo check --workspace --all-targets
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
make check-all-features

License

MIT. See LICENSE.