# spider-lib

A modular Rust web scraping framework inspired by Scrapy.

`spider-lib` is the facade crate for this workspace. It re-exports crawler runtime, downloader, middleware, pipelines, utility types, and macros so you can start with one dependency and enable only the features you need.

## Table of Contents

- [Workspace Crates](#workspace-crates)
- [Architecture at a Glance](#architecture-at-a-glance)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Downloader Usage](#downloader-usage)
- [Middleware Usage](#middleware-usage)
- [Pipeline Usage](#pipeline-usage)
- [Feature Flag Cookbook](#feature-flag-cookbook)
- [Development](#development)
- [Documentation](#documentation)
- [License](#license)

## Workspace Crates

- [`spider-core`](./spider-core/README.md): crawler runtime, spider trait, scheduler, builder, state, and stats.
- [`spider-downloader`](./spider-downloader/README.md): downloader traits and reqwest-based downloader implementation.
- [`spider-macro`](./spider-macro/README.md): procedural macros such as `#[scraped_item]`.
- [`spider-middleware`](./spider-middleware/README.md): retry, rate limiting, robots, cookies, proxy, cache, and user-agent middleware.
- [`spider-pipeline`](./spider-pipeline/README.md): item processing and output pipelines (JSON, JSONL, CSV, SQLite, stream JSON).
- [`spider-util`](./spider-util/README.md): shared request/response/item/error types and helper utilities.

## Architecture at a Glance

`Spider` generates initial URLs and parses responses, while the crawler orchestrates request execution and data flow.

```text
Spider::start_urls
  -> Scheduler
  -> Downloader (default: reqwest)
  -> Middleware chain (before/after download)
  -> Spider::parse(Response) -> ParseOutput { requests, items }
  -> Pipeline chain (transform/validate/dedup/export)
```
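
The same flow, seen from inside a spider: a minimal sketch of one `parse` step, assuming `ParseOutput`'s `requests` and `items` fields are public as the diagram suggests.

```rust,ignore
// Hedged sketch: the field names mirror the diagram above; whether they
// are public fields is an assumption -- verify against spider-util.
async fn parse(
    &self,
    response: Response,
    _state: &Self::State,
) -> Result<ParseOutput<Self::Item>, SpiderError> {
    let _ = response; // extract items and follow-up URLs from the body
    Ok(ParseOutput {
        requests: vec![], // follow-up Requests re-enter the Scheduler
        items: vec![],    // scraped items flow into the Pipeline chain
    })
}
```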

## Installation

```toml
[dependencies]
spider-lib = "3.0.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
```

`serde` and `serde_json` are required when using `#[scraped_item]`.

## Quick Start

```rust,no_run
use spider_lib::prelude::*;

#[scraped_item]
struct QuoteItem {
    text: String,
    author: String,
}

#[derive(Clone, Default)]
struct QuoteState;

struct QuoteSpider;

#[async_trait]
impl Spider for QuoteSpider {
    type Item = QuoteItem;
    type State = QuoteState;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://quotes.toscrape.com/"]
    }

    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
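        // Extract items and follow-up requests here; an empty
        // ParseOutput simply ends this branch of the crawl.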
        Ok(ParseOutput::new())
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuoteSpider)
        .add_middleware(RateLimitMiddleware::default())
        .add_middleware(RetryMiddleware::new())
        .add_pipeline(ConsolePipeline::new())
        .build()
        .await?;

    crawler.start_crawl().await
}
```

Try the maintained examples:

```bash
cargo run --example books
cargo run --example books_live --features live-stats
```

## Downloader Usage

### Default downloader

`CrawlerBuilder` uses a reqwest-based downloader by default, so no extra setup is needed for most projects.

### Build a Custom Downloader

Implement a custom downloader when you need to control transport behavior (special auth, an alternate HTTP stack, tracing, etc.).

Trait contract (`Downloader`):

```rust,ignore
#[async_trait]
pub trait Downloader: Send + Sync + 'static {
    type Client: Send + Sync;
    async fn download(&self, request: Request) -> Result<Response, SpiderError>;
    fn client(&self) -> &Self::Client;
}
```

```rust,ignore
use async_trait::async_trait;
use spider_lib::prelude::*;

struct SignedDownloader {
    client: reqwest::Client,
}

#[async_trait]
impl Downloader for SignedDownloader {
    type Client = reqwest::Client;

    async fn download(&self, request: Request) -> Result<Response, SpiderError> {
        let _req = request;
        // 1. Sign or otherwise mutate the outgoing request.
        // 2. Execute it with `self.client`.
        // 3. Map the HTTP result into spider-lib's `Response`.
        todo!()
    }

    fn client(&self) -> &Self::Client {
        &self.client
    }
}
```

Runtime integration:

```rust,ignore
let crawler = CrawlerBuilder::new(MySpider)
    .downloader(SignedDownloader {
        client: reqwest::Client::new(),
    })
    .build()
    .await?;
```

For trait details and lower-level integration, see [`spider-downloader`](./spider-downloader/README.md).

## Middleware Usage

Core middleware (always available):

- `RateLimitMiddleware`: controls request throughput.
- `RetryMiddleware`: retries failed/transient requests.
- `RefererMiddleware`: populates `Referer` header for follow-up requests.

Optional middleware (feature-gated):

- `HttpCacheMiddleware` (`middleware-cache`)
- `AutoThrottleMiddleware` (`middleware-autothrottle`)
- `ProxyMiddleware` (`middleware-proxy`)
- `UserAgentMiddleware` (`middleware-user-agent`)
- `RobotsTxtMiddleware` (`middleware-robots`)
- `CookieMiddleware` (`middleware-cookies`)

```rust,ignore
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .add_middleware(RefererMiddleware::new())
    .build()
    .await?;
```
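
Feature-gated middleware registers the same way once its feature is enabled. A hedged sketch; the constructor names below are assumptions, so check [`spider-middleware`](./spider-middleware/README.md) for the real signatures:

```rust,ignore
// Assumes the `middleware-robots` and `middleware-user-agent` features
// are enabled; constructor names are assumptions.
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RobotsTxtMiddleware::new())
    .add_middleware(UserAgentMiddleware::default())
    .build()
    .await?;
```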

### Build a Custom Middleware

Trait contract (`Middleware<C>`):

```rust,ignore
#[async_trait]
pub trait Middleware<C: Send + Sync>: Any + Send + Sync + 'static {
    fn name(&self) -> &str;
    async fn process_request(
        &mut self,
        client: &C,
        request: Request,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        Ok(MiddlewareAction::Continue(request))
    }
    async fn process_response(
        &mut self,
        response: Response,
    ) -> Result<MiddlewareAction<Response>, SpiderError> {
        Ok(MiddlewareAction::Continue(response))
    }
    async fn handle_error(
        &mut self,
        request: &Request,
        error: &SpiderError,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        Err(error.clone())
    }
}
```

Minimal implementation (override only what you need):

```rust,ignore
use spider_lib::prelude::*;

struct HeaderMiddleware;

#[async_trait]
impl<C: Send + Sync> Middleware<C> for HeaderMiddleware {
    fn name(&self) -> &str {
        "header_middleware"
    }

    async fn process_request(
        &mut self,
        _client: &C,
        request: Request,
    ) -> Result<MiddlewareAction<Request>, SpiderError> {
        // Mutate headers on `request` here before passing it along.
        Ok(MiddlewareAction::Continue(request))
    }
}
```

Runtime integration:

```rust,ignore
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(HeaderMiddleware)
    .build()
    .await?;
```

See full per-feature middleware examples in [`spider-middleware`](./spider-middleware/README.md).

## Pipeline Usage

Core pipelines (always available):

- `TransformPipeline`
- `ValidationPipeline`
- `DeduplicationPipeline`
- `ConsolePipeline`

Optional output pipelines (feature-gated):

- `JsonPipeline` (`pipeline-json`)
- `JsonlPipeline` (`pipeline-jsonl`)
- `CsvPipeline` (`pipeline-csv`)
- `SqlitePipeline` (`pipeline-sqlite`)
- `StreamJsonPipeline` (`pipeline-stream-json`)

```rust,ignore
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(
        TransformPipeline::new()
            .with_operation(TransformOperation::Trim { field: "title".into() }),
    )
    .add_pipeline(
        ValidationPipeline::new()
            .with_rule("title", ValidationRule::Required)
            .with_rule("title", ValidationRule::NonEmptyString),
    )
    .add_pipeline(DeduplicationPipeline::new(&["url"]))
    .add_pipeline(ConsolePipeline::new())
    .build()
    .await?;
```
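
Output pipelines register the same way once their feature is enabled. A hedged sketch; the `JsonlPipeline::new` signature is an assumption, so check [`spider-pipeline`](./spider-pipeline/README.md):

```rust,ignore
// Assumes the `pipeline-jsonl` feature is enabled; the constructor
// signature is an assumption.
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(JsonlPipeline::new("items.jsonl"))
    .build()
    .await?;
```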

### Build a Custom Pipeline

Trait contract (`Pipeline<I: ScrapedItem>`):

```rust,ignore
#[async_trait]
pub trait Pipeline<I: ScrapedItem>: Send + Sync + 'static {
    fn name(&self) -> &str;
    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError>;
    async fn close(&self) -> Result<(), PipelineError> { Ok(()) }
    async fn get_state(&self) -> Result<Option<serde_json::Value>, PipelineError> { Ok(None) }
    async fn restore_state(&self, state: serde_json::Value) -> Result<(), PipelineError> { Ok(()) }
}
```

Minimal implementation:

```rust,ignore
use spider_lib::prelude::*;

struct MetricsPipeline;

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for MetricsPipeline {
    fn name(&self) -> &str {
        "metrics_pipeline"
    }

    async fn process_item(&self, item: I) -> Result<Option<I>, PipelineError> {
        // Record metrics, enrich, or filter items; return Ok(None) to drop.
        Ok(Some(item))
    }
}
```

Runtime integration:

```rust,ignore
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(MetricsPipeline)
    .build()
    .await?;
```

See full per-feature pipeline examples in [`spider-pipeline`](./spider-pipeline/README.md).

## Feature Flag Cookbook

### Minimal core crawler

```toml
[dependencies]
spider-lib = "3.0.0"
```

### Robots + JSONL export

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-robots", "pipeline-jsonl"] }
```

### Proxy + user-agent rotation + CSV output

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-proxy", "middleware-user-agent", "pipeline-csv"] }
```

### Cache + autothrottle + SQLite output

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-cache", "middleware-autothrottle", "pipeline-sqlite"] }
```

### Live stats and resume support

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["live-stats", "checkpoint"] }
```

### Cookie-aware crawling

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["cookie-store"] }
```

`cookie-store` enables `middleware-cookies` transitively.
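
A minimal registration sketch, assuming `CookieMiddleware` implements `Default` like the middleware shown earlier:

```rust,ignore
// Assumes the `cookie-store` feature (which pulls in `middleware-cookies`)
// is enabled and that CookieMiddleware implements Default.
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(CookieMiddleware::default())
    .build()
    .await?;
```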

## Development

```bash
cargo check --workspace --all-targets
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
make check-all-features
```

## Documentation

- API docs: <https://docs.rs/spider-lib>
- Contribution guide: [CONTRIBUTING.md](./CONTRIBUTING.md)

## License

MIT. See [LICENSE](./LICENSE).