# spider-lib

A modular Rust web scraping framework inspired by Scrapy.

`spider-lib` is the facade crate for the workspace. It re-exports core crawling, downloader, middleware, pipeline, utility, and macro APIs so you can start with one dependency and enable only the features you need.

## Workspace Crates

- [`spider-core`](./spider-core/README.md): crawler runtime, spider trait, scheduler, builder, state, and stats.
- [`spider-downloader`](./spider-downloader/README.md): downloader traits and reqwest-based downloader implementation.
- [`spider-macro`](./spider-macro/README.md): procedural macros such as `#[scraped_item]`.
- [`spider-middleware`](./spider-middleware/README.md): retry, rate limiting, robots, cookies, proxy, cache, and user-agent middleware.
- [`spider-pipeline`](./spider-pipeline/README.md): item processing and output pipelines (JSON, JSONL, CSV, SQLite, stream JSON).
- [`spider-util`](./spider-util/README.md): shared request/response/item/error types and helper utilities.

## Installation

```toml
[dependencies]
spider-lib = "2.0.4"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
```

`serde` and `serde_json` are required when using `#[scraped_item]`.

## Quick Start

```rust,no_run
use spider_lib::prelude::*;

#[scraped_item]
struct QuoteItem {
    text: String,
}

#[derive(Clone, Default)]
struct QuoteState;

struct QuoteSpider;

#[async_trait]
impl Spider for QuoteSpider {
    type Item = QuoteItem;
    type State = QuoteState;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://quotes.toscrape.com/"]
    }

    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        Ok(ParseOutput::new())
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuoteSpider).build().await?;
    crawler.start_crawl().await
}
```

Try the maintained examples:

```bash
cargo run --example books
cargo run --example books_live --features live-stats
```

## Feature Flags

Default feature: `core`.

Middleware features:

- `middleware-cache`
- `middleware-autothrottle`
- `middleware-proxy`
- `middleware-user-agent`
- `middleware-robots`
- `middleware-cookies`

Pipeline features:

- `pipeline-csv`
- `pipeline-json`
- `pipeline-jsonl`
- `pipeline-sqlite`
- `pipeline-stream-json`

Core features:

- `live-stats`: enables in-place terminal stat updates.
- `checkpoint`: enables checkpoint/resume support.
- `cookie-store`: enables cookie store integration (also enables `middleware-cookies`).

Example:

```toml
[dependencies]
spider-lib = { version = "2.0.4", features = ["middleware-robots", "pipeline-jsonl"] }
```

## Using Workspace Crates Directly

Use `spider-lib` when you want the integrated API surface.

Use sub-crates directly if you need tighter dependency control or only one subsystem (for example, custom downloader integration with `spider-downloader`, or utility types from `spider-util`).
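For example, a project that only needs the downloader and shared types might depend on those two crates alone. The version numbers below assume the sub-crates are published at the same version as the workspace; verify each crate's latest release on crates.io:

```toml
[dependencies]
# Assumed to track the workspace version (2.0.4); check crates.io.
spider-downloader = "2.0.4"
spider-util = "2.0.4"
```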

## Development

```bash
cargo check --workspace --all-targets
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo test --all-features
make check-all-features
```

## Documentation

- API docs: <https://docs.rs/spider-lib>
- Contribution guide: [CONTRIBUTING.md](./CONTRIBUTING.md)

## License

MIT. See [LICENSE](./LICENSE).