# spider-lib

A Rust-based web scraping framework inspired by Scrapy.

[![crates.io](https://img.shields.io/crates/v/spider-lib.svg)](https://crates.io/crates/spider-lib)
[![docs.rs](https://docs.rs/spider-lib/badge.svg)](https://docs.rs/spider-lib)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Build Status](https://github.com/mzyui/spider-lib/actions/workflows/rust.yml/badge.svg)](https://github.com/mzyui/spider-lib/actions/workflows/rust.yml)

`spider-lib` is an asynchronous, concurrent web scraping library for Rust. It's designed to be a lightweight yet powerful tool for building and running scrapers for projects of any size. If you're familiar with Scrapy's architecture of Spiders, Middlewares, and Pipelines, you'll feel right at home.

## Getting Started

To use `spider-lib`, add it to your project's `Cargo.toml`:

```toml
[dependencies]
spider-lib = "0.2" # Check crates.io for the latest version
```

## Quick Example

Here's a minimal example of a spider that scrapes quotes from `quotes.toscrape.com`.

For convenience, `spider-lib` offers a prelude that re-exports the most commonly used items.

```rust
// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::utils::ToSelector; // ToSelector is not in the prelude

#[scraped_item]
pub struct QuoteItem {
    pub text: String,
    pub author: String,
}

pub struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = QuoteItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["http://quotes.toscrape.com/"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        for quote in html.select(&".quote".to_selector()?) {
            let text = quote.select(&".text".to_selector()?).next().map(|e| e.text().collect()).unwrap_or_default();
            let author = quote.select(&".author".to_selector()?).next().map(|e| e.text().collect()).unwrap_or_default();
            output.add_item(QuoteItem { text, author });
        }

        if let Some(next_href) = html.select(&".next > a[href]".to_selector()?).next().and_then(|a| a.attr("href")) {
            let next_url = response.url.join(next_href)?;
            output.add_request(Request::new(next_url));
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    tracing_subscriber::fmt().with_max_level(tracing::Level::INFO).init();

    // The builder defaults to using ReqwestClientDownloader
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .build()
        .await?;

    crawler.start_crawl().await?;

    Ok(())
}
```

## Features

*   **Asynchronous & Concurrent:** A high-performance scraping framework built on `tokio`, using an actor-like concurrency model for efficient task handling.
*   **Graceful Shutdown:** Ensures clean termination on `Ctrl+C`, allowing in-flight tasks to complete and flushing all data.
*   **Checkpoint and Resume:** Saves the crawler's state (scheduler, pipelines) to a file so a crawl can be resumed later, supporting both manual and periodic automatic saves and salvaging unprocessed requests.
*   **Request Deduplication:** Utilizes request fingerprinting to prevent duplicate requests from being processed, ensuring efficiency and avoiding redundant work.
*   **Familiar Architecture:** Leverages a modular design with Spiders, Middlewares, and Item Pipelines, drawing inspiration from Scrapy.
*   **Configurable Concurrency:** Offers fine-grained control over the number of concurrent downloads, parsing workers, and pipeline processing for optimized performance.
*   **Advanced Link Extraction:** A `Response` method extracts, resolves, and categorizes the various types of links found in HTML content.
*   **Fluent Configuration:** A `CrawlerBuilder` API simplifies assembling and configuring your web crawler (see the sketch after this list).
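
As a concrete illustration of the last two points, here is a minimal sketch of how the concurrency knobs might be set through the fluent builder, reusing the `QuotesSpider` from the Quick Example. The setter names (`concurrent_downloads`, `parser_workers`, `pipeline_workers`) are illustrative placeholders, not confirmed `CrawlerBuilder` methods; check the API reference on docs.rs for the actual names.

```rust
use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // QuotesSpider is the spider defined in the Quick Example above.
    // The three setters below are hypothetical names for the concurrency
    // controls described in the feature list; the real methods may differ.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .concurrent_downloads(8) // hypothetical: cap on simultaneous HTTP requests
        .parser_workers(4)       // hypothetical: parallel parsing tasks
        .pipeline_workers(2)     // hypothetical: parallel item-pipeline workers
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```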

#### Built-in Middlewares

The following middlewares are included by default:
*   **Rate Limiting:** Controls request rates to prevent server overload.
*   **Retries:** Automatically retries failed or timed-out requests.
*   **User-Agent Rotation:** Manages and rotates user agents.
*   **Referer Management:** Handles the `Referer` header.

Additional middlewares are available via feature flags (a registration sketch follows the list):
*   **HTTP Caching:** Caches responses to accelerate development (`middleware-http-cache`).
*   **Respect Robots.txt:** Adheres to `robots.txt` rules (`middleware-robots-txt`).
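
Enabling a feature flag makes the middleware type available; the sketch below guesses at how it might then be attached to the builder. The `spider_lib::middlewares` import path, the constructors, and the `with_middleware` method are assumptions rather than confirmed API, so treat this as the shape of the code, not a copy-paste recipe.

```rust
use spider_lib::prelude::*;
// Assumed import path for the feature-gated middlewares; check docs.rs for the real one.
use spider_lib::middlewares::{HttpCacheMiddleware, RobotsTxtMiddleware};

// Builds and runs the Quick Example spider with response caching and robots.txt support.
async fn crawl_politely() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_middleware(HttpCacheMiddleware::new(".http-cache")) // assumed constructor taking a cache directory
        .with_middleware(RobotsTxtMiddleware::default())          // assumed constructor
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```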

#### Built-in Item Pipelines

The following pipelines are included by default:
*   **Deduplication:** Filters out duplicate items based on a configurable key.
*   **Console Writer:** A simple pipeline for printing items to the console.

Exporter pipelines are available via feature flags (a registration sketch follows the list):
*   **JSON / JSON Lines:** Saves items to `.json` or `.jsonl` files (`pipeline-json`).
*   **CSV:** Saves items to `.csv` files (`pipeline-csv`).
*   **SQLite:** Saves items to a SQLite database (`pipeline-sqlite`).
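
To round out the picture, here is a hedged sketch of registering one of the exporter pipelines so that scraped `QuoteItem`s end up in a JSON Lines file. As with the middleware sketch, the `spider_lib::pipelines` import path, the `with_pipeline` method, and the `JsonlWriterPipeline::new` signature are assumptions; only the type name and the `pipeline-json` feature flag come from this README.

```rust
use spider_lib::prelude::*;
// Assumed import path; requires the `pipeline-json` feature.
use spider_lib::pipelines::JsonlWriterPipeline;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // Writes each scraped item as one JSON object per line to quotes.jsonl.
    // `with_pipeline` and the constructor argument are illustrative guesses.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_pipeline(JsonlWriterPipeline::new("quotes.jsonl"))
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```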

For complete, runnable examples, please refer to the `examples/` directory in this repository. You can run an example using `cargo run --example <example_name> --features <features>`, for instance: `cargo run --example quotes --features "pipeline-json"`.

## Feature Flags

`spider-lib` uses feature flags to keep the core library lightweight while allowing for optional functionality. To use a feature, add it to your `Cargo.toml`.

*   `pipeline-json`: Enables `JsonWriterPipeline` and `JsonlWriterPipeline`.
*   `pipeline-csv`: Enables `CsvExporterPipeline`.
*   `pipeline-sqlite`: Enables `SqliteWriterPipeline`.
*   `middleware-http-cache`: Enables `HttpCacheMiddleware` for disk-based response caching.
*   `middleware-robots-txt`: Enables `RobotsTxtMiddleware` for respecting `robots.txt` files.
*   `checkpoint`: Enables crawler checkpointing for saving and resuming crawls.

Example of enabling multiple features:

```toml
[dependencies]
spider-lib = { version = "0.3", features = ["pipeline-json", "middleware-http-cache"] }
```