# spider-lib

A Rust-based web scraping framework inspired by Scrapy.

[![crates.io](https://img.shields.io/crates/v/spider-lib.svg)](https://crates.io/crates/spider-lib)
[![docs.rs](https://docs.rs/spider-lib/badge.svg)](https://docs.rs/spider-lib)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

`spider-lib` is an asynchronous, concurrent web scraping library for Rust. It's designed to be a lightweight yet powerful tool for building and running scrapers for projects of any size. If you're familiar with Scrapy's architecture of Spiders, Middlewares, and Pipelines, you'll feel right at home.

## Getting Started

To use `spider-lib`, add it to your project's `Cargo.toml`:

```toml
[dependencies]
spider-lib = "0.4" # Check crates.io for the latest version
```

## Quick Example

Here's a minimal example of a spider that scrapes quotes from `quotes.toscrape.com`.

For convenience, `spider-lib` offers a prelude that re-exports the most commonly used items.

```rust
// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::utils::ToSelector; // ToSelector is not in the prelude

#[scraped_item]
pub struct QuoteItem {
    pub text: String,
    pub author: String,
}

pub struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = QuoteItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["http://quotes.toscrape.com/"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        for quote in html.select(&".quote".to_selector()?) {
            let text = quote.select(&".text".to_selector()?).next().map(|e| e.text().collect()).unwrap_or_default();
            let author = quote.select(&".author".to_selector()?).next().map(|e| e.text().collect()).unwrap_or_default();
            output.add_item(QuoteItem { text, author });
        }

        if let Some(next_href) = html.select(&".next > a[href]".to_selector()?).next().and_then(|a| a.attr("href")) {
            let next_url = response.url.join(next_href)?;
            output.add_request(Request::new(next_url));
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    tracing_subscriber::fmt().with_max_level(tracing::Level::INFO).init();

    // The builder defaults to using ReqwestClientDownloader
    let crawler = CrawlerBuilder::new(QuotesSpider)
        .build()
        .await?;

    crawler.start_crawl().await?;

    Ok(())
}
```

## Features

*   **Asynchronous & Concurrent:** Built on `tokio` with an actor-like concurrency model for efficient, parallel task handling.
*   **Crawl Statistics:** Automatically collects and logs statistics about the crawl's progress, including requests, responses (with status codes), items scraped, and downloaded bytes. The `StatCollector` can also be accessed programmatically via `crawler.get_stats()` for custom reporting and integration (see the sketch after this list).
*   **Graceful Shutdown:** Ensures clean termination on `Ctrl+C`, allowing in-flight tasks to complete and flushing all data.
*   **Checkpoint and Resume:** Allows saving the crawler's state (scheduler, pipelines) to a file and resuming the crawl later, supporting both manual and periodic automatic saves. This includes salvaging unprocessed requests.
*   **Request Deduplication:** Uses request fingerprinting to prevent duplicate requests from being processed, avoiding redundant work.
*   **Familiar Architecture:** Leverages a modular design with Spiders, Middlewares, and Item Pipelines, drawing inspiration from Scrapy.
*   **Configurable Concurrency:** Offers fine-grained control over the number of concurrent downloads, parsing workers, and pipeline processing for optimized performance.
*   **Advanced Link Extraction:** The `Response` object provides a method to extract, resolve, and categorize the links found in HTML content.
*   **Fluent Configuration:** A `CrawlerBuilder` API simplifies the assembly and configuration of your web crawler.

For complete, runnable examples, please refer to the `examples/` directory in this repository. You can run an example using `cargo run --example <example_name> --features <features>`, for instance: `cargo run --example quotes --features "pipeline-json"`.

## Configuration Examples

While `spider-lib` provides sensible defaults, you can fine-tune its behavior by configuring middlewares, pipelines, and the crawler itself.

### Middlewares

Middlewares inspect and modify requests and responses. They can be added to the `CrawlerBuilder`.

The following middlewares are included by default:
*   **Rate Limiting:** Controls request rates to prevent server overload.
*   **Retries:** Automatically retries failed or timed-out requests.
*   **User-Agent Rotation:** Manages and rotates user agents.
*   **Referer Management:** Handles the `Referer` header.

Additional middlewares are available via feature flags:
*   **Cookie Management:** Persists cookies across requests to maintain sessions (`middleware-cookies`).
*   **HTTP Caching:** Caches responses to accelerate development (`middleware-http-cache`).
*   **Respect Robots.txt:** Adheres to `robots.txt` rules (`middleware-robots-txt`).

#### `CookieMiddleware`

This middleware automatically manages cookies to maintain sessions across requests, which is essential for scraping sites that require logins. It is enabled via the `middleware-cookies` feature. For robust operation, it's also integrated with the checkpointing system, so cookie sessions are saved and restored along with the rest of the crawl state.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `middleware-cookies` feature in Cargo.toml
use spider_lib::middlewares::cookies::CookieMiddleware;
use cookie_store::CookieStore;
use std::sync::Arc;
use tokio::sync::Mutex;

// ... inside your main async function
let cookie_store = Arc::new(Mutex::new(CookieStore::default()));

let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    .add_middleware(CookieMiddleware::new(cookie_store.clone()))
    .build()
    .await?;
```

#### `UserAgentMiddleware`

This middleware manages and rotates User-Agent strings. It can be configured with different rotation strategies, User-Agent sources, and even apply different rules for different domains.

**Available Strategies (`UserAgentRotationStrategy`):**
*   `Random`: (Default) Selects a User-Agent randomly.
*   `Sequential`: Cycles through the list of User-Agents in order.
*   `Sticky`: A User-Agent is "stuck" to a domain on first encounter and reused for the entire crawl.
*   `StickySession`: A User-Agent is "stuck" to a domain for a configured session duration.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::middlewares::user_agent::{
    UserAgentMiddleware, UserAgentRotationStrategy, UserAgentSource, BuiltinUserAgentList
};
use std::time::Duration;

// ... inside your main async function
let ua_middleware = UserAgentMiddleware::builder()
    // Set the default strategy for all domains
    .strategy(UserAgentRotationStrategy::Random)
    // Set the default source of User-Agents
    .source(UserAgentSource::Builtin(BuiltinUserAgentList::Chrome))
    // Set the session duration for the `StickySession` strategy
    .session_duration(Duration::from_secs(60 * 5))
    // Use a different User-Agent source specifically for "example.org"
    .per_domain_source(
        "example.org".to_string(),
        UserAgentSource::Builtin(BuiltinUserAgentList::Firefox)
    )
    // Use a different strategy for "example.com"
    .per_domain_strategy(
        "example.com".to_string(),
        UserAgentRotationStrategy::Sticky
    )
    .build()?;
```

#### `RateLimitMiddleware`

This middleware controls the request rate to avoid overloading servers. By default, it uses an adaptive limiter on a per-domain basis. You can configure it to use a fixed rate instead.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::middlewares::rate_limit::{RateLimitMiddleware, Scope};

// ... inside your main async function
let rate_limit_middleware = RateLimitMiddleware::builder()
    // Apply one rate limit across all domains
    .scope(Scope::Global)
    // Use a token bucket algorithm to allow 5 requests per second
    .use_token_bucket_limiter(5)
    .build();
```

#### `HttpCacheMiddleware`

This middleware caches HTTP responses to disk, which can significantly speed up development and re-runs by avoiding redundant network requests. It's enabled via the `middleware-http-cache` feature.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `middleware-http-cache` feature in Cargo.toml
use spider_lib::middlewares::http_cache::HttpCacheMiddleware;
use std::path::PathBuf;

// ... inside your main async function
let http_cache_middleware = HttpCacheMiddleware::builder()
    // Set a custom directory for storing cache files
    .cache_dir(PathBuf::from("output/http_cache"))
    .build()?;
```

#### `RefererMiddleware`

This middleware automatically manages the `Referer` HTTP header, simulating natural browsing behavior.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::middlewares::referer::RefererMiddleware;

// ... inside your main async function
let referer_middleware = RefererMiddleware::new()
    // Ensure referer is only set for requests to the same origin
    .same_origin_only(true)
    // Keep a maximum of 500 referer URLs in memory
    .max_chain_length(500)
    // Do not include URL fragments in the referer header
    .include_fragment(false);
```

#### `RetryMiddleware`

This middleware automatically retries failed requests based on HTTP status codes or network errors, using an exponential backoff strategy.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::middlewares::retry::RetryMiddleware;
use std::time::Duration;

// ... inside your main async function
let retry_middleware = RetryMiddleware::new()
    // Allow up to 5 retry attempts
    .max_retries(5)
    // Define which HTTP status codes should trigger a retry
    .retry_http_codes(vec![500, 502, 503, 504, 408, 429])
    // Set the exponential backoff factor
    .backoff_factor(2.0)
    // Cap the maximum delay between retries at 300 seconds (5 minutes)
    .max_delay(Duration::from_secs(300));
```

#### `RobotsTxtMiddleware`

This middleware respects `robots.txt` rules, preventing the crawler from accessing disallowed paths. It's enabled via the `middleware-robots-txt` feature.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `middleware-robots-txt` feature in Cargo.toml
use spider_lib::middlewares::robots_txt::RobotsTxtMiddleware;
use std::time::Duration;

// ... inside your main async function
let robots_txt_middleware = RobotsTxtMiddleware::new()
    // Cache robots.txt rules for 12 hours
    .cache_ttl(Duration::from_secs(60 * 60 * 12))
    // Store up to 5000 robots.txt files in cache
    .cache_capacity(5_000)
    // Set a timeout of 10 seconds for fetching robots.txt files
    .request_timeout(Duration::from_secs(10));
```
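
Once configured, these middlewares are attached to the crawler through the builder, just like the `CookieMiddleware` example earlier. A minimal sketch, assuming each of the middleware values built in this section is accepted by `add_middleware` (the exact combination shown is illustrative):

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::middlewares::referer::RefererMiddleware;
use spider_lib::middlewares::retry::RetryMiddleware;

// ... inside your main async function
let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    // Register each configured middleware on the builder.
    .add_middleware(RefererMiddleware::new().same_origin_only(true))
    .add_middleware(RetryMiddleware::new().max_retries(5))
    .build()
    .await?;
```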

### Pipelines

Item Pipelines are used for processing, filtering, or saving scraped items.

The following pipelines are included by default:
*   **Deduplication:** Filters out duplicate items based on a configurable key.
*   **Console Writer:** A simple pipeline for printing items to the console.

Exporter pipelines are available via feature flags:
*   **JSON / JSON Lines:** Saves items to `.json` or `.jsonl` files (`pipeline-json`).
*   **CSV:** Saves items to `.csv` files (`pipeline-csv`).
*   **SQLite:** Saves items to a SQLite database (`pipeline-sqlite`).

#### `ConsoleWriterPipeline`

A simple pipeline that prints each scraped item to the console. Useful for debugging.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::pipelines::console_writer::ConsoleWriterPipeline;

// ... inside your main async function
let console_pipeline = ConsoleWriterPipeline::new();
```

#### `DeduplicationPipeline`

This pipeline filters out duplicate items based on a configurable set of fields.

```rust,no_run
use spider_lib::prelude::*;
use spider_lib::pipelines::deduplication::DeduplicationPipeline;

// ... inside your main async function
let deduplication_pipeline = DeduplicationPipeline::new(&["url", "title"]);
```

#### `JsonWriterPipeline` & `JsonlWriterPipeline`

These pipelines save scraped items to a file. They are enabled with the `pipeline-json` feature.
*   `JsonWriterPipeline`: Collects all items and writes them to a single, pretty-printed JSON array at the end of the crawl.
*   `JsonlWriterPipeline`: Writes each item as a separate JSON object on a new line, which is efficient for streaming large amounts of data.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `pipeline-json` feature in Cargo.toml
use spider_lib::pipelines::json_writer::JsonWriterPipeline;
use spider_lib::pipelines::jsonl_writer::JsonlWriterPipeline;

// ... inside your main async function
let json_pipeline = JsonWriterPipeline::new("output/items.json")?;
let jsonl_pipeline = JsonlWriterPipeline::new("output/items.jsonl")?;

let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    .add_pipeline(json_pipeline)
    .add_pipeline(jsonl_pipeline)
    // ... configure other middlewares
    .build()
    .await?;
```

#### `CsvExporterPipeline`

This pipeline saves items to a CSV file, enabled with the `pipeline-csv` feature. The CSV headers are automatically inferred from the fields of the first item scraped.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `pipeline-csv` feature in Cargo.toml
use spider_lib::pipelines::csv_exporter::CsvExporterPipeline;

// ... inside your main async function
let csv_pipeline = CsvExporterPipeline::new("output/items.csv")?;

let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    .add_pipeline(csv_pipeline)
    // ... configure other middlewares
    .build()
    .await?;
```

#### `SqliteWriterPipeline`

This pipeline saves items to a SQLite database, enabled with the `pipeline-sqlite` feature. The table schema is automatically inferred from the fields of the first item scraped.

```rust,no_run
use spider_lib::prelude::*;
// Make sure to enable the `pipeline-sqlite` feature in Cargo.toml
use spider_lib::pipelines::sqlite_writer::SqliteWriterPipeline;

// ... inside your main async function
let sqlite_pipeline = SqliteWriterPipeline::new("output/items.db", "scraped_data")?;

let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    .add_pipeline(sqlite_pipeline)
    // ... configure other middlewares
    .build()
    .await?;
```

### Crawler Settings

You can configure the core behavior of the crawler, such as concurrency and checkpointing.

#### Checkpointing & Resuming

This feature allows a crawl to be paused and resumed later. When the crawler starts, it loads the saved state from the checkpoint file if one exists. Checkpointing is enabled via the `checkpoint` feature flag.

```rust,no_run
use spider_lib::prelude::*;
use std::time::Duration;

// ... inside your main async function
let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    // Set the path to save/load the checkpoint file
    .with_checkpoint_path("output/my_crawl.checkpoint")
    // Automatically save the state every 10 minutes
    .with_checkpoint_interval(Duration::from_secs(60 * 10))
    // ... configure your other middlewares, and pipelines
    .build()
    .await?;
```

#### Concurrency

You can control the parallelism of different parts of the crawl to manage system resources and target server load.

```rust,no_run
use spider_lib::prelude::*;

// ... inside your main async function
let crawler = CrawlerBuilder::new(YourSpider) // Assumes `YourSpider` is a defined Spider
    // Set the maximum number of concurrent downloads
    .max_concurrent_downloads(10)
    // Set the number of CPU workers for parsing responses
    .max_parser_workers(4)
    // Set the maximum number of items to be processed by pipelines concurrently
    .max_concurrent_pipelines(20)
    // ... configure your other middlewares, and pipelines
    .build()
    .await?;
```


## Feature Flags

`spider-lib` uses feature flags to keep the core library lightweight while allowing for optional functionality. To use a feature, add it to your `Cargo.toml`.

| Feature Flag | Enables | Description |
| :--- | :--- | :--- |
| **Pipelines** | | |
| `pipeline-json` | `JsonWriterPipeline`, `JsonlWriterPipeline` | Saves items to `.json` or `.jsonl` files. |
| `pipeline-csv` | `CsvExporterPipeline` | Saves items to a `.csv` file. |
| `pipeline-sqlite`| `SqliteWriterPipeline` | Saves items to a SQLite database. |
| **Middlewares** | | |
| `middleware-cookies` | `CookieMiddleware` | Manages cookies and sessions across requests. |
| `middleware-http-cache` | `HttpCacheMiddleware` | Caches HTTP responses to disk to speed up development. |
| `middleware-robots-txt` | `RobotsTxtMiddleware` | Respects `robots.txt` rules for websites. |
| **Core** | | |
| `checkpoint` | Checkpointing System | Enables saving and resuming crawl state. |

Example of enabling multiple features:

```toml
[dependencies]
spider-lib = { version = "0.4", features = ["pipeline-json", "middleware-http-cache", "checkpoint"] }
```