# spider-lib
A Rust-based web scraping framework inspired by Scrapy.
[Crates.io](https://crates.io/crates/spider-lib) | [Documentation](https://docs.rs/spider-lib) | [License: MIT](https://opensource.org/licenses/MIT) | [CI](https://github.com/mzyui/spider-lib/actions/workflows/rust.yml)
`spider-lib` is an asynchronous, concurrent web scraping library for Rust. It's designed to be a lightweight yet powerful tool for building and running scrapers for projects of any size. If you're familiar with Scrapy's architecture of Spiders, Middlewares, and Pipelines, you'll feel right at home.
## Getting Started
To use `spider-lib`, add it to your project's `Cargo.toml`:
```toml
[dependencies]
spider-lib = "0.3" # Check crates.io for the latest version
```
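The Quick Example below also calls into `tokio` and `tracing`/`tracing-subscriber` directly, so a project that runs it as-is needs those dependencies as well. A minimal manifest sketch (the version numbers are illustrative, not requirements of `spider-lib`):
```toml
[dependencies]
spider-lib = "0.3"
# The example uses #[tokio::main], so the runtime and macro features are needed.
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
# Only needed for the logging setup shown in the example's main().
tracing = "0.1"
tracing-subscriber = "0.3"
```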
## Quick Example
Here's a minimal example of a spider that scrapes quotes from `quotes.toscrape.com`.
For convenience, `spider-lib` offers a prelude that re-exports the most commonly used items.
```rust
// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::utils::ToSelector; // ToSelector is not in the prelude

#[scraped_item]
pub struct QuoteItem {
    pub text: String,
    pub author: String,
}

pub struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = QuoteItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["http://quotes.toscrape.com/"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        for quote in html.select(&".quote".to_selector()?) {
            let text = quote
                .select(&".text".to_selector()?)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            let author = quote
                .select(&".author".to_selector()?)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            output.add_item(QuoteItem { text, author });
        }

        if let Some(next_href) = html
            .select(&".next > a[href]".to_selector()?)
            .next()
            .and_then(|a| a.attr("href"))
        {
            let next_url = response.url.join(next_href)?;
            output.add_request(Request::new(next_url));
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::INFO)
        .init();

    // The builder defaults to using ReqwestClientDownloader.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```
## Features
* **Asynchronous & Concurrent:** Built on `tokio` with an actor-like concurrency model for efficient, high-throughput task handling.
* **Graceful Shutdown:** Ensures clean termination on `Ctrl+C`, allowing in-flight tasks to complete and flushing all data.
* **Checkpoint and Resume:** Saves the crawler's state (scheduler, pipelines) to a file so a crawl can be resumed later, with both manual and periodic automatic saves; unprocessed requests are salvaged as part of the checkpoint.
* **Request Deduplication:** Uses request fingerprinting to skip duplicate requests and avoid redundant work.
* **Familiar Architecture:** Leverages a modular design with Spiders, Middlewares, and Item Pipelines, drawing inspiration from Scrapy.
* **Configurable Concurrency:** Fine-grained control over the number of concurrent downloads, parsing workers, and pipeline tasks (see the sketch after this list).
* **Advanced Link Extraction:** The `Response` object can extract, resolve, and categorize the links found in HTML content.
* **Fluent Configuration:** A `CrawlerBuilder` API simplifies the assembly and configuration of your web crawler.
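Concurrency and checkpoint settings are configured on the builder. The sketch below reuses `QuotesSpider` from the Quick Example; the builder method names (`with_concurrent_downloads`, `with_parser_workers`, `with_checkpoint_file`, `with_checkpoint_interval`) are illustrative assumptions rather than confirmed API, so check [docs.rs](https://docs.rs/spider-lib) for the actual names.
```rust
use spider_lib::prelude::*;
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // NOTE: the builder methods below are illustrative assumptions; consult the
    // crate docs for the real configuration API. Checkpointing also requires the
    // `checkpoint` feature flag.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_concurrent_downloads(8)                      // hypothetical: cap parallel downloads
        .with_parser_workers(4)                            // hypothetical: parsing worker count
        .with_checkpoint_file("crawl.checkpoint")          // hypothetical: where to persist state
        .with_checkpoint_interval(Duration::from_secs(60)) // hypothetical: periodic auto-save
        .build()
        .await?;
    crawler.start_crawl().await?;
    Ok(())
}
```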
### Built-in Middlewares
The following middlewares are included by default:
* **Rate Limiting:** Controls request rates to prevent server overload.
* **Retries:** Automatically retries failed or timed-out requests.
* **User-Agent Rotation:** Manages and rotates user agents.
* **Referer Management:** Handles the `Referer` header.
Additional middlewares are available via feature flags:
* **HTTP Caching:** Caches responses to accelerate development (`middleware-http-cache`); see the sketch after this list.
* **Respect Robots.txt:** Adheres to `robots.txt` rules (`middleware-robots-txt`).
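As an illustration, an optional middleware would be attached while assembling the crawler (again reusing `QuotesSpider` from the Quick Example). The `with_middleware` method and the `HttpCacheMiddleware::new` constructor in this sketch are assumptions made for the example; the actual builder API may differ.
```rust
// Requires the `middleware-http-cache` feature flag in Cargo.toml.
use spider_lib::prelude::*;

async fn build_with_cache() -> Result<(), SpiderError> {
    // NOTE: `with_middleware` and this HttpCacheMiddleware constructor are
    // illustrative assumptions; check the crate docs for the real signatures.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_middleware(HttpCacheMiddleware::new(".http-cache"))
        .build()
        .await?;
    crawler.start_crawl().await?;
    Ok(())
}
```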
### Built-in Item Pipelines
The following pipelines are included by default:
* **Deduplication:** Filters out duplicate items based on a configurable key.
* **Console Writer:** A simple pipeline for printing items to the console.
Exporter pipelines are available via feature flags:
* **JSON / JSON Lines:** Saves items to `.json` or `.jsonl` files (`pipeline-json`); see the sketch after this list.
* **CSV:** Saves items to `.csv` files (`pipeline-csv`).
* **SQLite:** Saves items to a SQLite database (`pipeline-sqlite`).
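Exporter pipelines are likewise added on the builder. The sketch below (once more reusing `QuotesSpider`) assumes a `with_pipeline` method and a `JsonlWriterPipeline::new(path)` constructor; both names are illustrative, so consult the crate docs and the `examples/` directory for the real API.
```rust
// Requires the `pipeline-json` feature flag in Cargo.toml.
use spider_lib::prelude::*;

async fn build_with_jsonl_export() -> Result<(), SpiderError> {
    // NOTE: `with_pipeline` and this JsonlWriterPipeline constructor are
    // illustrative assumptions; check the crate docs for the real builder API.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_pipeline(JsonlWriterPipeline::new("quotes.jsonl"))
        .build()
        .await?;
    crawler.start_crawl().await?;
    Ok(())
}
```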
For complete, runnable examples, please refer to the `examples/` directory in this repository. You can run an example using `cargo run --example <example_name> --features <features>`, for instance: `cargo run --example quotes --features "pipeline-json"`.
## Feature Flags
`spider-lib` uses feature flags to keep the core library lightweight while allowing for optional functionality. To use a feature, add it to your `Cargo.toml`.
* `pipeline-json`: Enables `JsonWriterPipeline` and `JsonlWriterPipeline`.
* `pipeline-csv`: Enables `CsvExporterPipeline`.
* `pipeline-sqlite`: Enables `SqliteWriterPipeline`.
* `middleware-http-cache`: Enables `HttpCacheMiddleware` for disk-based response caching.
* `middleware-robots-txt`: Enables `RobotsTxtMiddleware` for respecting `robots.txt` files.
* `checkpoint`: Enables crawler checkpointing for saving and resuming crawls.
Example of enabling multiple features:
```toml
[dependencies]
spider-lib = { version = "0.3", features = ["pipeline-json", "middleware-http-cache"] }
```