# spider-lib
A Rust-based web scraping framework inspired by Scrapy.
[Crates.io](https://crates.io/crates/spider-lib) · [Documentation](https://docs.rs/spider-lib) · [License: MIT](https://opensource.org/licenses/MIT) · [CI](https://github.com/mzyui/spider-lib/actions/workflows/rust.yml)
`spider-lib` is an asynchronous, concurrent web scraping library for Rust. It's designed to be a lightweight yet powerful tool for building and running scrapers for projects of any size. If you're familiar with Scrapy's architecture of Spiders, Middlewares, and Pipelines, you'll feel right at home.
## Getting Started
To use `spider-lib`, add it to your project's `Cargo.toml`:
```toml
[dependencies]
spider-lib = "0.2" # Check crates.io for the latest version
```
## Quick Example
Here's a minimal example of a spider that scrapes quotes from `quotes.toscrape.com`.
For convenience, `spider-lib` offers a prelude that re-exports the most commonly used items.
```rust
// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::utils::ToSelector; // ToSelector is not in the prelude

#[scraped_item]
pub struct QuoteItem {
    pub text: String,
    pub author: String,
}

pub struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = QuoteItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["http://quotes.toscrape.com/"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        // Yield one item per quote block on the page.
        for quote in html.select(&".quote".to_selector()?) {
            let text = quote
                .select(&".text".to_selector()?)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            let author = quote
                .select(&".author".to_selector()?)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            output.add_item(QuoteItem { text, author });
        }

        // Follow the "next page" link, if there is one.
        if let Some(next_href) = html.select(&".next > a[href]".to_selector()?).next().and_then(|a| a.attr("href")) {
            let next_url = response.url.join(next_href)?;
            output.add_request(Request::new(next_url));
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    tracing_subscriber::fmt().with_max_level(tracing::Level::INFO).init();

    // The builder defaults to using ReqwestClientDownloader.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```
## Features
* **Asynchronous & Concurrent:** A high-performance scraping framework built on `tokio`, using an actor-like concurrency model for efficient task handling.
* **Graceful Shutdown:** Ensures clean termination on `Ctrl+C`, allowing in-flight tasks to complete and flushing all data.
* **Checkpoint and Resume:** Saves the crawler's state (scheduler, pipelines) to a file so a crawl can be resumed later, with both manual and periodic automatic saves; unprocessed requests are salvaged as part of the checkpoint.
* **Request Deduplication:** Uses request fingerprinting to prevent duplicate requests from being processed, avoiding redundant work.
* **Familiar Architecture:** A modular design of Spiders, Middlewares, and Item Pipelines, drawing inspiration from Scrapy.
* **Configurable Concurrency:** Fine-grained control over the number of concurrent downloads, parsing workers, and pipeline processing.
* **Advanced Link Extraction:** The `Response` type can extract, resolve, and categorize the various kinds of links found in HTML content.
* **Fluent Configuration:** A `CrawlerBuilder` API simplifies assembling and configuring your crawler (see the configuration sketch after this list).
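To give a feel for how concurrency and checkpointing might be tuned through the builder, here is a minimal sketch that reuses the `QuotesSpider` from the Quick Example. The method names `concurrent_downloads`, `parse_workers`, and `checkpoint_every` are illustrative placeholders, not confirmed API; check [docs.rs/spider-lib](https://docs.rs/spider-lib) for the builder options your version actually exposes.
```rust
use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // NOTE: the three tuning methods below are hypothetical names used for
    // illustration only; consult the crate documentation for the real ones.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .concurrent_downloads(8)                              // hypothetical: cap simultaneous fetches
        .parse_workers(4)                                      // hypothetical: number of parsing workers
        .checkpoint_every(std::time::Duration::from_secs(60))  // hypothetical: periodic state saves
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```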
#### Built-in Middlewares
* **Rate Limiting:** Controls request rates to prevent server overload, with support for adaptive and fixed-rate (token bucket) strategies.
* **Retries:** Automatically retries failed or timed-out requests with configurable delays.
* **User-Agent Rotation:** Manages and rotates user agents for robust scraping.
* **HTTP Caching:** Caches responses to accelerate development and reduce network load.
* **Respect Robots.txt:** Adheres to `robots.txt` rules to avoid disallowed paths.
* **Referer Management:** Handles the `Referer` header to mimic browser behavior or enforce specific policies.
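The middlewares above are attached when the crawler is assembled. The sketch below is illustrative only: the `with_middleware` builder method and the `RetryMiddleware` / `RateLimitMiddleware` type names are assumptions, not confirmed API, so check [docs.rs/spider-lib](https://docs.rs/spider-lib) for the middleware types this crate actually ships and how they are registered.
```rust
use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // Hypothetical names below (`with_middleware`, `RetryMiddleware`,
    // `RateLimitMiddleware`) -- see the crate docs for the real ones.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_middleware(RetryMiddleware::new(3))             // assumed: retry failed requests up to 3 times
        .with_middleware(RateLimitMiddleware::per_second(2))  // assumed: throttle to ~2 requests per second
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```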
#### Built-in Item Pipelines
* **Exporters:** Supports saving scraped data to `JSON`, `JSON Lines`, `CSV`, and `SQLite` formats.
* **Deduplication:** Filters out duplicate items based on a configurable key.
* **Console Writer:** Provides a simple pipeline for printing items to the console during development.
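Item pipelines receive each scraped item after parsing and are also wired in through the builder. The sketch below is assumption-heavy: `with_pipeline` and `JsonLinesWriterPipeline` are placeholder names (only `SqliteWriterPipeline` is named elsewhere in this README), so consult the documentation for the concrete exporter types and their constructors.
```rust
use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // `with_pipeline` and `JsonLinesWriterPipeline` are placeholder names
    // used for illustration; check docs.rs/spider-lib for the real exporters.
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .with_pipeline(JsonLinesWriterPipeline::new("quotes.jsonl")) // assumed: write one JSON object per line
        .build()
        .await?;

    crawler.start_crawl().await?;
    Ok(())
}
```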
For complete, runnable examples, please refer to the `examples/` directory in this repository. You can run an example using `cargo run --example <example_name>`, for instance: `cargo run --example quotes`.
## Feature Flags
`spider-lib` uses feature flags to keep the core library lightweight while allowing for optional functionality.
* `pipeline-sqlite`: Enables the `SqliteWriterPipeline` for exporting items to a SQLite database.
To enable a feature, add it to your `Cargo.toml` dependency:
```toml
[dependencies]
spider-lib = { version = "0.2", features = ["pipeline-sqlite"] }
```