spider-lib
A Rust-based web scraping framework inspired by Scrapy.
spider-lib is an asynchronous, concurrent web scraping library for Rust. It's designed to be a lightweight yet powerful tool for building and running scrapers for projects of any size. If you're familiar with Scrapy's architecture of Spiders, Middlewares, and Pipelines, you'll feel right at home.
Getting Started
To use spider-lib, add it to your project's Cargo.toml:
```toml
[dependencies]
spider-lib = "0.2" # Check crates.io for the latest version
```
Quick Example
Here's a minimal example of a spider that scrapes quotes from quotes.toscrape.com.
For convenience, spider-lib offers a prelude that re-exports the most commonly used items.
```rust
// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::ToSelector; // ToSelector is not in the prelude

// The spider definition and the async main that drives the crawl are omitted
// here; see the quotes example in the examples/ directory for the complete,
// runnable version (`cargo run --example quotes`).
```
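Since the full spider is omitted above, here is a self-contained sketch of just the parsing step: pulling quote text and authors out of quotes.toscrape.com-style markup with CSS selectors. It uses the standalone `scraper` crate purely for illustration; inside a real spider you would use spider-lib's own selector support (note `ToSelector` above), and the quotes example shows the actual API.

```rust
use scraper::{Html, Selector};

fn main() {
    // A trimmed-down snippet of the markup served by quotes.toscrape.com.
    let html = r#"
        <div class="quote">
            <span class="text">“The world as we have created it is a process of our thinking.”</span>
            <small class="author">Albert Einstein</small>
        </div>
    "#;

    let document = Html::parse_document(html);
    let quote_sel = Selector::parse("div.quote").unwrap();
    let text_sel = Selector::parse("span.text").unwrap();
    let author_sel = Selector::parse("small.author").unwrap();

    // Iterate over each quote block and collect the text of its sub-elements.
    for quote in document.select(&quote_sel) {
        let text: String = quote.select(&text_sel).flat_map(|e| e.text()).collect();
        let author: String = quote.select(&author_sel).flat_map(|e| e.text()).collect();
        println!("{author}: {text}");
    }
}
```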
Features
- Asynchronous & Concurrent: `spider-lib` provides a high-performance, asynchronous web scraping framework built on `tokio`, leveraging an actor-like concurrency model for efficient task handling.
- Graceful Shutdown: Ensures clean termination on `Ctrl+C`, allowing in-flight tasks to complete and flushing all data.
- Checkpoint and Resume: Saves the crawler's state (scheduler, pipelines) to a file so the crawl can be resumed later, with both manual and periodic automatic saves. This includes salvaging unprocessed requests.
- Request Deduplication: Uses request fingerprinting to prevent duplicate requests from being processed, avoiding redundant work (a library-agnostic sketch of the idea follows this list).
- Familiar Architecture: A modular design built around Spiders, Middlewares, and Item Pipelines, drawing inspiration from Scrapy.
- Configurable Concurrency: Fine-grained control over the number of concurrent downloads, parsing workers, and pipeline processors for tuned performance.
- Advanced Link Extraction: A powerful `Response` method that extracts, resolves, and categorizes the various kinds of links found in HTML content.
- Fluent Configuration: A `CrawlerBuilder` API simplifies assembling and configuring your web crawler.
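To make the deduplication point concrete, the sketch below shows request fingerprinting in its generic form: hash the parts of a request that define its identity (here just the HTTP method and URL) and schedule a request only if its fingerprint has not been seen before. This is a library-agnostic illustration, not spider-lib's actual fingerprinting scheme or internal types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Compute a stable fingerprint from the parts of a request that define its
// identity. Real schemes may also fold in headers or the request body.
fn fingerprint(method: &str, url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    method.hash(&mut hasher);
    url.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let mut seen = HashSet::new();
    let requests = [
        ("GET", "https://quotes.toscrape.com/page/1/"),
        ("GET", "https://quotes.toscrape.com/page/2/"),
        ("GET", "https://quotes.toscrape.com/page/1/"), // duplicate
    ];

    for (method, url) in requests {
        // insert() returns false if the fingerprint was already present,
        // so the duplicate request is skipped instead of scheduled again.
        if seen.insert(fingerprint(method, url)) {
            println!("scheduling {method} {url}");
        } else {
            println!("skipping duplicate {method} {url}");
        }
    }
}
```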
Built-in Middlewares
- Rate Limiting: Controls request rates to prevent server overload, with support for adaptive and fixed-rate (token bucket) strategies; a generic sketch of the token-bucket idea follows this list.
- Retries: Automatically retries failed or timed-out requests with configurable delays.
- User-Agent Rotation: Manages and rotates user agents for robust scraping.
- HTTP Caching: Caches responses to speed up development and reduce network load.
- Respect Robots.txt: Adheres to `robots.txt` rules to avoid disallowed paths.
- Referer Management: Handles the `Referer` header to mimic browser behavior or enforce specific policies.
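For the fixed-rate strategy mentioned above, a token bucket refills at a steady rate up to a fixed capacity, and a request may proceed only if it can take a token: the capacity bounds burst size while the refill rate bounds the sustained request rate. The sketch below is a generic, standalone illustration of that idea, not the middleware's actual implementation or configuration API.

```rust
use std::time::Instant;

// Minimal token bucket: at most `capacity` tokens, refilled at `rate` tokens
// per second. A request may proceed only when a whole token is available.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, rate: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last_refill: Instant::now() }
    }

    // Refill based on elapsed time, then try to spend one token.
    fn try_acquire(&mut self) -> bool {
        let elapsed = self.last_refill.elapsed().as_secs_f64();
        self.last_refill = Instant::now();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Allow bursts of up to 5 requests, refilling at 2 requests per second.
    let mut bucket = TokenBucket::new(5.0, 2.0);
    for i in 0..8 {
        println!("request {i}: allowed = {}", bucket.try_acquire());
    }
}
```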
Built-in Item Pipelines
- Exporters: Save scraped items in `JSON`, `JSON Lines`, `CSV`, and `SQLite` formats (a small sketch of JSON Lines output follows this list).
- Deduplication: Filters out duplicate items based on a configurable key.
- Console Writer: A simple pipeline for printing items to the console during development.
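To show what JSON Lines output looks like, the sketch below serializes items with `serde`/`serde_json` and writes one JSON object per line. It only illustrates the format; it is not the built-in exporter pipeline's code, and the `Quote` struct and file name are made up for the example.

```rust
use serde::Serialize;
use std::fs::File;
use std::io::{BufWriter, Write};

// A scraped item; the field names here are just for illustration.
#[derive(Serialize)]
struct Quote {
    text: String,
    author: String,
}

fn main() -> std::io::Result<()> {
    let items = vec![
        Quote { text: "To be, or not to be.".into(), author: "William Shakespeare".into() },
        Quote { text: "I think, therefore I am.".into(), author: "René Descartes".into() },
    ];

    // JSON Lines: one JSON object per line, which makes the file easy to
    // append to as items arrive and easy to stream back later.
    let mut out = BufWriter::new(File::create("quotes.jl")?);
    for item in &items {
        writeln!(out, "{}", serde_json::to_string(item).expect("item serializes"))?;
    }
    Ok(())
}
```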
For complete, runnable examples, please refer to the `examples/` directory in this repository. You can run an example with `cargo run --example <example_name>`, for instance: `cargo run --example quotes`.
Feature Flags
spider-lib uses feature flags to keep the core library lightweight while allowing for optional functionality.
- `pipeline-sqlite`: Enables the `SqliteWriterPipeline` for exporting items to a SQLite database.
To enable a feature, add it to your Cargo.toml dependency:
```toml
[dependencies]
spider-lib = { version = "0.2", features = ["pipeline-sqlite"] }
```