Crate spider_lib

§spider-lib

A Rust-based web scraping framework inspired by Scrapy.

spider-lib is an asynchronous, concurrent web scraping library for Rust, designed to stay lightweight while serving projects of any size. It follows Scrapy’s architecture of Spiders, Middlewares, and Pipelines, so the concepts carry over directly if you have used Scrapy.

§Getting Started

To use spider-lib, add it to your project’s Cargo.toml:

[dependencies]
spider-lib = "0.2" # Check crates.io for the latest version

§Quick Example

Here’s a minimal example of a spider that scrapes quotes from quotes.toscrape.com.

For convenience, spider-lib offers a prelude that re-exports the most commonly used items.

// Use the prelude for easy access to common types and traits.
use spider_lib::prelude::*;
use spider_lib::utils::ToSelector; // ToSelector is not in the prelude

#[scraped_item]
pub struct QuoteItem {
    pub text: String,
    pub author: String,
}

pub struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = QuoteItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["http://quotes.toscrape.com/"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
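        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        // Parse each CSS selector once instead of on every loop iteration.
        let quote_sel = ".quote".to_selector()?;
        let text_sel = ".text".to_selector()?;
        let author_sel = ".author".to_selector()?;
        let next_sel = ".next > a[href]".to_selector()?;

        for quote in html.select(&quote_sel) {
            // Take the first match and join its text nodes; fall back to an empty string.
            let text = quote
                .select(&text_sel)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            let author = quote
                .select(&author_sel)
                .next()
                .map(|e| e.text().collect())
                .unwrap_or_default();
            output.add_item(QuoteItem { text, author });
        }

        // Follow the pagination link, resolved against the current page's URL.
        if let Some(next_href) = html
            .select(&next_sel)
            .next()
            .and_then(|a| a.attr("href"))
        {
            let next_url = response.url.join(next_href)?;
            output.add_request(Request::new(next_url));
        }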

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    tracing_subscriber::fmt().with_max_level(tracing::Level::INFO).init();

    // The builder defaults to using ReqwestClientDownloader
    let crawler = CrawlerBuilder::<_, ReqwestClientDownloader>::new(QuotesSpider)
        .build()
        .await?;

    crawler.start_crawl().await?;

    Ok(())
}

§Features

Asynchronous & Concurrent: Built on tokio, with an actor-like concurrency model for handling tasks.
Graceful Shutdown: On Ctrl+C, lets in-flight tasks complete and flushes all data before terminating.
Checkpoint and Resume: Saves the crawler’s state (scheduler, pipelines) to a file so the crawl can be resumed later. Saves can be triggered manually or run periodically, and unprocessed requests are salvaged.
Request Deduplication: Fingerprints requests so that duplicates are not processed twice.
Familiar Architecture: A modular design of Spiders, Middlewares, and Item Pipelines, inspired by Scrapy.
Configurable Concurrency: Control over the number of concurrent downloads, parsing workers, and pipeline processing tasks.
Advanced Link Extraction: A Response method extracts, resolves, and categorizes the links found in HTML content.
Fluent Configuration: The CrawlerBuilder API assembles and configures the crawler.
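
To make the Request Deduplication entry above concrete, here is a minimal, standard-library-only sketch of fingerprint-based filtering. It is not spider-lib's implementation: `SketchRequest`, `fingerprint`, and `SeenFilter` are hypothetical names, and the choice of hashed fields (method, URL, body) is an assumption made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a request; NOT spider-lib's `Request` type.
#[derive(Debug, Clone)]
pub struct SketchRequest {
    pub method: String,
    pub url: String,
    pub body: Vec<u8>,
}

/// Hashes the parts of a request that identify it (an assumed choice of fields).
/// The method is upper-cased so that "get" and "GET" produce the same fingerprint.
pub fn fingerprint(req: &SketchRequest) -> u64 {
    let mut hasher = DefaultHasher::new();
    req.method.to_ascii_uppercase().hash(&mut hasher);
    req.url.hash(&mut hasher);
    req.body.hash(&mut hasher);
    hasher.finish()
}

/// Remembers the fingerprints of requests that were already accepted.
#[derive(Default)]
pub struct SeenFilter {
    seen: HashSet<u64>,
}

impl SeenFilter {
    /// Returns `true` the first time a request is seen and `false` for repeats.
    pub fn insert(&mut self, req: &SketchRequest) -> bool {
        self.seen.insert(fingerprint(req))
    }
}
```

A scheduler built this way would enqueue a request only when `insert` returns `true`. A production fingerprint would typically also canonicalize the URL (for example, sorting query parameters), which this sketch leaves out.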

§Built-in Middlewares

The following middlewares are included by default:

Rate Limiting: Controls request rates to prevent server overload.
Retries: Automatically retries failed or timed-out requests.
User-Agent Rotation: Manages and rotates user agents.
Referer Management: Handles the Referer header.
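
As an illustration of what the Rate Limiting entry above could mean in practice, the following sketch enforces a minimum gap between consecutive requests. It uses only the standard library and does not reflect how RateLimitMiddleware is implemented or configured; `IntervalLimiter` and `reserve` are hypothetical names. The current time is passed in explicitly so the logic can be checked without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical limiter that enforces a minimum gap between requests.
pub struct IntervalLimiter {
    min_gap: Duration,
    /// Earliest instant at which the next request may be sent.
    next_allowed: Option<Instant>,
}

impl IntervalLimiter {
    pub fn new(min_gap: Duration) -> Self {
        Self { min_gap, next_allowed: None }
    }

    /// Reserves the next send slot and returns how long the caller
    /// should wait (measured from `now`) before sending.
    pub fn reserve(&mut self, now: Instant) -> Duration {
        // The slot is `now` unless an earlier reservation pushed it into the future.
        let slot = match self.next_allowed {
            Some(t) if t > now => t,
            _ => now,
        };
        self.next_allowed = Some(slot + self.min_gap);
        slot - now
    }
}
```

In an async crawler the returned duration would be handed to a timer such as `tokio::time::sleep` before the request is dispatched.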

Additional middlewares are available via feature flags:

HTTP Caching: Caches responses to accelerate development (middleware-http-cache).
Respect Robots.txt: Adheres to robots.txt rules (middleware-robots-txt).
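
The Retries middleware listed earlier retries failed or timed-out requests. One common way to space such retries is capped exponential backoff; the sketch below shows that technique under the assumption that it is a reasonable policy, without claiming it is what RetryMiddleware does. `RetryPolicy` and its fields are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical retry policy; the field names are illustrative only.
pub struct RetryPolicy {
    pub max_retries: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl RetryPolicy {
    /// Returns the delay before retry number `attempt` (1-based),
    /// or `None` when the attempt is outside the retry budget.
    pub fn delay_for(&self, attempt: u32) -> Option<Duration> {
        if attempt == 0 || attempt > self.max_retries {
            return None;
        }
        // Double the delay on each attempt: base, 2*base, 4*base, ...
        let factor = 2u32.saturating_pow(attempt - 1);
        let delay = self.base_delay.saturating_mul(factor);
        // Never wait longer than the configured ceiling.
        Some(delay.min(self.max_delay))
    }
}
```

With a base delay of 250 ms and a ceiling of 800 ms, the first three retries wait 250 ms, 500 ms, and 800 ms.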

§Built-in Item Pipelines

The following pipelines are included by default:

Deduplication: Filters out duplicate items based on a configurable key.
Console Writer: A simple pipeline for printing items to the console.
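
To show what "duplicate items based on a configurable key" can look like, here is a small standard-library sketch in which the key is supplied as a closure. It does not use or mirror the DeduplicationPipeline API; `KeyDeduplicator` and `SketchQuote` are hypothetical names introduced for this example.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical item type, shaped like the QuoteItem from the Quick Example.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchQuote {
    pub text: String,
    pub author: String,
}

/// Drops items whose key has already been seen.
/// The key function decides which part of the item defines "duplicate".
pub struct KeyDeduplicator<T, K> {
    key_fn: Box<dyn Fn(&T) -> K>,
    seen: HashSet<K>,
}

impl<T, K: Eq + Hash> KeyDeduplicator<T, K> {
    pub fn new(key_fn: impl Fn(&T) -> K + 'static) -> Self {
        Self { key_fn: Box::new(key_fn), seen: HashSet::new() }
    }

    /// Returns `Some(item)` for the first item with a given key, `None` for repeats.
    pub fn process(&mut self, item: T) -> Option<T> {
        let key = (self.key_fn)(&item);
        if self.seen.insert(key) {
            Some(item)
        } else {
            None
        }
    }
}
```

For example, `KeyDeduplicator::new(|q: &SketchQuote| q.text.clone())` treats two quotes with the same text as duplicates even when their authors differ.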

Exporter pipelines are available via feature flags:

JSON / JSON Lines: Saves items to .json or .jsonl files (pipeline-json).
CSV: Saves items to .csv files (pipeline-csv).
SQLite: Saves items to a SQLite database (pipeline-sqlite).

For complete, runnable examples, please refer to the examples/ directory in this repository. You can run an example using cargo run --example <example_name> --features <features>, for instance: cargo run --example quotes --features "pipeline-json".

§Feature Flags

spider-lib uses feature flags to keep the core library lightweight while allowing for optional functionality. To use a feature, add it to your Cargo.toml.

pipeline-json: Enables JsonWriterPipeline and JsonlWriterPipeline.
pipeline-csv: Enables CsvExporterPipeline.
pipeline-sqlite: Enables SqliteWriterPipeline.
middleware-http-cache: Enables HttpCacheMiddleware for disk-based response caching.
middleware-robots-txt: Enables RobotsTxtMiddleware for respecting robots.txt files.
checkpoint: Enables crawler checkpointing for saving and resuming crawls.
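
For readers unfamiliar with checkpointing, the idea behind the checkpoint feature can be sketched with the standard library alone: persist the pending request queue to a file, then load it back on restart. This is a deliberately simplified illustration, not the crate's checkpoint format or API (the real feature also covers pipeline state); `save_checkpoint` and `load_checkpoint` are hypothetical helpers, and the one-URL-per-line format is an assumption.

```rust
use std::collections::VecDeque;
use std::fs;
use std::io;
use std::path::Path;

/// Writes each pending URL on its own line.
pub fn save_checkpoint(path: &Path, pending: &VecDeque<String>) -> io::Result<()> {
    let mut contents = String::new();
    for url in pending {
        contents.push_str(url);
        contents.push('\n');
    }
    // Write to a sibling temporary file and rename it into place, so an
    // interrupted save does not leave a truncated checkpoint behind.
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, contents)?;
    fs::rename(&tmp, path)
}

/// Reads the pending URLs back in their original order.
pub fn load_checkpoint(path: &Path) -> io::Result<VecDeque<String>> {
    let contents = fs::read_to_string(path)?;
    Ok(contents
        .lines()
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect())
}
```

A periodic save would call `save_checkpoint` on a timer, and a resumed crawl would seed its scheduler from `load_checkpoint` instead of the spider's start URLs.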

Example of enabling multiple features:

[dependencies]
spider-lib = { version = "0.3", features = ["pipeline-json", "middleware-http-cache"] }

Re-exports§

pub use downloader::Downloader;
pub use middleware::Middleware;
pub use pipeline::Pipeline;
pub use builder::CrawlerBuilder;
pub use crawler::Crawler;
pub use error::PipelineError;
pub use error::SpiderError;
pub use item::ParseOutput;
pub use item::ScrapedItem;
pub use request::Request;
pub use response::Response;
pub use scheduler::Scheduler;
pub use spider::Spider;
pub use downloaders::reqwest_client::ReqwestClientDownloader;
pub use middlewares::rate_limit::RateLimitMiddleware;
pub use middlewares::referer::RefererMiddleware;
pub use middlewares::retry::RetryMiddleware;
pub use middlewares::user_agent::UserAgentMiddleware;
pub use pipelines::console_writer::ConsoleWriterPipeline;
pub use pipelines::deduplication::DeduplicationPipeline;
pub use tokio;

Modules§

builder: Builder for constructing and configuring the Crawler instance.
crawler: The core Crawler implementation for the spider-lib framework.
downloader: Traits for defining and implementing HTTP downloaders in spider-lib.
downloaders: Module for spider-lib downloader implementations.
error: Custom error types for the spider-lib framework.
item: Data structures for scraped items and spider output in spider-lib.
middleware: Core Middleware trait and related types for the spider-lib framework.
middlewares: Module for spider-lib middleware implementations.
pipeline: Trait for defining item processing pipelines in spider-lib.
pipelines: Module for spider-lib item pipeline implementations.
prelude: A “prelude” for users of the spider-lib crate.
request: Data structures for representing HTTP requests in spider-lib.
response: Data structures and utilities for handling HTTP responses in spider-lib.
scheduler: Request Scheduler for managing the crawling frontier.
spider: Trait for defining custom web spiders in the spider-lib framework.
state: Module for tracking the operational state of the crawler.
utils: General utility functions and helper traits for the spider-lib framework.

Structs§

DashMap: DashMap is an implementation of a concurrent associative array/hashmap in Rust.
SchedulerCheckpoint

Attribute Macros§

async_trait
scraped_item: A procedural macro to derive the ScrapedItem trait.
