spider-lib 1.2.1

A Rust-based web scraping framework inspired by Scrapy (Python).

Overview

spider-lib is a comprehensive web scraping framework written in Rust that takes inspiration from the popular Python Scrapy framework. It provides a robust, efficient, and extensible platform for building web crawlers and scrapers with features like asynchronous processing, middleware support, and flexible data pipelines.

Features

  • Asynchronous Processing: Built with async/await for efficient concurrent crawling
  • Modular Architecture: Clean separation of concerns with multiple specialized crates
  • Middleware Support: Extensible middleware system for customizing request/response handling
  • Flexible Pipelines: Multiple output options for scraped data (JSON, CSV, SQLite, etc.)
  • Rate Limiting: Built-in rate limiting to respect website policies
  • Retry Mechanisms: Automatic retry for failed requests
  • User-Agent Rotation: Automatic rotation of user agents
  • Cookie Management: Persistent cookie handling across requests
  • HTTP Caching: Development-friendly caching capabilities
  • Robots.txt Compliance: Automatic adherence to robots.txt rules
  • Proxy Support: Configurable proxy server usage
  • Deduplication: Built-in duplicate detection and filtering
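The deduplication behavior above can be illustrated with a minimal, self-contained sketch. This is not spider-lib's internal implementation, only the underlying idea: a filter that remembers every URL it has already seen and rejects repeats.

```rust
use std::collections::HashSet;

/// Minimal duplicate filter: remembers every URL offered to it and
/// reports whether a candidate is new. Illustrative only; spider-lib's
/// real filter is internal to the crawler.
struct DupeFilter {
    seen: HashSet<String>,
}

impl DupeFilter {
    fn new() -> Self {
        DupeFilter { seen: HashSet::new() }
    }

    /// Returns true the first time a URL is offered, false afterwards.
    /// HashSet::insert already returns false for duplicates, so the
    /// set doubles as both the memory and the decision.
    fn accept(&mut self, url: &str) -> bool {
        self.seen.insert(url.to_string())
    }
}

fn main() {
    let mut filter = DupeFilter::new();
    assert!(filter.accept("https://example.com/a"));
    assert!(!filter.accept("https://example.com/a")); // duplicate dropped
    assert!(filter.accept("https://example.com/b"));
    println!("unique urls seen: {}", filter.seen.len());
}
```

A production filter would typically normalize URLs (strip fragments, sort query parameters) before insertion so that trivially different spellings of the same page are treated as duplicates.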

Architecture

The framework is organized into several specialized crates:

spider-core

The core engine that provides the fundamental components for building web scrapers, including the main Crawler, Scheduler, and Spider trait. It manages the flow of requests and responses and coordinates concurrent operations.

spider-downloader

Provides traits and implementations for HTTP downloaders, abstracting the underlying HTTP client implementation for flexibility.

spider-macro

Contains procedural macros that reduce boilerplate code, particularly the #[scraped_item] macro for defining data structures for scraped content.

spider-middleware

Includes a comprehensive collection of middleware implementations that extend crawler functionality, such as rate limiting, retries, user-agent rotation, and cookie management.
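Conceptually, each middleware gets a chance to inspect or modify a request before it is sent. The sketch below uses invented types (`Request`, `Middleware`, `UserAgentMiddleware`) to show the shape of that idea; it is not spider-lib's actual API.

```rust
use std::collections::HashMap;

/// Toy request type; spider-lib's real Request type is richer.
#[derive(Debug, Clone)]
struct Request {
    url: String,
    headers: HashMap<String, String>,
}

/// Each middleware may mutate an outgoing request.
/// Trait and method names here are illustrative.
trait Middleware {
    fn process_request(&self, req: &mut Request);
}

/// Example: stamp a User-Agent header on every request, mirroring
/// what a user-agent rotation middleware does (minus the rotation).
struct UserAgentMiddleware {
    agent: String,
}

impl Middleware for UserAgentMiddleware {
    fn process_request(&self, req: &mut Request) {
        req.headers
            .insert("User-Agent".to_string(), self.agent.clone());
    }
}

fn main() {
    // A middleware chain is just an ordered list applied to each request.
    let chain: Vec<Box<dyn Middleware>> = vec![Box::new(UserAgentMiddleware {
        agent: "spider-lib-demo/1.0".to_string(),
    })];
    let mut req = Request {
        url: "https://example.com".to_string(),
        headers: HashMap::new(),
    };
    for mw in &chain {
        mw.process_request(&mut req);
    }
    assert_eq!(req.headers["User-Agent"], "spider-lib-demo/1.0");
    println!("GET {} with headers {:?}", req.url, req.headers);
}
```

Rate limiting, retries, and cookie handling all fit this same pattern: each is a unit in the chain that observes or rewrites traffic without the spider itself needing to know about it.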

spider-pipeline

Provides built-in pipeline implementations for processing, filtering, transforming, and storing scraped data in various formats (JSON, CSV, SQLite, etc.).

spider-util

Contains fundamental data structures, error types, and utility functions shared across all components of the framework.

Installation

Add this to your Cargo.toml:

[dependencies]
spider-lib = "1.2.1"
serde = { version = "1.0", features = ["derive"] }  # Required for #[scraped_item] macro
serde_json = "1.0"  # Required for #[scraped_item] macro

Note: When using the #[scraped_item] macro, you must also include serde and serde_json as direct dependencies in your project, as the macro generates code that references these crates directly.

Cargo Features

spider-lib provides optional Cargo features that gate specific functionality:

Middleware Features

  • middleware-cache - Enable HTTP caching capabilities for development
  • middleware-proxy - Enable proxy rotation functionality
  • middleware-user-agent - Enable user-agent rotation
  • middleware-robots - Enable robots.txt compliance checking
  • middleware-cookies - Enable cookie management

Pipeline Features

  • pipeline-csv - Enable CSV export functionality
  • pipeline-json - Enable JSON writing functionality
  • pipeline-jsonl - Enable JSONL writing functionality
  • pipeline-sqlite - Enable SQLite database functionality
  • pipeline-stream-json - Enable stream JSON functionality

Core Features

  • checkpoint - Enable checkpoint and resume functionality
  • cookie-store - Enable advanced cookie store integration (enable this alongside middleware-cookies)
  • stream - Enable stream response processing for memory-efficient handling of large responses

Important Feature Relationships

  • middleware-cookies and cookie-store complement each other: enable both to get full cookie handling across requests
  • stream enables memory-efficient processing of large responses without loading the entire body into memory
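Putting the relationships above together, a project that wants persistent cookies plus streaming would enable all three related features at once. A sketch of such a Cargo.toml, assuming the current 1.2.1 release:

```toml
[dependencies]
spider-lib = { version = "1.2.1", features = [
    "middleware-cookies",
    "cookie-store",   # enabled alongside middleware-cookies for full functionality
    "stream",
] }
```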

By default, only core functionality is included. You can enable specific features as needed:

[dependencies]
spider-lib = { version = "1.2.1", features = ["middleware-cache", "pipeline-csv"] }

Or disable default features and enable only what you need:

[dependencies]
spider-lib = { version = "1.2.1", default-features = false, features = ["core"] }

Usage

Here's a basic example of how to use the framework:

use spider_lib::prelude::*;

#[derive(Default)]
struct MySpider;

#[scraped_item]
struct MyItem {
    title: String,
    url: String,
}

#[async_trait]
impl Spider for MySpider {
    type Item = MyItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://example.com"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Custom parsing logic here
        todo!()
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(MySpider).build().await?;
    crawler.start_crawl().await
}

Important: Make sure to import the prelude with use spider_lib::prelude::*; to bring the necessary items into scope for the macro to work properly.

Contributing

We welcome contributions to the spider-lib project! Please see our CONTRIBUTING.md for guidelines on how to contribute.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Stream Response Feature

The stream feature enables memory-efficient processing of large responses without loading the entire body into memory at once. This is particularly useful when scraping large web pages or when memory usage is a concern.

To use the stream feature:

[dependencies]
spider-lib = { version = "1.2.1", features = ["stream"] }

With the stream feature enabled, you can implement both parse and parse_stream methods in your spider:

use spider_lib::prelude::*;

#[scraped_item]
struct MyItem {
    content: String,
}

struct MySpider;

#[async_trait]
impl Spider for MySpider {
    type Item = MyItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://example.com/large-page"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Traditional parsing - entire response body loaded into memory
        // ... parsing logic ...
        todo!()
    }

    #[cfg(feature = "stream")]
    async fn parse_stream(&mut self, response: StreamResponse) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Stream parsing - processes response without loading entire body
        // ... stream parsing logic ...
        todo!()
    }
}

The framework will automatically use the appropriate method based on the response type and feature configuration.

Documentation

For more detailed documentation, visit https://docs.rs/spider-lib