spider-lib

A Rust-based web scraping framework inspired by Scrapy (Python).

Overview

spider-lib is a web scraping framework written in Rust, inspired by the popular Python framework Scrapy. It provides an efficient, extensible platform for building web crawlers and scrapers, with asynchronous processing, middleware support, and flexible data pipelines.

Features

  • Asynchronous Processing: Built with async/await for efficient concurrent crawling
  • Modular Architecture: Clean separation of concerns with multiple specialized crates
  • Middleware Support: Extensible middleware system for customizing request/response handling
  • Flexible Pipelines: Multiple output options for scraped data (JSON, CSV, SQLite, etc.)
  • Rate Limiting: Built-in rate limiting to respect website policies
  • Retry Mechanisms: Automatic retry for failed requests
  • User-Agent Rotation: Automatic rotation of user agents
  • Cookie Management: Persistent cookie handling across requests
  • HTTP Caching: Response caching so repeated runs during development don't re-download pages
  • Robots.txt Compliance: Automatic adherence to robots.txt rules
  • Proxy Support: Configurable proxy server usage
  • Deduplication: Built-in duplicate detection and filtering

Architecture

The framework is organized into several specialized crates:

spider-core

The core engine that provides the fundamental components for building web scrapers, including the main Crawler, Scheduler, and Spider trait. It manages the flow of requests and responses and coordinates concurrent operations.

spider-downloader

Provides traits and implementations for HTTP downloaders, abstracting the underlying HTTP client implementation for flexibility.
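
To illustrate the idea, the sketch below wraps one downloader in another to add request logging. This is a hypothetical sketch only: the Downloader trait shape, the Request type, and the url accessor are assumptions for illustration, not the verified spider-downloader API.

use async_trait::async_trait;
use spider_lib::prelude::*;

// Hypothetical decorator over an assumed `Downloader` trait with a single
// async method; the real trait in spider-downloader may differ.
struct LoggingDownloader<D> {
    inner: D,
}

#[async_trait]
impl<D: Downloader + Send + Sync> Downloader for LoggingDownloader<D> {
    async fn download(&self, request: Request) -> Result<Response, SpiderError> {
        println!("fetching {}", request.url()); // assumed accessor
        self.inner.download(request).await
    }
}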

spider-macro

Contains procedural macros that reduce boilerplate code, particularly the #[scraped_item] macro for defining data structures for scraped content.
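
For example, mirroring the Usage section below, the macro is applied to a plain struct whose fields hold the scraped values:

use spider_macro::scraped_item;

// #[scraped_item] generates the plumbing spider-lib needs to move this
// struct through the crawler and its pipelines.
#[scraped_item]
struct Article {
    title: String,
    url: String,
}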

spider-middleware

Includes a comprehensive collection of middleware implementations that extend crawler functionality, such as rate limiting, retries, user-agent rotation, and cookie management.
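
As a sketch of how this might look when building a crawler: the with_middleware method and the two middleware constructors below are assumed names, not the verified API, and MySpider is the spider defined in the Usage section below.

use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // Hypothetical: `with_middleware`, `RateLimitMiddleware`, and
    // `RetryMiddleware` are illustrative names only.
    let crawler = CrawlerBuilder::new(MySpider)
        .with_middleware(RateLimitMiddleware::per_second(2))
        .with_middleware(RetryMiddleware::max_attempts(3))
        .build()
        .await?;
    crawler.start_crawl().await
}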

spider-pipeline

Provides built-in pipeline implementations for processing, filtering, transforming, and storing scraped data in various formats (JSON, CSV, SQLite, etc.).
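
A pipeline would plug in the same way; again a hypothetical sketch, with with_pipeline and JsonPipeline standing in for whatever the crate actually exports:

use spider_lib::prelude::*;

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // Hypothetical: `with_pipeline` and `JsonPipeline` are illustrative
    // names; see the spider-pipeline docs for the real types.
    let crawler = CrawlerBuilder::new(MySpider)
        .with_pipeline(JsonPipeline::new("items.json")) // write items to items.json
        .build()
        .await?;
    crawler.start_crawl().await
}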

spider-util

Contains fundamental data structures, error types, and utility functions shared across all components of the framework.

Installation

Add this to your Cargo.toml:

[dependencies]
spider-lib = "1.1.0"

Usage

Here's a basic example of how to use the framework:

use spider_lib::{Crawler, CrawlerBuilder, Spider, SpiderError};
use spider_lib::prelude::*;

#[derive(Default)]
struct MySpider;

#[spider_macro::scraped_item]
struct MyItem {
    title: String,
    url: String,
}

#[async_trait::async_trait]
impl Spider for MySpider {
    type Item = MyItem;

    fn start_urls(&self) -> Vec<&'static str> {
        // Seed URLs the crawler fetches first.
        vec!["https://example.com"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Extract items from `response` and return them in a ParseOutput.
        todo!()
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    // Assemble the crawler with default settings, then run it to completion.
    let crawler = CrawlerBuilder::new(MySpider).build().await?;
    crawler.start_crawl().await
}
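
In this example, CrawlerBuilder wires MySpider into the engine with default settings; start_crawl then fetches each URL returned by start_urls and hands every response to parse, whose ParseOutput carries the scraped MyItem values on to whatever pipelines are configured.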

Contributing

We welcome contributions to spider-lib! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

For more detailed documentation, visit https://docs.rs/spider-lib