spider-lib 1.3.0

A Rust-based web scraping framework inspired by Scrapy (Python).

Overview

spider-lib is a web scraping framework written in Rust and modeled on Python's Scrapy. It provides an extensible platform for building web crawlers and scrapers, with asynchronous processing, middleware support, and flexible data pipelines.

Features

  • Asynchronous Processing: Built with async/await for efficient concurrent crawling
  • Modular Architecture: Clean separation of concerns with multiple specialized crates
  • Middleware Support: Extensible middleware system for customizing request/response handling
  • Flexible Pipelines: Multiple output options for scraped data (JSON, CSV, SQLite, etc.)
  • Rate Limiting: Built-in rate limiting to respect website policies
  • Retry Mechanisms: Automatic retry for failed requests
  • User-Agent Rotation: Automatic rotation of user agents
  • Cookie Management: Persistent cookie handling across requests
  • HTTP Caching: Development-friendly caching capabilities
  • Robots.txt Compliance: Automatic adherence to robots.txt rules
  • Proxy Support: Configurable proxy server usage
  • Deduplication: Built-in duplicate detection and filtering

Architecture

The framework is organized into several specialized crates:

spider-core

The core engine that provides the fundamental components for building web scrapers, including the main Crawler, Scheduler, and Spider trait. It manages the flow of requests and responses and coordinates concurrent operations.
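
To make the request flow concrete, here is a minimal, single-threaded sketch of the idea behind a scheduler: a queue of pending URLs plus a set of URLs already seen. It is an illustration only, not spider-core's implementation. `Frontier`, `fake_parse`, and `crawl` are hypothetical names, and the real crawler runs requests concurrently.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical request frontier: a FIFO queue plus the set of URLs already scheduled.
struct Frontier {
    queue: VecDeque<String>,
    seen: HashSet<String>,
}

impl Frontier {
    fn new() -> Self {
        Frontier { queue: VecDeque::new(), seen: HashSet::new() }
    }

    /// Enqueue a URL unless it was scheduled before; returns whether it was added.
    fn push(&mut self, url: &str) -> bool {
        if self.seen.insert(url.to_string()) {
            self.queue.push_back(url.to_string());
            true
        } else {
            false
        }
    }

    fn pop(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}

/// Stand-in for downloading and parsing a page: returns the links it "found".
fn fake_parse(url: &str) -> Vec<&'static str> {
    match url {
        "https://example.com" => vec!["https://example.com/a", "https://example.com/b"],
        "https://example.com/a" => vec!["https://example.com", "https://example.com/b"],
        _ => vec![],
    }
}

/// Drain the frontier, feeding discovered links back in; returns the visit order.
fn crawl(start: &str) -> Vec<String> {
    let mut frontier = Frontier::new();
    frontier.push(start);
    let mut visited = Vec::new();
    while let Some(url) = frontier.pop() {
        for link in fake_parse(&url) {
            frontier.push(link); // duplicates are rejected by the `seen` set
        }
        visited.push(url);
    }
    visited
}

fn main() {
    for url in crawl("https://example.com") {
        println!("visited {url}");
    }
}
```

Because every URL passes through `push`, links that point back to an already scheduled page never re-enter the queue, so the loop terminates even on cyclic link graphs.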

spider-downloader

Provides traits and implementations for HTTP downloaders, abstracting the underlying HTTP client implementation for flexibility.

spider-macro

Contains procedural macros that reduce boilerplate code, particularly the #[scraped_item] macro for defining data structures for scraped content.

spider-middleware

Includes a comprehensive collection of middleware implementations that extend crawler functionality, such as rate limiting, retries, user-agent rotation, and cookie management.
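
The general shape of a middleware chain can be sketched as follows. This is an assumption-laden illustration, not spider-middleware's API: the `Request` struct, the `Middleware` trait, `UserAgentRotation`, `StaticHeader`, and `prepare` are all hypothetical and exist only to show how each middleware gets a chance to modify a request before it is sent.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical request type holding only what this sketch needs.
#[derive(Debug, Default)]
struct Request {
    url: String,
    headers: Vec<(String, String)>,
}

/// Hypothetical hook that runs before a request is sent.
trait Middleware {
    fn process_request(&self, request: &mut Request);
}

/// Assigns user agents round-robin from a fixed list.
struct UserAgentRotation {
    agents: Vec<String>,
    next: AtomicUsize,
}

impl UserAgentRotation {
    fn new(agents: &[&str]) -> Self {
        assert!(!agents.is_empty(), "at least one user agent is required");
        UserAgentRotation {
            agents: agents.iter().map(|a| a.to_string()).collect(),
            next: AtomicUsize::new(0),
        }
    }
}

impl Middleware for UserAgentRotation {
    fn process_request(&self, request: &mut Request) {
        // The atomic counter lets `&self` be shared without a lock.
        let index = self.next.fetch_add(1, Ordering::Relaxed) % self.agents.len();
        request
            .headers
            .push(("User-Agent".to_string(), self.agents[index].clone()));
    }
}

/// Adds one fixed header to every request.
struct StaticHeader(&'static str, &'static str);

impl Middleware for StaticHeader {
    fn process_request(&self, request: &mut Request) {
        request.headers.push((self.0.to_string(), self.1.to_string()));
    }
}

/// Build a request for `url` and pass it through every middleware in order.
fn prepare(chain: &[Box<dyn Middleware>], url: &str) -> Request {
    let mut request = Request { url: url.to_string(), ..Request::default() };
    for middleware in chain {
        middleware.process_request(&mut request);
    }
    request
}

fn main() {
    let chain: Vec<Box<dyn Middleware>> = vec![
        Box::new(StaticHeader("Accept", "text/html")),
        Box::new(UserAgentRotation::new(&["agent-a", "agent-b"])),
    ];
    for path in ["/1", "/2", "/3"] {
        let request = prepare(&chain, &format!("https://example.com{path}"));
        println!("{} -> {:?}", request.url, request.headers);
    }
}
```

Keeping each concern in its own type means a chain can be assembled from only the pieces a crawl needs, which is the same motivation behind the crate's per-middleware feature flags.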

spider-pipeline

Provides built-in pipeline implementations for processing, filtering, transforming, and storing scraped data. Includes pipelines for console output, deduplication, and various export formats (JSON, CSV, SQLite, JSONL, Stream JSON).
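
As a rough picture of how transforming and filtering stages compose, the sketch below runs items through a list of stages, where any stage may drop an item. It does not use spider-pipeline's types: `Item`, the `Pipeline` trait, `TrimTitle`, `DedupByUrl`, `run_pipelines`, and `sample_items` are hypothetical names chosen for this example.

```rust
use std::collections::HashSet;

/// Hypothetical scraped item.
#[derive(Debug, Clone, PartialEq)]
struct Item {
    title: String,
    url: String,
}

/// Hypothetical pipeline stage: return `Some` to pass the item on, `None` to drop it.
trait Pipeline {
    fn process(&mut self, item: Item) -> Option<Item>;
}

/// Transforming stage: strips surrounding whitespace from the title.
struct TrimTitle;

impl Pipeline for TrimTitle {
    fn process(&mut self, mut item: Item) -> Option<Item> {
        item.title = item.title.trim().to_string();
        Some(item)
    }
}

/// Filtering stage: drops any item whose URL was already seen.
#[derive(Default)]
struct DedupByUrl {
    seen: HashSet<String>,
}

impl Pipeline for DedupByUrl {
    fn process(&mut self, item: Item) -> Option<Item> {
        if self.seen.insert(item.url.clone()) {
            Some(item)
        } else {
            None
        }
    }
}

/// Run every item through the stages in order, keeping only the survivors.
fn run_pipelines(stages: &mut [Box<dyn Pipeline>], items: Vec<Item>) -> Vec<Item> {
    let mut kept = Vec::new();
    'items: for item in items {
        let mut current = item;
        for stage in stages.iter_mut() {
            match stage.process(current) {
                Some(next) => current = next,
                None => continue 'items, // a stage dropped the item
            }
        }
        kept.push(current);
    }
    kept
}

fn sample_items() -> Vec<Item> {
    let item = |title: &str, url: &str| Item { title: title.to_string(), url: url.to_string() };
    vec![
        item("  First  ", "https://example.com/a"),
        item("Second", "https://example.com/b"),
        item("First again", "https://example.com/a"),
    ]
}

fn main() {
    let mut stages: Vec<Box<dyn Pipeline>> =
        vec![Box::new(TrimTitle), Box::new(DedupByUrl::default())];
    for item in run_pipelines(&mut stages, sample_items()) {
        println!("{item:?}");
    }
}
```

An export stage (JSON, CSV, SQLite) would sit at the end of such a chain and write each surviving item instead of returning it unchanged.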

spider-util

Contains fundamental data structures, error types, and utility functions shared across all components of the framework.

Installation

Add this to your Cargo.toml:

[dependencies]
spider-lib = "1.3.0"
serde = { version = "1.0", features = ["derive"] }  # Required for #[scraped_item] macro
serde_json = "1.0"  # Required for #[scraped_item] macro

Note: When using the #[scraped_item] macro, you must also include serde and serde_json as direct dependencies in your project, as the macro generates code that references these crates directly.

Cargo Features

spider-lib provides optional Cargo features for specific functionality:

Middleware Features

  • middleware-cache - Enable HTTP caching capabilities for development
  • middleware-proxy - Enable proxy rotation functionality
  • middleware-user-agent - Enable user-agent rotation
  • middleware-robots - Enable robots.txt compliance checking
  • middleware-cookies - Enable cookie management

Pipeline Features

  • pipeline-csv - Enable CSV export functionality
  • pipeline-json - Enable JSON writing functionality
  • pipeline-jsonl - Enable JSONL writing functionality
  • pipeline-sqlite - Enable SQLite database functionality
  • pipeline-stream-json - Enable stream JSON functionality

Core Features

  • checkpoint - Enable checkpoint and resume functionality
  • cookie-store - Enable advanced cookie store integration

Important Feature Relationships

  • middleware-cookies relies on cookie-store for full functionality, so enable both together
  • If you enable cookie-store on its own, you will likely also want middleware-cookies to manage cookies effectively

By default, only core functionality is included. You can enable specific features as needed:

[dependencies]
spider-lib = { version = "1.3.0", features = ["middleware-cache", "pipeline-csv"] }

Or disable default features and enable only what you need:

[dependencies]
spider-lib = { version = "1.3.0", default-features = false, features = ["core"] }

Usage

Here's a basic example of how to use the framework:

use spider_lib::prelude::*;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use dashmap::DashMap;

#[derive(Default)]
struct MySpider;

// State for tracking information during the crawl
#[derive(Clone, Default)]
struct MySpiderState {
    page_count: Arc<AtomicUsize>,
    visited_urls: Arc<DashMap<String, bool>>,
}

impl MySpiderState {
    fn increment_page_count(&self) {
        self.page_count.fetch_add(1, Ordering::SeqCst);
    }
    
    fn mark_url_visited(&self, url: String) {
        self.visited_urls.insert(url, true);
    }
}

#[scraped_item]
struct MyItem {
    title: String,
    url: String,
}

#[async_trait]
impl Spider for MySpider {
    type Item = MyItem;
    type State = MySpiderState;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://example.com"]
    }

    async fn parse(&self, response: Response, state: &Self::State) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Update state; this can happen concurrently without blocking the spider
        state.increment_page_count();
        state.mark_url_visited(response.url.to_string());
        
        // Custom parsing logic here
        todo!()
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(MySpider).build().await?;
    crawler.start_crawl().await
}

Note: The Spider implementation now uses an immutable reference (&self) and receives a separate state parameter (state: &Self::State). This allows more efficient concurrent crawling because the spider itself no longer needs to be wrapped in a mutex.
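
The reasoning behind this design can be shown with the standard library alone. The sketch below is not spider-lib code: `Parser`, `CrawlState`, and `count_pages` are hypothetical stand-ins for a spider, its state, and the crawler's worker tasks. It assumes only that parsing takes `&self` and that the state uses atomics, as in the example above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared state: interior mutability through an atomic counter.
#[derive(Clone, Default)]
struct CrawlState {
    page_count: Arc<AtomicUsize>,
}

/// Hypothetical stand-in for a spider; it holds no mutable data of its own.
struct Parser;

impl Parser {
    /// Takes `&self`, so many workers can call it at once without a lock.
    fn parse(&self, _body: &str, state: &CrawlState) {
        state.page_count.fetch_add(1, Ordering::SeqCst);
    }
}

/// Run `workers` threads that each parse `pages_per_worker` pages.
fn count_pages(workers: usize, pages_per_worker: usize) -> usize {
    let parser = Parser;
    let state = CrawlState::default();
    thread::scope(|scope| {
        for _ in 0..workers {
            // Every worker borrows the same parser and state immutably.
            scope.spawn(|| {
                for _ in 0..pages_per_worker {
                    parser.parse("<html></html>", &state);
                }
            });
        }
    });
    state.page_count.load(Ordering::SeqCst)
}

fn main() {
    println!("pages parsed: {}", count_pages(4, 250));
}
```

If `parse` took `&mut self` instead, the workers would have to serialize on a `Mutex<Parser>`; with `&self` and atomic state, the only synchronization left is the counter increment itself.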

Important: Make sure to import the prelude with use spider_lib::prelude::*; to bring the necessary items into scope for the macro to work properly.

Contributing

We welcome contributions to the spider-lib project! Please see our CONTRIBUTING.md for guidelines on how to contribute.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Documentation

For more detailed documentation, visit https://docs.rs/spider-lib