spider-lib
A Rust-based web scraping framework inspired by Scrapy (Python).
Overview
spider-lib is a comprehensive web scraping framework written in Rust that takes inspiration from the popular Python Scrapy framework. It provides a robust, efficient, and extensible platform for building web crawlers and scrapers with features like asynchronous processing, middleware support, and flexible data pipelines.
Features
- Asynchronous Processing: Built with async/await for efficient concurrent crawling
- Modular Architecture: Clean separation of concerns with multiple specialized crates
- Middleware Support: Extensible middleware system for customizing request/response handling
- Flexible Pipelines: Multiple output options for scraped data (JSON, CSV, SQLite, etc.)
- Rate Limiting: Built-in rate limiting to respect website policies
- Retry Mechanisms: Automatic retry for failed requests
- User-Agent Rotation: Automatic rotation of user agents
- Cookie Management: Persistent cookie handling across requests
- HTTP Caching: Development-friendly caching capabilities
- Robots.txt Compliance: Automatic adherence to robots.txt rules
- Proxy Support: Configurable proxy server usage
- Deduplication: Built-in duplicate detection and filtering
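To make the deduplication idea above concrete, here is a std-only sketch of first-occurrence URL filtering. This is purely illustrative; spider-lib's built-in filter is internal to the crate and may work differently:

```rust
use std::collections::HashSet;

// Keep the first occurrence of each URL and skip repeats.
// Illustrative only: spider-lib's actual deduplication is internal.
fn dedup_urls(urls: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.iter()
        .filter(|u| seen.insert(**u)) // insert returns false for repeats
        .map(|u| u.to_string())
        .collect()
}

fn main() {
    let urls = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ];
    println!("{} unique of {}", dedup_urls(&urls).len(), urls.len());
}
```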
Architecture
The framework is organized into several specialized crates:
spider-core
The core engine that provides the fundamental components for building web scrapers, including the main Crawler, Scheduler, and Spider trait. It manages the flow of requests and responses and coordinates concurrent operations.
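The shape of these core abstractions can be sketched as follows. Note that the trait name, methods, and types below are illustrative assumptions, not spider-core's actual API; consult the crate docs for the real Spider trait:

```rust
// Hypothetical sketch of the Spider abstraction described above.
// The real trait in spider-core may have different methods and types.
trait Spider {
    fn start_urls(&self) -> Vec<String>;
    fn parse(&self, body: &str) -> Vec<String>; // extracted items
}

struct LineSpider;

impl Spider for LineSpider {
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com".to_string()]
    }
    fn parse(&self, body: &str) -> Vec<String> {
        // Toy extraction: one trimmed item per line of the body.
        body.lines().map(|l| l.trim().to_string()).collect()
    }
}

fn main() {
    let spider = LineSpider;
    for url in spider.start_urls() {
        println!("would fetch {}", url);
    }
    println!("{:?}", spider.parse(" a \n b "));
}
```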
spider-downloader
Provides traits and implementations for HTTP downloaders, abstracting the underlying HTTP client implementation for flexibility.
spider-macro
Contains procedural macros that reduce boilerplate code, particularly the #[scraped_item] macro for defining data structures for scraped content.
spider-middleware
Includes a comprehensive collection of middleware implementations that extend crawler functionality, such as rate limiting, retries, user-agent rotation, and cookie management.
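As an illustration of the retry idea mentioned above, here is a std-only sketch of bounded retries around a fallible operation. The real retry middleware wraps requests inside the crawler, not closures, and its configuration is defined by spider-middleware:

```rust
// Attempt a fallible operation up to `max` times, returning the last result.
// Sketch only: spider-middleware's retry logic applies to HTTP requests.
fn with_retries<T, E>(max: usize, mut op: impl FnMut() -> Result<T, E>) -> Result<T, E> {
    let mut last = op(); // always try at least once
    for _ in 1..max {
        if last.is_ok() {
            break;
        }
        last = op();
    }
    last
}

fn main() {
    let mut calls = 0;
    let result = with_retries(3, || {
        calls += 1;
        // Simulate a request that fails twice, then succeeds.
        if calls < 3 { Err("temporary failure") } else { Ok("fetched") }
    });
    println!("after {} calls: {:?}", calls, result);
}
```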
spider-pipeline
Provides built-in pipeline implementations for processing, filtering, transforming, and storing scraped data in various formats (JSON, CSV, SQLite, etc.).
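The general pattern behind pipelines, a chain of stages where each stage can transform or drop an item, can be sketched in std-only Rust. The trait and stages below are hypothetical; spider-pipeline defines its own trait and ships the actual JSON/CSV/SQLite writers:

```rust
// Hypothetical pipeline stage: return None to drop the item.
trait Pipeline {
    fn process(&self, item: String) -> Option<String>;
}

struct Trim;
impl Pipeline for Trim {
    fn process(&self, item: String) -> Option<String> {
        Some(item.trim().to_string())
    }
}

struct DropEmpty;
impl Pipeline for DropEmpty {
    fn process(&self, item: String) -> Option<String> {
        if item.is_empty() { None } else { Some(item) }
    }
}

// Run each item through every stage in order; a None anywhere drops it.
fn run_pipeline(stages: &[Box<dyn Pipeline>], items: Vec<String>) -> Vec<String> {
    items
        .into_iter()
        .filter_map(|mut item| {
            for stage in stages {
                item = stage.process(item)?;
            }
            Some(item)
        })
        .collect()
}

fn main() {
    let stages: Vec<Box<dyn Pipeline>> = vec![Box::new(Trim), Box::new(DropEmpty)];
    let items = vec!["  hello  ".to_string(), "   ".to_string()];
    println!("{:?}", run_pipeline(&stages, items));
}
```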
spider-util
Contains fundamental data structures, error types, and utility functions shared across all components of the framework.
Installation
Add this to your Cargo.toml:
```toml
[dependencies]
spider-lib = "1.1.1"
serde = { version = "1.0", features = ["derive"] } # Required for #[scraped_item] macro
serde_json = "1.0" # Required for #[scraped_item] macro
```
Note: When using the #[scraped_item] macro, you must also include serde and serde_json as direct dependencies in your project, as the macro generates code that references these crates directly.
Feature Flags
spider-lib provides optional Cargo features that gate specific functionality:
Middleware Features
- middleware-cache: Enable HTTP caching capabilities for development
- middleware-proxy: Enable proxy rotation functionality
- middleware-user-agent: Enable user-agent rotation
- middleware-robots: Enable robots.txt compliance checking
- middleware-cookies: Enable cookie management
Pipeline Features
- pipeline-csv: Enable CSV export functionality
- pipeline-json: Enable JSON writing functionality
- pipeline-jsonl: Enable JSONL writing functionality
- pipeline-sqlite: Enable SQLite database functionality
- pipeline-stream-json: Enable stream JSON functionality
Core Features
- checkpoint: Enable checkpoint and resume functionality
- cookie-store: Enable advanced cookie store integration (note: when using middleware-cookies, cookie-store should also be enabled)
- stream: Enable stream response processing for memory-efficient handling of large responses
Important Feature Relationships
- middleware-cookies and cookie-store are interdependent: when using middleware-cookies, cookie-store should also be enabled for full functionality
- When using cookie-store, middleware-cookies functionality may be desired for managing cookies effectively
- The stream feature enables memory-efficient processing of large responses without loading the entire body into memory
By default, only core functionality is included. You can enable specific features as needed:
```toml
[dependencies]
spider-lib = { version = "0.5.1", features = ["middleware-cache", "pipeline-csv"] }
```
Or disable default features and enable only what you need:
```toml
[dependencies]
spider-lib = { version = "0.5.1", default-features = false, features = ["core"] }
```
Usage
Here's a basic usage skeleton. This is a sketch only: it assumes a tokio runtime, and the exact Spider and Crawler APIs are defined by the crate, so consult https://docs.rs/spider-lib for the real signatures:

```rust
use spider_lib::prelude::*;

// Illustrative only: the actual Spider trait surface is defined by spider-core.
struct MySpider;

#[tokio::main]
async fn main() {
    // Construct a crawler, register MySpider, and run it here.
}
```
Important: Make sure to import the prelude with use spider_lib::prelude::*; to bring the necessary items into scope for the macro to work properly.
Contributing
We welcome contributions to the spider-lib project! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Stream Response Feature
The stream feature enables memory-efficient processing of large responses without loading the entire body into memory at once. This is particularly useful when scraping large web pages or when memory usage is a concern.
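The underlying idea, processing a body in fixed-size chunks rather than buffering it whole, can be shown with std::io::Read alone. This is a std-only illustration, not spider-lib's stream API:

```rust
use std::io::Read;

// Read a body in fixed-size chunks instead of buffering it all at once,
// the same idea the stream feature applies to HTTP response bodies.
fn process_in_chunks(mut reader: impl Read) -> std::io::Result<usize> {
    let mut buf = [0u8; 4096];
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of body
        }
        total += n; // handle each chunk here; the full body is never held in memory
    }
    Ok(total)
}

fn main() {
    let body = "x".repeat(10_000); // stand-in for a large response body
    let total = process_in_chunks(body.as_bytes()).unwrap();
    println!("processed {} bytes", total);
}
```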
To use the stream feature:
```toml
[dependencies]
spider-lib = { version = "1.2.0", features = ["stream"] }
```
With the stream feature enabled, you can implement both parse and parse_stream methods in your spider:
```rust
use spider_lib::prelude::*;

struct MySpider;

// Sketch: a spider would implement parse for fully buffered responses and
// parse_stream for chunk-by-chunk processing of large bodies. The exact
// trait signatures come from spider-core; see docs.rs/spider-lib.
```
The framework will automatically use the appropriate method based on the response type and feature configuration.
Documentation
For more detailed documentation, visit https://docs.rs/spider-lib