# spider-middleware
Provides built-in middleware implementations for the spider-lib framework.
## Overview
The spider-middleware crate contains a comprehensive collection of middleware implementations that extend the functionality of web crawlers. Middlewares intercept and process requests and responses, enabling features like rate limiting, retries, user-agent rotation, and more.
Middlewares are organized using feature flags to prevent bloat. Core middlewares are always available, while advanced features can be enabled as needed.
## Available Middlewares
### Core Middlewares (Always Available)
- Rate Limiting: Controls request rates to prevent server overload
- Retries: Automatically retries failed or timed-out requests
- Referer Management: Handles the `Referer` header
### Optional Middlewares (Feature-Gated)
- User-Agent Rotation: Manages and rotates user agents (feature: `middleware-user-agent`)
- Cookies: Persists cookies across requests to maintain sessions (feature: `middleware-cookies`)
- HTTP Caching: Caches responses to accelerate development (feature: `middleware-cache`)
- Robots.txt: Adheres to `robots.txt` rules (feature: `middleware-robots`)
- Proxy: Manages and rotates proxy servers (feature: `middleware-proxy`)
## Features
This crate uses feature flags to allow selective inclusion of middleware components:
- `core` (default): Includes core middleware functionality
- `middleware-cache`: Enables HTTP caching capabilities
- `middleware-proxy`: Enables proxy rotation functionality
- `middleware-user-agent`: Enables user-agent rotation
- `middleware-robots`: Enables robots.txt compliance checking
- `middleware-cookies`: Enables cookie management (note: requires the `cookie-store` feature in `spider-core` for full functionality)
### Important Feature Relationships
`middleware-cookies` and `cookie-store` (from `spider-core`) are interdependent: when using `middleware-cookies`, also enable `cookie-store` in `spider-core` for full functionality.
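For example, a Cargo.toml enabling both halves of that pairing might look like this (versions elided as elsewhere in this README):

```toml
[dependencies]
spider-middleware = { version = "...", features = ["middleware-cookies"] }
spider-core = { version = "...", features = ["cookie-store"] }
```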
To use only core functionality:

```toml
[dependencies]
spider-middleware = { version = "...", default-features = false, features = ["core"] }
```
To include specific middleware:

```toml
[dependencies]
spider-middleware = { version = "...", features = ["middleware-cache", "middleware-proxy"] }
```
## Architecture
Each middleware implements the `Middleware` trait, allowing it to intercept requests before they are sent and responses after they are received. This enables flexible, composable behavior customization for crawlers.
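Conceptually, the pattern looks like the minimal sketch below. This is illustrative only: the actual spider-lib trait will differ in names, asynchrony, and error handling.

```rust
// Illustrative sketch of the middleware pattern; NOT the actual spider-lib trait.
pub struct Request {
    pub url: String,
    pub headers: Vec<(String, String)>,
}

pub struct Response {
    pub status: u16,
    pub body: Vec<u8>,
}

pub trait Middleware {
    /// Inspect or modify a request before it is sent.
    fn process_request(&self, request: &mut Request);
    /// Inspect or modify a response after it is received.
    fn process_response(&self, request: &Request, response: &mut Response);
}

/// Example: a middleware that tags every outgoing request with a header.
pub struct TagHeader;

impl Middleware for TagHeader {
    fn process_request(&self, request: &mut Request) {
        request.headers.push(("X-Tag".to_string(), "1".to_string()));
    }
    fn process_response(&self, _request: &Request, _response: &mut Response) {}
}
```

Because every middleware shares the same interface, a crawler can hold them as a list of trait objects and run each one in order around every request/response pair.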
## Usage
```rust
use spider_middleware::{RateLimitMiddleware, RetryMiddleware};

// Add middlewares to your crawler (default-configured middlewares shown
// for brevity; `Crawler` stands in for the spider-lib builder type)
let crawler = Crawler::new()
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::default())
    .build()
    .await?;
```
## Middleware Types
### Rate Limiting
Controls the frequency of requests to respect server resources and avoid being blocked. The `RateLimitMiddleware` offers two rate-limiting algorithms:
#### Adaptive Limiter (Default)
Dynamically adjusts delays based on response status codes. Increases delay on errors (429, 5xx) and decreases on successful responses.
Configuration:
```rust
use spider_middleware::{AdaptiveLimiter, RateLimitMiddleware, Scope};
use std::time::Duration;

// Builder arguments reconstructed for illustration; check the crate docs
// for the exact limiter constructor and scope variant names.
let rate_limit_middleware = RateLimitMiddleware::builder()
    .scope(Scope::PerDomain) // Apply rate limits per domain (or Scope::Global)
    .limiter(AdaptiveLimiter::new(Duration::from_millis(500)).with_jitter()) // Initial delay of 500 ms with jitter
    .build();
```
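The adaptive policy described above can be sketched as a standalone function: back off on throttling or server errors, speed up on success, and keep the delay within bounds. The constants and factors below are arbitrary examples, not the crate's internals.

```rust
use std::time::Duration;

/// Illustrative adaptive delay policy: double the delay on 429/5xx,
/// halve it on success, clamped to [MIN, MAX]. Not the crate's code.
fn adjust_delay(current: Duration, status: u16) -> Duration {
    const MIN: Duration = Duration::from_millis(100);
    const MAX: Duration = Duration::from_secs(60);
    let next = if status == 429 || (500..600).contains(&status) {
        current.saturating_mul(2) // back off on throttling / server errors
    } else {
        current / 2 // recover speed on success
    };
    next.clamp(MIN, MAX)
}
```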
#### Token Bucket Limiter
Enforces a fixed requests-per-second rate regardless of response status.
Configuration:
```rust
use spider_middleware::RateLimitMiddleware;

let rate_limit_middleware = RateLimitMiddleware::builder()
    .use_token_bucket_limiter(2.0) // 2 requests per second
    .build();
```
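The token-bucket idea itself is simple enough to sketch independently of the crate: tokens refill at a fixed rate up to a capacity, and a request goes through only if a whole token is available. This is an illustration of the algorithm, not the crate's implementation.

```rust
/// Minimal token bucket: `rate` tokens per second refill up to `capacity`.
/// Illustrative only; not spider-middleware's internal code.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64, // tokens added per second
}

impl TokenBucket {
    fn new(rate: f64, capacity: f64) -> Self {
        Self { capacity, tokens: capacity, rate }
    }

    /// Advance time by `elapsed_secs`, then try to take one token.
    fn try_acquire(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.rate).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Unlike the adaptive limiter, the bucket never looks at response status: the steady-state rate is fixed, with the capacity allowing short bursts.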
### Retries
Automatically retries failed requests with configurable backoff strategies.
Configuration:
```rust
use spider_middleware::RetryMiddleware;
use std::time::Duration;

// Argument values are illustrative examples.
let retry_middleware = RetryMiddleware::new()
    .max_retries(3)                                  // Maximum 3 retry attempts
    .retry_http_codes(vec![429, 500, 502, 503, 504]) // Status codes to retry
    .backoff_factor(2.0)                             // Backoff factor for exponential backoff
    .max_delay(Duration::from_secs(60));             // Maximum delay between retries
```
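Exponential backoff with a cap works out, per attempt, to roughly `base * factor^attempt`, truncated at the maximum delay. A sketch of that arithmetic (values illustrative, not the crate's code):

```rust
use std::time::Duration;

/// Illustrative capped exponential backoff: delay = base * factor^attempt.
fn backoff_delay(base: Duration, factor: f64, attempt: u32, max: Duration) -> Duration {
    let delay = base.mul_f64(factor.powi(attempt as i32));
    delay.min(max) // never wait longer than the configured maximum
}
```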
### User-Agent Rotation
Rotates user agent strings to avoid detection and blocking. Supports multiple rotation strategies and sources.
Configuration:
```rust
use spider_middleware::{RotationStrategy, UserAgentMiddleware, UserAgentSource};
use std::path::PathBuf;
use std::time::Duration;

// Variant names, the user-agent string, and the file path below are
// reconstructed for illustration.

// Using built-in user agents
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::BuiltIn)
    .strategy(RotationStrategy::Random)
    .build()?;

// Using a custom list
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::List(vec!["Mozilla/5.0 (...)".to_string()]))
    .strategy(RotationStrategy::RoundRobin)
    .build()?;

// Using a file source
let mut file_path = PathBuf::new();
file_path.push("user_agents.txt"); // example path
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::File(file_path))
    .strategy(RotationStrategy::Sticky)
    .session_duration(Duration::from_secs(300)) // 5 minutes for a sticky session
    .build()?;
```
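Round-robin, one of the simpler rotation strategies, just cycles through the list and wraps around. A minimal sketch (illustrative, not the crate's `RotationStrategy` machinery):

```rust
/// Minimal round-robin rotation over a list of user-agent strings.
struct RoundRobin {
    items: Vec<String>,
    next: usize,
}

impl RoundRobin {
    fn new(items: Vec<String>) -> Self {
        Self { items, next: 0 }
    }

    /// Return the next user agent, wrapping around at the end of the list.
    fn next_agent(&mut self) -> &str {
        let i = self.next;
        self.next = (self.next + 1) % self.items.len();
        &self.items[i]
    }
}
```

A sticky strategy differs only in that it keeps returning the same entry until the session duration elapses, then advances.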
### Referer Management
Sets the `Referer` header on requests appropriately to simulate natural browsing behavior.
Configuration:
```rust
use spider_middleware::RefererMiddleware;

// Argument values are illustrative examples.
let referer_middleware = RefererMiddleware::new()
    .same_origin_only(true)   // Only use referers from the same origin
    .max_chain_length(10)     // Maximum number of referers to keep in memory
    .include_fragment(false); // Exclude URL fragments from the referer
```
### Cookies
Manages cookies across requests to maintain sessions.
Configuration:
```rust
use spider_middleware::CookieMiddleware;

// File paths are illustrative examples.

// Basic usage
let cookie_middleware = CookieMiddleware::new();

// Loading from a JSON file
let cookie_middleware = CookieMiddleware::from_json("cookies.json").await?;

// Loading from a Netscape cookie file
let cookie_middleware = CookieMiddleware::from_netscape_file("cookies.txt").await?;

// Loading from RFC 6265 format
let cookie_middleware = CookieMiddleware::from_rfc6265("cookies.rfc6265").await?;
```
### HTTP Caching
Caches responses locally to speed up development and reduce server load.
Configuration:
```rust
use spider_middleware::HttpCacheMiddleware;
use std::path::PathBuf;

let mut cache_dir = PathBuf::new();
cache_dir.push(".http_cache"); // example cache directory
let http_cache_middleware = HttpCacheMiddleware::builder()
    .cache_dir(cache_dir)
    .build()?;
```
### Robots.txt
Ensures compliance with robots.txt rules.
Configuration:
```rust
use spider_middleware::RobotsTxtMiddleware;
use std::time::Duration;

// Argument values are illustrative examples.
let robots_txt_middleware = RobotsTxtMiddleware::new()
    .cache_ttl(Duration::from_secs(24 * 60 * 60)) // Cache TTL: 24 hours
    .cache_capacity(100)                          // Max cache entries
    .request_timeout(Duration::from_secs(10));    // Timeout for fetching robots.txt
```
### Proxy
Manages proxy servers for requests to avoid IP-based blocking.
Configuration:
```rust
use spider_middleware::{ProxyMiddleware, ProxySource, RotationStrategy};
use std::path::PathBuf;

// Variant names, proxy URLs, file paths, and block-marker strings below
// are reconstructed for illustration.

// Using a custom list
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::List(vec!["http://proxy1.example.com:8080".to_string()]))
    .strategy(RotationStrategy::RoundRobin)
    .build()?;

// Using a file source
let mut file_path = PathBuf::new();
file_path.push("proxies.txt"); // example path
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::File(file_path))
    .strategy(RotationStrategy::Random)
    .build()?;

// Sticky failover strategy with block detection
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::File(PathBuf::from("proxies.txt")))
    .strategy(RotationStrategy::StickyFailover)
    .with_block_detection_texts(vec!["Access Denied".to_string()])
    .build()?;
```
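Text-based block detection is essentially substring matching against the response body: if any configured marker appears, the current proxy is treated as blocked and the middleware fails over. A sketch (illustrative, not the crate's code):

```rust
/// Illustrative block detection: a response counts as "blocked" if its
/// body contains any of the configured marker strings.
fn is_blocked(body: &str, markers: &[&str]) -> bool {
    markers.iter().any(|m| body.contains(m))
}
```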
## License
This project is licensed under the MIT License - see the LICENSE file for details.