# spider-middleware
Provides built-in middleware implementations for the spider-lib framework.
## Overview
The spider-middleware crate contains a comprehensive collection of middleware implementations that extend the functionality of web crawlers. Middlewares intercept and process requests and responses, enabling features like rate limiting, retries, user-agent rotation, and more.
## Available Middlewares
- Rate Limiting: Controls request rates to prevent server overload
- Retries: Automatically retries failed or timed-out requests
- User-Agent Rotation: Manages and rotates user agents
- Referer Management: Handles the `Referer` header
- Cookies: Persists cookies across requests to maintain sessions
- HTTP Caching: Caches responses to accelerate development
- Robots.txt: Adheres to `robots.txt` rules
- Proxy: Manages and rotates proxy servers
## Architecture

Each middleware implements the `Middleware` trait, which lets it intercept requests before they are sent and responses after they are received. This enables flexible, composable behavior customization for crawlers.
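As a rough sketch of that shape (the method names, signatures, and types below are assumptions for illustration, not spider-lib's actual API):

```rust
use async_trait::async_trait;

// Hypothetical stand-ins for spider-util's request/response types
pub struct Request;
pub struct Response;

/// Illustrative only; the real `Middleware` trait may differ in
/// names and signatures.
#[async_trait]
pub trait Middleware: Send + Sync {
    /// Called before a request is sent; may rewrite headers, the URL, etc.
    async fn process_request(&self, _request: &mut Request) {}

    /// Called after a response arrives; may record cookies, adjust delays, etc.
    async fn process_response(&self, _request: &Request, _response: &mut Response) {}
}
```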
## Usage
```rust
// Module paths, constructors, and arguments are assumed for illustration;
// the builder chain mirrors the original snippet.
use spider_middleware::{RateLimitMiddleware, RetryMiddleware};

// Add middlewares to your crawler
let crawler = Crawler::new()
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::default())
    .build()
    .await?;
```
## Middleware Types

### Rate Limiting

Controls the frequency of requests to respect server resources and avoid being blocked. The `RateLimitMiddleware` offers two rate-limiting algorithms:
#### Adaptive Limiter (Default)

Dynamically adjusts delays based on response status codes: the delay increases on errors (429, 5xx) and decreases on successful responses.
Configuration:
```rust
// Type names and arguments are assumed for illustration
use spider_middleware::{AdaptiveLimiter, RateLimitMiddleware, Scope};
use std::time::Duration;

let rate_limit_middleware = RateLimitMiddleware::builder()
    .scope(Scope::PerDomain) // Apply rate limits per domain (or Scope::Global)
    .limiter(AdaptiveLimiter::new(Duration::from_millis(500)).with_jitter()) // Initial delay of 500 ms with jitter
    .build();
```
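For intuition, the adaptive behavior can be sketched as a simple adjustment rule (the concrete multipliers are assumptions, not the crate's actual tuning):

```rust
use std::time::Duration;

/// Illustrative adjustment rule: back off on 429/5xx, relax on success.
fn adjust_delay(current: Duration, status: u16, min: Duration, max: Duration) -> Duration {
    if status == 429 || status >= 500 {
        current.mul_f64(2.0).min(max) // slow down after an error
    } else {
        current.mul_f64(0.9).max(min) // gradually speed back up
    }
}
```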
#### Token Bucket Limiter
Enforces a fixed requests-per-second rate regardless of response status.
Configuration:
```rust
use spider_middleware::RateLimitMiddleware;

// Argument form is assumed for illustration
let rate_limit_middleware = RateLimitMiddleware::builder()
    .use_token_bucket_limiter(2.0) // 2 requests per second
    .build();
```
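To illustrate the idea behind this limiter (not this crate's internals), here is a minimal token-bucket sketch: each request consumes a token, and tokens refill at a fixed rate.

```rust
use std::time::Instant;

/// Minimal token bucket: `rate` tokens are added per second, up to `capacity`.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64,
    last_refill: Instant,
}

impl TokenBucket {
    pub fn new(rate: f64, capacity: f64) -> Self {
        Self { capacity, tokens: capacity, rate, last_refill: Instant::now() }
    }

    /// Returns true if a request may proceed now, consuming one token.
    pub fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        // Refill proportionally to elapsed time, clamped to capacity
        self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```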
### Retries
Automatically retries failed requests with configurable backoff strategies.
Configuration:
```rust
use spider_middleware::RetryMiddleware;
use std::time::Duration;

// Argument values are illustrative
let retry_middleware = RetryMiddleware::new()
    .max_retries(3)                                  // Maximum 3 retry attempts
    .retry_http_codes(vec![429, 500, 502, 503, 504]) // Status codes to retry
    .backoff_factor(2.0)                             // Backoff factor for exponential backoff
    .max_delay(Duration::from_secs(30));             // Maximum delay between retries
```
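With exponential backoff, the wait before attempt *n* is roughly `base_delay * backoff_factor^(n-1)`, capped at `max_delay`; with a 1 s base and a factor of 2.0, retries wait about 1 s, 2 s, then 4 s. A minimal sketch of that rule (the `base` parameter and helper are assumptions, not this crate's API):

```rust
use std::time::Duration;

/// Illustrative only: delay before retry `attempt` (1-based).
fn backoff_delay(base: Duration, factor: f64, attempt: u32, max_delay: Duration) -> Duration {
    base.mul_f64(factor.powi(attempt as i32 - 1)).min(max_delay)
}
```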
### User-Agent Rotation
Rotates user agent strings to avoid detection and blocking. Supports multiple rotation strategies and sources.
Configuration:
```rust
// Enum variants, method arguments, and file names below are assumed
// for illustration
use spider_middleware::{RotationStrategy, UserAgentMiddleware, UserAgentSource};
use std::path::PathBuf;
use std::time::Duration;

// Using built-in user agents
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::BuiltIn)
    .strategy(RotationStrategy::Random)
    .build()?;

// Using a custom list
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::List(vec!["MyCrawler/1.0".to_string()]))
    .strategy(RotationStrategy::RoundRobin)
    .build()?;

// Using a file source
let mut file_path = PathBuf::new();
file_path.push("user_agents.txt");
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::File(file_path))
    .strategy(RotationStrategy::StickySession)
    .session_duration(Duration::from_secs(300)) // 5 minutes for sticky session
    .build()?;
```
### Referer Management

Manages the `Referer` header on outgoing requests to simulate natural browsing behavior.
Configuration:
```rust
use spider_middleware::RefererMiddleware;

// Argument values are illustrative
let referer_middleware = RefererMiddleware::new()
    .same_origin_only(true)   // Only use referers from the same origin
    .max_chain_length(10)     // Maximum number of referers to keep in memory
    .include_fragment(false); // Exclude URL fragments from the referer
```
### Cookies
Manages cookies across requests to maintain sessions.
Configuration:
```rust
use spider_middleware::CookieMiddleware;

// File names below are hypothetical examples

// Basic usage
let cookie_middleware = CookieMiddleware::new();

// Loading from a JSON file
let cookie_middleware = CookieMiddleware::from_json("cookies.json").await?;

// Loading from a Netscape cookie file
let cookie_middleware = CookieMiddleware::from_netscape_file("cookies.txt").await?;

// Loading from RFC 6265 format
let cookie_middleware = CookieMiddleware::from_rfc6265("cookies.txt").await?;
```
### HTTP Caching
Caches responses locally to speed up development and reduce server load.
Configuration:
```rust
use spider_middleware::HttpCacheMiddleware;
use std::path::PathBuf;

// Cache directory name is a hypothetical example
let mut cache_dir = PathBuf::new();
cache_dir.push(".http_cache");
let http_cache_middleware = HttpCacheMiddleware::builder()
    .cache_dir(cache_dir)
    .build()?;
```
### Robots.txt

Ensures compliance with `robots.txt` rules.
Configuration:
```rust
use spider_middleware::RobotsTxtMiddleware;
use std::time::Duration;

// Argument values are illustrative
let robots_txt_middleware = RobotsTxtMiddleware::new()
    .cache_ttl(Duration::from_secs(24 * 60 * 60)) // Cache TTL: 24 hours
    .cache_capacity(1_000)                        // Max cache entries
    .request_timeout(Duration::from_secs(10));    // Timeout for fetching robots.txt
```
### Proxy
Manages proxy servers for requests to avoid IP-based blocking.
Configuration:
```rust
// Enum variants, method arguments, addresses, and file names below are
// assumed for illustration
use spider_middleware::{ProxyMiddleware, ProxyRotationStrategy, ProxySource};
use std::path::PathBuf;

// Using a custom list
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::List(vec!["http://127.0.0.1:8080".to_string()]))
    .strategy(ProxyRotationStrategy::RoundRobin)
    .build()?;

// Using a file source
let mut file_path = PathBuf::new();
file_path.push("proxies.txt");
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::File(file_path))
    .strategy(ProxyRotationStrategy::Random)
    .build()?;

// Sticky failover strategy with block detection
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::List(vec!["http://127.0.0.1:8080".to_string()]))
    .strategy(ProxyRotationStrategy::StickyFailover)
    .with_block_detection_texts(vec!["Access Denied".to_string()])
    .build()?;
```
## Dependencies
This crate depends on:
- spider-util: For request and response data structures
- Various external crates for specific functionality (`governor` for rate limiting, `reqwest` for HTTP operations, etc.)
## License
This project is licensed under the MIT License - see the LICENSE file for details.