spider-middleware 0.1.4

Middleware implementations for the spider-lib web scraping framework.

Overview

The spider-middleware crate contains a comprehensive collection of middleware implementations that extend the functionality of web crawlers. Middlewares intercept and process requests and responses, enabling features like rate limiting, retries, user-agent rotation, and more.

Middlewares are organized using feature flags to prevent bloat. Core middlewares are always available, while advanced features can be enabled as needed.

Available Middlewares

Core Middlewares (Always Available)

  • Rate Limiting: Controls request rates to prevent server overload
  • Retries: Automatically retries failed or timed-out requests
  • Referer Management: Handles the Referer header

Optional Middlewares (Feature-Gated)

  • User-Agent Rotation: Manages and rotates user agents (feature: middleware-user-agent)
  • Cookies: Persists cookies across requests to maintain sessions (feature: middleware-cookies)
  • HTTP Caching: Caches responses to accelerate development (feature: middleware-cache)
  • Robots.txt: Adheres to robots.txt rules (feature: middleware-robots)
  • Proxy: Manages and rotates proxy servers (feature: middleware-proxy)

Features

This crate uses feature flags to allow selective inclusion of middleware components:

  • core (default): Includes core middleware functionality
  • middleware-cache: Enables HTTP caching capabilities
  • middleware-proxy: Enables proxy rotation functionality
  • middleware-user-agent: Enables user-agent rotation
  • middleware-robots: Enables robots.txt compliance checking
  • middleware-cookies: Enables cookie management (Note: Requires cookie-store feature in spider-core for full functionality)

Important Feature Relationships

  • middleware-cookies works together with the cookie-store feature of spider-core: enable both for full cookie functionality
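
For example, enabling cookie support end to end means turning on both features together (version numbers elided, as elsewhere in this document):

```toml
[dependencies]
spider-core = { version = "...", features = ["cookie-store"] }
spider-middleware = { version = "...", features = ["middleware-cookies"] }
```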

To use only core functionality:

[dependencies]
spider-middleware = { version = "...", default-features = false, features = ["core"] }

To include specific middleware:

[dependencies]
spider-middleware = { version = "...", features = ["middleware-cache", "middleware-proxy"] }

Architecture

Each middleware implements the Middleware trait, allowing them to intercept requests before they're sent and responses after they're received. This enables flexible, composable behavior customization for crawlers.
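
The exact trait is defined in spider-lib; as a rough sketch of the shape such a trait can take (the names and signatures here are illustrative, not the crate's actual API):

```rust
// Illustrative sketch only -- the real Middleware trait lives in spider-lib
// and its exact signatures may differ.
struct Request { url: String, headers: Vec<(String, String)> }
#[allow(dead_code)]
struct Response { status: u16, body: String }

trait Middleware {
    // Called before a request is sent; may mutate it.
    fn process_request(&self, _req: &mut Request) {}
    // Called after a response is received; may inspect or mutate it.
    fn process_response(&self, _req: &Request, _resp: &mut Response) {}
}

// A middleware that stamps a header onto every outgoing request.
struct HeaderMiddleware;
impl Middleware for HeaderMiddleware {
    fn process_request(&self, req: &mut Request) {
        req.headers.push(("User-Agent".into(), "spider".into()));
    }
}

fn main() {
    // The crawler walks the chain, giving each middleware a chance to act.
    let chain: Vec<Box<dyn Middleware>> = vec![Box::new(HeaderMiddleware)];
    let mut req = Request { url: "https://example.com".into(), headers: vec![] };
    for mw in &chain {
        mw.process_request(&mut req);
    }
    println!("{} header(s) set on {}", req.headers.len(), req.url);
}
```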

Usage

use spider_middleware::rate_limit::RateLimitMiddleware;
use spider_middleware::retry::RetryMiddleware;

// Add middlewares to your crawler (CrawlerBuilder and MySpider come from your spider-lib setup)
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;

Middleware Types

Rate Limiting

Controls the frequency of requests to respect server resources and avoid being blocked. The RateLimitMiddleware offers two different rate limiting algorithms:

Adaptive Limiter (Default)

Dynamically adjusts delays based on response status codes. Increases delay on errors (429, 5xx) and decreases on successful responses.

Configuration:

use spider_middleware::rate_limit::{AdaptiveLimiter, RateLimitMiddleware, Scope};
use std::time::Duration;

let rate_limit_middleware = RateLimitMiddleware::builder()
    .scope(Scope::Domain)  // Apply rate limits per domain (or Scope::Global)
    .limiter(AdaptiveLimiter::new(Duration::from_millis(500), true))  // Initial delay of 500ms with jitter
    .build();
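
The adjustment rule can be pictured with a small sketch; the multipliers and bounds below are illustrative, not the crate's actual constants:

```rust
use std::time::Duration;

// Illustrative adaptive-delay rule: back off on 429/5xx, relax on success.
// The crate's real limiter also applies jitter and bounds.
fn adjust_delay(current: Duration, status: u16) -> Duration {
    if status == 429 || (500..600).contains(&status) {
        current * 2           // server is struggling: double the delay
    } else {
        current.mul_f64(0.9)  // healthy response: shrink the delay by 10%
    }
}

fn main() {
    let mut delay = Duration::from_millis(500);
    for status in [200, 429, 200] {
        delay = adjust_delay(delay, status);
        println!("status {status} -> next delay {delay:?}");
    }
}
```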

Token Bucket Limiter

Enforces a fixed requests-per-second rate regardless of response status.

Configuration:

use spider_middleware::rate_limit::RateLimitMiddleware;

let rate_limit_middleware = RateLimitMiddleware::builder()
    .use_token_bucket_limiter(2)  // 2 requests per second
    .build();
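
Conceptually, a token bucket refills at the configured rate and each request spends one token; a minimal sketch of the idea (not the crate's implementation):

```rust
use std::time::Instant;

// Minimal token bucket: `rate` tokens are added per second, capped at
// `capacity`; a request may proceed only when a whole token is available.
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    rate: f64, // tokens per second
    last: Instant,
}

impl TokenBucket {
    fn new(rate: f64) -> Self {
        TokenBucket { capacity: rate, tokens: rate, rate, last: Instant::now() }
    }

    fn try_acquire(&mut self) -> bool {
        let now = Instant::now();
        // Refill proportionally to the time elapsed since the last check.
        self.tokens = (self.tokens + self.rate * now.duration_since(self.last).as_secs_f64())
            .min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(2.0); // 2 requests per second
    // The bucket starts full (2 tokens), so the third immediate
    // acquisition has to wait for a refill.
    println!("{} {} {}", bucket.try_acquire(), bucket.try_acquire(), bucket.try_acquire());
}
```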

Retries

Automatically retries failed requests with configurable backoff strategies.

Configuration:

use spider_middleware::retry::RetryMiddleware;
use std::time::Duration;

let retry_middleware = RetryMiddleware::new()
    .max_retries(3)  // Maximum 3 retry attempts
    .retry_http_codes(vec![500, 502, 503, 504, 408, 429])  // Status codes to retry
    .backoff_factor(1.0)  // Backoff factor for exponential backoff
    .max_delay(Duration::from_secs(180));  // Maximum delay between retries
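
Assuming the common exponential formula delay = backoff_factor × 2^attempt (the crate's exact formula may differ), the schedule produced by the configuration above can be sketched as:

```rust
use std::time::Duration;

// Hypothetical exponential-backoff schedule: factor * 2^attempt seconds,
// capped at `max_delay`. The crate's actual formula may differ.
fn backoff_delay(attempt: u32, factor: f64, max_delay: Duration) -> Duration {
    let secs = factor * f64::powi(2.0, attempt as i32);
    Duration::from_secs_f64(secs).min(max_delay)
}

fn main() {
    let max = Duration::from_secs(180);
    for attempt in 0..3 {
        println!("retry {attempt}: wait {:?}", backoff_delay(attempt, 1.0, max));
    }
}
```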

User-Agent Rotation

Rotates user agent strings to avoid detection and blocking. Supports multiple rotation strategies and sources.

Configuration:

use spider_middleware::user_agent::{UserAgentMiddleware, UserAgentSource, UserAgentRotationStrategy, BuiltinUserAgentList};
use std::path::PathBuf;
use std::time::Duration;

// Using built-in user agents
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::Builtin(BuiltinUserAgentList::Random))
    .strategy(UserAgentRotationStrategy::Random)
    .build()?;

// Using custom list
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::List(vec![
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36".to_string(),
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36".to_string(),
    ]))
    .strategy(UserAgentRotationStrategy::Sequential)
    .build()?;

// Using file source
let file_path = PathBuf::from("user-agents.txt");
let user_agent_middleware = UserAgentMiddleware::builder()
    .source(UserAgentSource::File(file_path))
    .strategy(UserAgentRotationStrategy::Sticky)
    .session_duration(Duration::from_secs(300))  // 5 minutes for sticky session
    .build()?;
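
The strategies differ only in how the next agent is picked; a sketch of sequential vs. sticky selection (illustrative, not the crate's code):

```rust
use std::time::{Duration, Instant};

// Sequential rotation: cycle through the list in order.
fn next_sequential(agents: &[&str], index: &mut usize) -> String {
    let agent = agents[*index % agents.len()].to_string();
    *index += 1;
    agent
}

// Sticky rotation: keep the same agent until the session duration elapses.
struct Sticky { current: usize, since: Instant, session: Duration }

impl Sticky {
    fn next(&mut self, agents: &[&str]) -> String {
        if self.since.elapsed() >= self.session {
            self.current = (self.current + 1) % agents.len();
            self.since = Instant::now();
        }
        agents[self.current].to_string()
    }
}

fn main() {
    let agents = ["agent-a", "agent-b"];
    let mut i = 0;
    // Sequential cycles a, b, a, ...
    println!("{} {} {}",
        next_sequential(&agents, &mut i),
        next_sequential(&agents, &mut i),
        next_sequential(&agents, &mut i));
    // Sticky keeps returning the same agent within the session window.
    let mut sticky = Sticky { current: 0, since: Instant::now(), session: Duration::from_secs(300) };
    println!("sticky: {}", sticky.next(&agents));
}
```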

Referer Management

Sets the Referer header on outgoing requests to simulate natural browsing behavior.

Configuration:

use spider_middleware::referer::RefererMiddleware;

let referer_middleware = RefererMiddleware::new()
    .same_origin_only(true)       // Only use referers from the same origin
    .max_chain_length(1000)       // Maximum number of referers to keep in memory
    .include_fragment(false);     // Exclude URL fragments from referer

Cookies

Manages cookies across requests to maintain sessions.

Configuration:

use spider_middleware::cookies::CookieMiddleware;

// Basic usage
let cookie_middleware = CookieMiddleware::new();

// Loading from JSON file
let cookie_middleware = CookieMiddleware::from_json("cookies.json").await?;

// Loading from Netscape cookie file
let cookie_middleware = CookieMiddleware::from_netscape_file("cookies.txt").await?;

// Loading from RFC6265 format
let cookie_middleware = CookieMiddleware::from_rfc6265("cookies.rfc6265").await?;
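
The Netscape format is a plain text file with seven tab-separated fields per line (domain, include-subdomains flag, path, secure flag, expiry, name, value); a sketch of parsing one entry (spider-middleware's actual parser may differ):

```rust
// Parse one line of a Netscape cookie file. Returns just (name, value);
// a real parser keeps the domain, path, expiry, and flag fields too.
fn parse_netscape_line(line: &str) -> Option<(String, String)> {
    if line.starts_with('#') {
        return None; // comment line
    }
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 7 {
        return None; // malformed line
    }
    Some((fields[5].to_string(), fields[6].to_string()))
}

fn main() {
    let line = "example.com\tTRUE\t/\tFALSE\t1735689600\tsession\tabc123";
    println!("{:?}", parse_netscape_line(line));
}
```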

HTTP Caching

Caches responses locally to speed up development and reduce server load.

Configuration:

use spider_middleware::http_cache::HttpCacheMiddleware;
use std::path::PathBuf;

let cache_dir = PathBuf::from("cache");

let http_cache_middleware = HttpCacheMiddleware::builder()
    .cache_dir(cache_dir)
    .build()?;
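
A typical on-disk cache derives each file name deterministically from the request URL, for example by hashing it; roughly (the crate's actual layout may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::{Path, PathBuf};

// Derive a deterministic cache file path from a URL by hashing it,
// so repeated requests for the same URL hit the same file.
fn cache_path(cache_dir: &Path, url: &str) -> PathBuf {
    let mut hasher = DefaultHasher::new();
    url.hash(&mut hasher);
    cache_dir.join(format!("{:016x}.cache", hasher.finish()))
}

fn main() {
    let dir = PathBuf::from("cache");
    println!("{}", cache_path(&dir, "https://example.com/page").display());
}
```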

Robots.txt

Ensures compliance with robots.txt rules.

Configuration:

use spider_middleware::robots_txt::RobotsTxtMiddleware;
use std::time::Duration;

let robots_txt_middleware = RobotsTxtMiddleware::new()
    .cache_ttl(Duration::from_secs(86400))      // Cache TTL: 24 hours
    .cache_capacity(10_000)                     // Max cache entries
    .request_timeout(Duration::from_secs(5));   // Timeout for fetching robots.txt

Proxy

Manages proxy servers for requests to avoid IP-based blocking.

Configuration:

use spider_middleware::proxy::{ProxyMiddleware, ProxySource, ProxyRotationStrategy};
use std::path::PathBuf;

// Using custom list
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::List(vec![
        "http://proxy1.example.com:8080".to_string(),
        "http://proxy2.example.com:8080".to_string(),
    ]))
    .strategy(ProxyRotationStrategy::Sequential)
    .build()?;

// Using file source
let file_path = PathBuf::from("proxies.txt");
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::File(file_path))
    .strategy(ProxyRotationStrategy::Random)
    .build()?;

// Sticky failover strategy with block detection
let proxy_middleware = ProxyMiddleware::builder()
    .source(ProxySource::List(vec![
        "http://proxy1.example.com:8080".to_string(),
        "http://proxy2.example.com:8080".to_string(),
    ]))
    .strategy(ProxyRotationStrategy::StickyFailover)
    .with_block_detection_texts(vec!["Access Denied".to_string()])
    .build()?;
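
Block detection can be as simple as scanning the response body for known ban phrases and rotating to the next proxy on a match; a sketch of the check (illustrative):

```rust
// Return true if the response body contains any configured ban phrase,
// signalling that the current proxy should be rotated out.
fn is_blocked(body: &str, block_texts: &[&str]) -> bool {
    block_texts.iter().any(|text| body.contains(text))
}

fn main() {
    let block_texts = ["Access Denied"];
    println!("{}", is_blocked("<html>Access Denied</html>", &block_texts));
}
```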

License

This project is licensed under the MIT License - see the LICENSE file for details.