Crate spider_middleware

§spider-middleware

Provides built-in middleware implementations for the spider-lib framework.

§Overview

The spider-middleware crate contains a comprehensive collection of middleware implementations that extend the functionality of web crawlers. Middlewares intercept and process requests and responses, enabling features like rate limiting, retries, user-agent rotation, and more.

§Available Middlewares

  • Rate Limiting: Controls request rates to prevent server overload
  • Retries: Automatically retries failed or timed-out requests
  • User-Agent Rotation: Manages and rotates user agents
  • Referer Management: Handles the Referer header
  • Cookies: Persists cookies across requests to maintain sessions
  • HTTP Caching: Caches responses to accelerate development
  • Robots.txt: Adheres to robots.txt rules
  • Proxy: Manages and rotates proxy servers
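
To illustrate the first bullet, here is a minimal sketch of the core idea behind rate limiting: enforce a minimum interval between consecutive requests. The `RateLimiter` type and its methods are illustrative only and are not the crate's actual API.

```rust
use std::time::{Duration, Instant};

// Illustrative rate limiter: tracks the time of the last request and
// reports how long the caller should wait before sending the next one.
struct RateLimiter {
    min_interval: Duration,
    last_request: Option<Instant>,
}

impl RateLimiter {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_request: None }
    }

    // Returns the remaining delay before the next request is allowed,
    // then records `now` as the latest request time.
    fn delay_needed(&mut self, now: Instant) -> Duration {
        let delay = match self.last_request {
            Some(last) => self.min_interval.saturating_sub(now - last),
            None => Duration::ZERO,
        };
        self.last_request = Some(now);
        delay
    }
}
```

A real middleware would sleep for the returned duration before forwarding the request; this sketch only computes the delay.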

§Architecture

Each middleware implements the Middleware trait, allowing it to intercept requests before they're sent and responses after they're received. This enables flexible, composable customization of crawler behavior.
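
The request/response interception described above can be sketched as a trait with two hooks. This is a simplified illustration with stand-in `Request` and `Response` types, not the actual trait signature from spider-lib (which is likely async and fallible):

```rust
// Stand-in types for illustration; the real crate's types carry far more state.
struct Request { url: String, headers: Vec<(String, String)> }
struct Response { status: u16, body: String }

// Sketch of a middleware interface: hooks on both sides of the request cycle,
// with no-op defaults so implementors override only what they need.
trait Middleware {
    // Called before the request is sent; may mutate it (add headers, etc.).
    fn process_request(&self, _req: &mut Request) {}
    // Called after the response is received; may inspect or mutate it.
    fn process_response(&self, _req: &Request, _res: &mut Response) {}
}

// Example implementor: stamps a User-Agent header onto outgoing requests.
struct UserAgentMiddleware { agent: String }

impl Middleware for UserAgentMiddleware {
    fn process_request(&self, req: &mut Request) {
        req.headers.push(("User-Agent".to_string(), self.agent.clone()));
    }
}
```

Composing middlewares then amounts to running each one's `process_request` in order before dispatch, and each `process_response` after receipt.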

§Example

use spider_middleware::rate_limit::RateLimitMiddleware;
use spider_middleware::retry::RetryMiddleware;

// Add middlewares to your crawler. `MySpider` is a user-defined spider type,
// and this snippet runs inside an async function that can propagate errors.
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::new())
    .build()
    .await?;

Modules§

cookies
Cookie Middleware to manage the Set-Cookie header
http_cache
HTTP Cache Middleware for caching web responses.
middleware
Core Middleware trait and related types for the spider-core framework.
prelude
Commonly used items from the spider-middleware crate.
proxy
Auto-Rotate Proxy Middleware for rotating proxies during crawling.
rate_limit
Rate Limit Middleware for controlling request frequency.
referer
Referer Middleware for managing HTTP Referer headers.
request
Data structures for representing HTTP requests in spider-lib.
retry
Retry Middleware for handling failed requests.
robots_txt
Robots.txt Middleware for respecting website crawling policies.
user_agent
User-Agent Middleware for rotating User-Agents during crawling.

Structs§

Request
Represents an HTTP request to be sent during crawling.
Response
Represents an HTTP response received from a server.