# spider-middleware
spider-middleware contains the middleware layer used by the crawler runtime. This is where request and response behavior can be adjusted without pushing transport concerns into the downloader or page-specific logic into the spider.
For most application code, middleware is enabled through spider-lib feature flags. This crate becomes more useful when you want to work against the middleware trait directly or publish reusable middleware.
## When to use it directly
Use spider-middleware if you want to:
- compose middleware without the facade crate
- implement custom middleware against the shared runtime contract
- publish middleware for other spider-core users
If your only goal is to enable built-in middleware in an app, the root crate is still the smoother path.
## Installation

```toml
[dependencies]
spider-middleware = "0.3.6"
```
## Built-in middleware

Always available:

| Type | Purpose |
|---|---|
| `RateLimitMiddleware` | Smooths request throughput. |
| `RetryMiddleware` | Retries failed requests according to the retry policy. |
| `RefererMiddleware` | Sets the `Referer` header on follow-up requests. |
Feature-gated modules:

| Feature | Type | Use case |
|---|---|---|
| `middleware-cache` | `HttpCacheMiddleware` | Reuse cached responses. |
| `middleware-autothrottle` | `AutoThrottleMiddleware` | Adapt crawl pace to observed conditions. |
| `middleware-proxy` | `ProxyMiddleware` | Route traffic through proxies. |
| `middleware-user-agent` | `UserAgentMiddleware` | Set or rotate user agents. |
| `middleware-robots` | `RobotsTxtMiddleware` | Respect robots.txt. |
| `middleware-cookies` | `CookieMiddleware` | Store and attach cookies. |
## Runtime usage

A typical builder chain looks like the following. The exact import list depends on which middleware you enable, and the `Default` constructors here are illustrative:

```rust
use spider_middleware::{RateLimitMiddleware, RefererMiddleware, RetryMiddleware};

let crawler = CrawlerBuilder::new()
    .add_middleware(RefererMiddleware::default())
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RetryMiddleware::default())
    .build()
    .await?;
```
## Custom middleware example

The skeleton below assumes the crate exposes a middleware trait with the hooks described under "Hook lifecycle"; the trait name, associated types, and method signatures shown here are illustrative and may differ from the real contract:

```rust
use async_trait::async_trait;
use spider_middleware::{Middleware, MiddlewareResult, Request};

/// Logs every outgoing request before it reaches the downloader.
struct LoggingMiddleware;

#[async_trait]
impl Middleware for LoggingMiddleware {
    async fn process_request(&self, request: Request) -> MiddlewareResult<Request> {
        // Illustrative accessor; the real Request type may expose the URL differently.
        println!("fetching {}", request.url());
        Ok(request)
    }
}
```

Wire it into the runtime with `CrawlerBuilder::add_middleware(...)`.
## Hook lifecycle

Middleware is easier to reason about if you treat it as three distinct hooks:

- `process_request` runs before download and can rewrite, drop, or short-circuit a request.
- `process_response` runs after a successful download and can rewrite, drop, or retry.
- `handle_error` runs on download failure and can propagate, drop, or retry.

In practice, request-shaping concerns belong in `process_request`, status/body-based policy belongs in `process_response`, and recovery policy belongs in `handle_error`.
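The request-side half of this contract can be sketched with a deliberately simplified, synchronous trait. Everything here (`RequestAction`, `HttpsUpgrade`, `BlockHost`, the chain runner) is illustrative and not part of spider-middleware's API; the point is only how rewrite/drop decisions compose along a chain:

```rust
/// What a middleware decided to do with a request (simplified sketch).
#[derive(Debug, Clone, PartialEq)]
enum RequestAction {
    Continue(String), // carry on with a (possibly rewritten) URL
    Drop,             // stop processing this request entirely
}

trait Middleware {
    fn process_request(&self, url: String) -> RequestAction;
}

/// A request-shaping middleware: rewrites before download.
struct HttpsUpgrade;

impl Middleware for HttpsUpgrade {
    fn process_request(&self, url: String) -> RequestAction {
        RequestAction::Continue(url.replace("http://", "https://"))
    }
}

/// A filtering middleware: drops requests to a blocked host.
struct BlockHost(&'static str);

impl Middleware for BlockHost {
    fn process_request(&self, url: String) -> RequestAction {
        if url.contains(self.0) {
            RequestAction::Drop
        } else {
            RequestAction::Continue(url)
        }
    }
}

/// Run each middleware in order, threading the rewritten URL through;
/// a Drop short-circuits the rest of the chain.
fn run_chain(chain: &[&dyn Middleware], url: &str) -> RequestAction {
    let mut current = url.to_string();
    for mw in chain {
        match mw.process_request(current) {
            RequestAction::Continue(next) => current = next,
            RequestAction::Drop => return RequestAction::Drop,
        }
    }
    RequestAction::Continue(current)
}

fn main() {
    let upgrade = HttpsUpgrade;
    let block = BlockHost("tracker.example");
    let chain: [&dyn Middleware; 2] = [&upgrade, &block];

    assert_eq!(
        run_chain(&chain, "http://example.com/"),
        RequestAction::Continue("https://example.com/".to_string())
    );
    assert_eq!(run_chain(&chain, "http://tracker.example/x"), RequestAction::Drop);
    println!("ok");
}
```

The same shape applies to `process_response` and `handle_error`: each hook either passes an (optionally rewritten) value onward or returns a terminal decision that short-circuits the rest of the chain.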
## Feature flags

```toml
[dependencies]
spider-middleware = { version = "0.3.6", features = ["middleware-robots", "middleware-user-agent"] }
```

When you depend on the root crate instead, enable the same feature names on `spider-lib`.
## Ordering matters

A reasonable default order for many crawlers is:

1. `RefererMiddleware`
2. `UserAgentMiddleware`
3. `ProxyMiddleware`
4. `RateLimitMiddleware`
5. `AutoThrottleMiddleware`
6. `RetryMiddleware`
7. `HttpCacheMiddleware`
8. `RobotsTxtMiddleware`
9. `CookieMiddleware`

That is only a starting point. Retry, cache, robots, and cookie behavior all depend on order, so it is worth being intentional.
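One concrete way order changes behavior: a middleware that short-circuits (as a cache hit does) hides the request from everything placed after it. The sketch below uses a simplified synchronous trait; `CacheHit`, `Counter`, and the `Option`-based short-circuit are hypothetical stand-ins, not spider-middleware types:

```rust
use std::cell::Cell;

trait Middleware {
    // Some(response) short-circuits the chain; None passes the request along.
    fn process_request(&self, url: &str) -> Option<String>;
}

/// Stands in for HttpCacheMiddleware answering from cache.
struct CacheHit;

impl Middleware for CacheHit {
    fn process_request(&self, _url: &str) -> Option<String> {
        Some("cached body".to_string())
    }
}

/// Counts how many requests actually reach it.
struct Counter(Cell<u32>);

impl Middleware for Counter {
    fn process_request(&self, _url: &str) -> Option<String> {
        self.0.set(self.0.get() + 1);
        None
    }
}

fn run(chain: &[&dyn Middleware], url: &str) -> Option<String> {
    for mw in chain {
        if let Some(response) = mw.process_request(url) {
            return Some(response);
        }
    }
    None
}

fn main() {
    let counter = Counter(Cell::new(0));

    // Cache first: on a hit, the counter never sees the request.
    run(&[&CacheHit, &counter], "https://example.com/");
    assert_eq!(counter.0.get(), 0);

    // Counter first: it observes every request, even ones the cache answers.
    run(&[&counter, &CacheHit], "https://example.com/");
    assert_eq!(counter.0.get(), 1);
    println!("ok");
}
```

The same reasoning applies to the real stack: whether `RetryMiddleware` sits before or after `HttpCacheMiddleware` decides whether retries can be answered from cache, and robots or cookie handling placed after a short-circuiting middleware may never run.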
## Related crates
## License

MIT. See LICENSE.