# spider-lib
A modular Rust web scraping framework inspired by Scrapy.
spider-lib is the facade crate for this workspace. It re-exports crawler runtime, downloader, middleware, pipelines, utility types, and macros so you can start with one dependency and enable only the features you need.
## Table of Contents
- Workspace Crates
- Architecture at a Glance
- Installation
- Quick Start
- Downloader Usage
- Middleware Usage
- Pipeline Usage
- Feature Flag Cookbook
- Development
- Documentation
- License
## Workspace Crates

- `spider-core`: crawler runtime, spider trait, scheduler, builder, state, and stats.
- `spider-downloader`: downloader traits and the reqwest-based downloader implementation.
- `spider-macro`: procedural macros such as `#[scraped_item]`.
- `spider-middleware`: retry, rate limiting, robots, cookies, proxy, cache, and user-agent middleware.
- `spider-pipeline`: item processing and output pipelines (JSON, JSONL, CSV, SQLite, stream JSON).
- `spider-util`: shared request/response/item/error types and helper utilities.
## Architecture at a Glance
Spider generates initial URLs and parses responses, while the crawler orchestrates request execution and data flow.
```text
Spider::start_urls
  -> Scheduler
  -> Downloader (default: reqwest)
  -> Middleware chain (before/after download)
  -> Spider::parse(Response) -> ParseOutput { requests, items }
  -> Pipeline chain (transform/validate/dedup/export)
```
## Installation
```toml
[dependencies]
spider-lib = "3.0.0"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
```
`serde` and `serde_json` are required when using `#[scraped_item]`.
## Quick Start
A minimal spider might look like this (names, trait methods, and the runtime entry point are illustrative; consult the examples for a complete program):

```rust
use spider_lib::prelude::*;

// #[scraped_item] marks the struct the spider produces.
#[scraped_item]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

// impl Spider for QuotesSpider { ... }  // start URLs + parse()

#[tokio::main]
async fn main() -> Result<(), Error> {
    let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
    crawler.run().await?;
    Ok(())
}
```
Complete, runnable programs are maintained as examples in the repository.
## Downloader Usage

### Default Downloader
CrawlerBuilder uses a reqwest-based downloader by default, so no extra setup is needed for most projects.
### Build a Custom Downloader
Use a custom downloader when you need custom transport behavior (special auth, alternate HTTP stack, tracing, etc.).
Trait contract (`Downloader`):

```rust
use async_trait::async_trait;
use spider_lib::prelude::*;

// Import path and method name are illustrative; see spider-downloader
// for the exact trait definition.
pub struct MyDownloader;

#[async_trait]
impl Downloader for MyDownloader {
    async fn download(&self, request: Request) -> Result<Response, Error> {
        // Custom transport goes here: special auth, alternate HTTP stack,
        // tracing, etc.
        todo!()
    }
}
```
Runtime integration:
```rust
let crawler = CrawlerBuilder::new(MySpider)
    .downloader(MyDownloader)
    .build()
    .await?;
```
For trait details and lower-level integration, see `spider-downloader`.
## Middleware Usage
Core middleware (always available):
- `RateLimitMiddleware`: controls request throughput.
- `RetryMiddleware`: retries failed/transient requests.
- `RefererMiddleware`: populates the `Referer` header for follow-up requests.
Optional middleware (feature-gated):
- `HttpCacheMiddleware` (`middleware-cache`)
- `AutoThrottleMiddleware` (`middleware-autothrottle`)
- `ProxyMiddleware` (`middleware-proxy`)
- `UserAgentMiddleware` (`middleware-user-agent`)
- `RobotsTxtMiddleware` (`middleware-robots`)
- `CookieMiddleware` (`middleware-cookies`)
```rust
// Constructors shown with Default for brevity; see spider-middleware
// for the real configuration options.
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(RetryMiddleware::default())
    .add_middleware(RateLimitMiddleware::default())
    .add_middleware(RefererMiddleware::default())
    .build()
    .await?;
```
### Build a Custom Middleware
Trait contract (`Middleware<C>`): see `spider-middleware` for the full set of hooks.

Minimal implementation (override only what you need):

```rust
use async_trait::async_trait;
use spider_lib::prelude::*;

// Hook names and the import path are illustrative; check spider-middleware
// for the exact trait definition.
pub struct MyMiddleware;

#[async_trait]
impl<C> Middleware<C> for MyMiddleware {
    async fn before_download(&self, request: Request, _ctx: &C) -> Result<Request, Error> {
        println!("requesting: {:?}", request);
        Ok(request)
    }
}
```
Runtime integration:
```rust
let crawler = CrawlerBuilder::new(MySpider)
    .add_middleware(MyMiddleware)
    .build()
    .await?;
```
See full per-feature middleware examples in `spider-middleware`.
## Pipeline Usage
Core pipelines (always available):
- `TransformPipeline`
- `ValidationPipeline`
- `DeduplicationPipeline`
- `ConsolePipeline`
Optional output pipelines (feature-gated):
- `JsonPipeline` (`pipeline-json`)
- `JsonlPipeline` (`pipeline-jsonl`)
- `CsvPipeline` (`pipeline-csv`)
- `SqlitePipeline` (`pipeline-sqlite`)
- `StreamJsonPipeline` (`pipeline-stream-json`)
```rust
// Constructors shown with Default for brevity; see spider-pipeline
// for the real configuration options.
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(TransformPipeline::default())
    .add_pipeline(ValidationPipeline::default())
    .add_pipeline(DeduplicationPipeline::default())
    .add_pipeline(ConsolePipeline::default())
    .build()
    .await?;
```
### Build a Custom Pipeline
Trait contract (`Pipeline<I: ScrapedItem>`): see `spider-pipeline` for the full definition.

Minimal implementation:

```rust
use async_trait::async_trait;
use spider_lib::prelude::*;

// Method name and import path are illustrative; check spider-pipeline
// for the exact trait definition.
pub struct MyPipeline;

#[async_trait]
impl<I: ScrapedItem> Pipeline<I> for MyPipeline {
    async fn process_item(&self, item: I) -> Result<Option<I>, Error> {
        // Return Ok(None) to drop an item instead of passing it on.
        Ok(Some(item))
    }
}
```
Runtime integration:
```rust
let crawler = CrawlerBuilder::new(MySpider)
    .add_pipeline(MyPipeline)
    .build()
    .await?;
```
See full per-feature pipeline examples in `spider-pipeline`.
## Feature Flag Cookbook

### Minimal core crawler

```toml
[dependencies]
spider-lib = "3.0.0"
```
### Robots + JSONL export

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-robots", "pipeline-jsonl"] }
```
### Proxy + user-agent rotation + CSV output

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-proxy", "middleware-user-agent", "pipeline-csv"] }
```
### Cache + autothrottle + SQLite output

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["middleware-cache", "middleware-autothrottle", "pipeline-sqlite"] }
```
### Live stats and resume support

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["live-stats", "checkpoint"] }
```
### Cookie-aware crawling

```toml
[dependencies]
spider-lib = { version = "3.0.0", features = ["cookie-store"] }
```
`cookie-store` enables `middleware-cookies` transitively.
## Development

See CONTRIBUTING.md for build, test, and contribution workflows.
## Documentation
- API docs: https://docs.rs/spider-lib
- Contribution guide: CONTRIBUTING.md
## License
MIT. See LICENSE.