# spider-core
spider-core is the runtime heart of the workspace. It owns the crawling loop, the Spider trait, the builder used to compose a crawler, the scheduler, shared state, and runtime stats.
Most applications should still start with spider-lib, because the facade crate re-exports the common pieces. spider-core is the crate to reach for when you want tighter control over runtime composition or when you are building extensions against the lower-level API.
## When it makes sense to depend on this crate
Use spider-core directly if you are:
- building on the runtime without the root facade crate
- integrating a custom downloader, middleware stack, or pipeline stack
- publishing reusable extensions that should depend on the runtime contracts rather than the application-facing facade
If your goal is simply “write a spider and run it”, spider-lib is usually more convenient.
## Installation

```toml
[dependencies]
spider-core = "2.0.2"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
```
You only need `serde` and `serde_json` when you use `#[scraped_item]`.
## What lives here
The main exports are:
- `Spider` for crawl logic
- `Crawler` for the runtime handle
- `CrawlerBuilder` for composition and configuration
- `Scheduler` for request admission and deduplication
- shared state primitives such as counters and concurrent maps
- `StatCollector` for runtime statistics
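The shared state primitives exist so parse callbacks running concurrently can update counters and maps safely. A conceptual stand-in using only the standard library (the real `state::*` types in spider-core are richer; everything below is illustrative):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Simulate four concurrent parse callbacks bumping a shared page counter
// and a shared per-host map, then return the final counter value.
fn run_counters() -> usize {
    let pages_seen = Arc::new(AtomicUsize::new(0));
    let per_host: Arc<Mutex<HashMap<String, usize>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let pages_seen = Arc::clone(&pages_seen);
            let per_host = Arc::clone(&per_host);
            thread::spawn(move || {
                pages_seen.fetch_add(1, Ordering::Relaxed);
                *per_host
                    .lock()
                    .unwrap()
                    .entry("example.com".into())
                    .or_insert(0) += 1;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    pages_seen.load(Ordering::Relaxed)
}

fn main() {
    println!("{}", run_counters());
}
```

The same pattern — atomics for counters, a lock-guarded map for keyed state — is what "thread-safe primitives for shared parse-time state" buys you without hand-rolling synchronization in every spider.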
The runtime loop is intentionally simple:
- `Spider::start_requests` seeds the crawl.
- Requests go through scheduling and deduplication.
- The downloader fetches responses.
- Middleware can alter requests, responses, or retry behavior.
- `Spider::parse` returns a `ParseOutput` containing items and follow-up requests.
- Pipelines process emitted items.
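The loop above can be modeled in a few lines of plain Rust. This is a conceptual sketch only — the real scheduler, downloader, and pipelines are separate pluggable components, and no spider-core types appear here:

```rust
use std::collections::{HashSet, VecDeque};

// Seeds enter a queue (start_requests), a HashSet stands in for the
// scheduler's deduplication, a fake fetch stands in for the downloader,
// and the collected Vec stands in for the pipeline stage.
fn crawl(seeds: Vec<String>) -> Vec<String> {
    let mut queue: VecDeque<String> = seeds.into_iter().collect();
    let mut seen: HashSet<String> = HashSet::new();
    let mut items = Vec::new();

    while let Some(url) = queue.pop_front() {
        if !seen.insert(url.clone()) {
            continue; // duplicate request dropped at admission
        }
        // "Download" the page and "parse" it into an item.
        let body = format!("body of {url}");
        items.push(body);
        // parse() may also emit follow-up requests back into scheduling.
        if url == "https://example.com/" {
            queue.push_back("https://example.com/page2".into());
        }
    }
    items
}

fn main() {
    let items = crawl(vec![
        "https://example.com/".into(),
        "https://example.com/".into(), // duplicate seed, deduplicated
    ]);
    println!("{}", items.len());
}
```

Two identical seeds yield one fetch plus one follow-up, so two items reach the "pipeline" — the same admission-then-dedup behavior the scheduler provides.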
## API landmarks
If you are skimming docs.rs, these are the most useful entry points:
- `Spider`: define crawl behavior
- `StartRequests`: describe how the crawl is seeded
- `CrawlerBuilder`: tune concurrency and attach middleware/pipelines
- `Crawler`: start and monitor the running crawl
- `StatCollector`: inspect runtime stats
- `state::*`: thread-safe primitives for shared parse-time state
## Minimal example
A minimal sketch of the shape, assuming a Tokio runtime. Method names and trait signatures here are illustrative, not the crate's exact API — see the `Spider` trait and `CrawlerBuilder` on docs.rs for the real definitions:

```rust
use spider_core::{Crawler, CrawlerBuilder, ParseOutput, Response, Spider};

struct BooksSpider;

impl Spider for BooksSpider {
    // Illustrative signatures; check the real trait on docs.rs.
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com/".into()]
    }

    fn parse(&self, _response: Response) -> ParseOutput {
        // Emit scraped items and follow-up requests here.
        ParseOutput::default()
    }
}

#[tokio::main]
async fn main() {
    let crawler: Crawler = CrawlerBuilder::new(BooksSpider)
        .limit(1) // stop after the first admitted item
        .build();
    crawler.run().await;
}
```
`limit(1)` is handy for previews and smoke runs because it stops after the first admitted item.
## Where decisions usually belong
- Use `Spider::start_urls` for simple static seeds.
- Use `Spider::start_requests` when seeds need full `Request` values, metadata, or file-backed loading.
- Use middleware for HTTP lifecycle policy.
- Use pipelines for item lifecycle policy.
- Use a custom downloader when transport execution itself must change.
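The middleware/pipeline split is easiest to see with two tiny stand-ins: middleware sees requests and responses before items exist, while pipelines only ever see emitted items. All types below are local illustrations, not spider-core's real traits:

```rust
// Middleware: HTTP lifecycle policy (rewrite requests, inspect responses).
trait Middleware {
    fn on_request(&self, url: &mut String);
}

// Pipeline: item lifecycle policy (validate, transform, or drop items).
trait Pipeline {
    fn process(&self, item: String) -> Option<String>;
}

struct ForceHttps;
impl Middleware for ForceHttps {
    fn on_request(&self, url: &mut String) {
        // Upgrade plain-HTTP requests before the downloader sees them.
        if let Some(rest) = url.strip_prefix("http://") {
            *url = format!("https://{rest}");
        }
    }
}

struct DropEmpty;
impl Pipeline for DropEmpty {
    fn process(&self, item: String) -> Option<String> {
        // Returning None discards the item; Some passes it downstream.
        if item.is_empty() { None } else { Some(item) }
    }
}

fn main() {
    let mut url = String::from("http://example.com/");
    ForceHttps.on_request(&mut url);
    println!("{url}");

    println!("{:?}", DropEmpty.process("quote".into()));
    println!("{:?}", DropEmpty.process(String::new()));
}
```

If a policy needs the request or response, it belongs in middleware; if it only needs the scraped item, it belongs in a pipeline. Only when the transport itself must change does a custom downloader become necessary.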
## Feature flags

| Feature | Purpose |
|---|---|
| `core` | Base runtime support. Enabled by default. |
| `live-stats` | In-place terminal statistics display. |
| `checkpoint` | Checkpoint and resume support. |
| `cookie-store` | `cookie_store` integration in core state. |
```toml
[dependencies]
spider-core = { version = "2.0.2", features = ["checkpoint"] }
```
## Practical note
If you want a full working example instead of a runtime skeleton, the repository-level books example is the best reference point. It uses the facade crate, but the runtime flow is the same one spider-core drives.
## Related crates
## License
MIT. See LICENSE.