spider-core 2.0.2

Core functionality for the spider-lib web scraping framework.

spider-core is the runtime heart of the workspace. It owns the crawling loop, the Spider trait, the builder used to compose a crawler, the scheduler, shared state, and runtime stats.

Most applications should still start with spider-lib, because the facade crate re-exports the common pieces. spider-core is the crate to reach for when you want tighter control over runtime composition or when you are building extensions against the lower-level API.

When it makes sense to depend on this crate

Use spider-core directly if you are:

  • building on the runtime without the root facade crate
  • integrating a custom downloader, middleware stack, or pipeline stack
  • publishing reusable extensions that should depend on the runtime contracts rather than the application-facing facade

If your goal is simply “write a spider and run it”, spider-lib is usually more convenient.

Installation

[dependencies]
spider-core = "2.0.2"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

You only need serde and serde_json when you use #[scraped_item].

What lives here

The main exports are:

  • Spider for crawl logic
  • Crawler for the runtime handle
  • CrawlerBuilder for composition and configuration
  • Scheduler for request admission and deduplication
  • shared state primitives such as counters and concurrent maps
  • StatCollector for runtime statistics

The runtime loop is intentionally simple:

  1. Spider::start_requests seeds the crawl.
  2. Requests go through scheduling and deduplication.
  3. The downloader fetches responses.
  4. Middleware can alter requests, responses, or retry behavior.
  5. Spider::parse returns a ParseOutput containing items and follow-up requests.
  6. Pipelines process emitted items.
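The steps above can be sketched with plain Rust types. This is a toy model of the loop, not the crate's API: `Request`, `Response`, `ParseOutput`, `fetch`, and `parse` here are illustrative stand-ins for the real scheduler, downloader, and pipeline stages.

```rust
use std::collections::{HashSet, VecDeque};

// Toy stand-ins for the runtime types (names are illustrative, not the crate's API).
struct Request { url: String }
struct Response { url: String, body: String }
struct ParseOutput { items: Vec<String>, requests: Vec<Request> }

fn fetch(req: &Request) -> Response {
    // Stubbed downloader: in the real runtime this is an HTTP fetch.
    Response { url: req.url.clone(), body: format!("<html>{}</html>", req.url) }
}

fn parse(resp: &Response) -> ParseOutput {
    // Emit one item per page; follow one link from the seed only.
    let mut requests = Vec::new();
    if resp.url == "https://example.com" {
        requests.push(Request { url: "https://example.com/next".into() });
    }
    ParseOutput { items: vec![resp.body.clone()], requests }
}

fn crawl(seeds: Vec<Request>) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();     // deduplication
    let mut queue: VecDeque<Request> = VecDeque::new(); // scheduling
    let mut collected = Vec::new();                     // pipeline output
    for seed in seeds {
        if seen.insert(seed.url.clone()) {
            queue.push_back(seed);
        }
    }
    while let Some(req) = queue.pop_front() {
        let resp = fetch(&req);           // step 3: download
        let out = parse(&resp);           // step 5: parse
        collected.extend(out.items);      // step 6: pipelines consume items
        for next in out.requests {        // follow-ups re-enter scheduling
            if seen.insert(next.url.clone()) {
                queue.push_back(next);
            }
        }
    }
    collected
}
```

The real runtime runs these stages concurrently and threads middleware between them, but the admission/dedup/fetch/parse/emit shape is the same.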

API landmarks

If you are skimming docs.rs, these are the most useful entry points:

  • Spider: define crawl behavior
  • StartRequests: describe how the crawl is seeded
  • CrawlerBuilder: tune concurrency and attach middleware/pipelines
  • Crawler: start and monitor the running crawl
  • StatCollector: inspect runtime stats
  • state::*: thread-safe primitives for shared parse-time state
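As a rough illustration of what shared parse-time state looks like, here is a minimal sketch built on `std::sync` primitives. The actual `state::*` types may differ; `CrawlState`, `pages_seen`, and `titles` are names invented for this example.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

// Stand-ins for the kind of primitives state::* provides: an atomic
// counter and a concurrent map, cheaply cloneable and shared across
// concurrent parse calls.
#[derive(Clone, Default)]
struct CrawlState {
    pages_seen: Arc<AtomicUsize>,
    titles: Arc<Mutex<HashMap<String, String>>>,
}

impl CrawlState {
    fn record(&self, url: &str, title: &str) {
        self.pages_seen.fetch_add(1, Ordering::Relaxed);
        self.titles
            .lock()
            .unwrap()
            .insert(url.to_string(), title.to_string());
    }
}
```

Because the fields are `Arc`-backed, every clone handed to a parse call observes the same counters and map.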

Minimal example

use spider_core::{async_trait, CrawlerBuilder, Spider};
use spider_util::{error::SpiderError, item::ParseOutput, response::Response};

#[spider_macro::scraped_item]
struct Item {
    title: String,
}

#[derive(Clone, Default)]
struct State;

struct MySpider;

#[async_trait]
impl Spider for MySpider {
    type Item = Item;
    type State = State;

    fn start_requests(&self) -> Result<spider_core::StartRequests<'_>, SpiderError> {
        Ok(spider_core::StartRequests::Urls(vec!["https://example.com"]))
    }

    async fn parse(
        &self,
        _response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        Ok(ParseOutput::new())
    }
}

async fn run() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(MySpider)
        .limit(1)
        .build()
        .await?;

    crawler.start_crawl().await
}

limit(1) is handy for previews and smoke runs because it stops after the first admitted item.

Where decisions usually belong

  • Use Spider::start_urls for simple static seeds.
  • Use Spider::start_requests when seeds need full Request values, metadata, or file-backed loading.
  • Use middleware for HTTP lifecycle policy.
  • Use pipelines for item lifecycle policy.
  • Use a custom downloader when transport execution itself must change.
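To make the "pipelines for item lifecycle policy" point concrete, here is a hypothetical pipeline that drops items with short titles. The `Pipeline` trait, `MinTitleLen` type, and `Item` struct are invented for illustration; the crate's real pipeline contract may look different.

```rust
// A hypothetical pipeline shape: item-level policy lives here, not in
// middleware (which handles the HTTP lifecycle) or in the spider's parse.
trait Pipeline<I> {
    // Returning None drops the item; Some passes it (possibly modified) on.
    fn process(&self, item: I) -> Option<I>;
}

struct Item { title: String }

// Drop items whose title is shorter than the configured minimum.
struct MinTitleLen(usize);

impl Pipeline<Item> for MinTitleLen {
    fn process(&self, item: Item) -> Option<Item> {
        if item.title.len() >= self.0 { Some(item) } else { None }
    }
}
```

Keeping validation like this in a pipeline keeps `Spider::parse` focused on extraction and makes the policy reusable across spiders.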

Feature flags

Feature       Purpose
core          Base runtime support. Enabled by default.
live-stats    In-place terminal statistics display.
checkpoint    Checkpoint and resume support.
cookie-store  cookie_store integration in core state.
Enable optional features in your Cargo manifest:

[dependencies]
spider-core = { version = "2.0.2", features = ["checkpoint"] }

Practical note

If you want a full working example instead of a runtime skeleton, the repository-level books example is the best reference point. It uses the facade crate, but the runtime flow is the same one spider-core drives.

Related crates

  • spider-lib — the application-facing facade that re-exports the common runtime pieces
  • spider-util — shared types used throughout, such as SpiderError, ParseOutput, and Response
  • spider-macro — provides the #[scraped_item] attribute used on item structs
License

MIT. See LICENSE.