
Crate spider_lib


§spider-lib

spider-lib is the easiest way to use this workspace as an application framework. It re-exports the crawler runtime, common middleware and pipelines, shared request and response types, and the #[scraped_item] macro behind one crate.

If you want the lower-level pieces individually, the workspace also exposes spider-core, spider-middleware, spider-pipeline, spider-downloader, spider-macro, and spider-util. Most users should start here.

§What you get from the facade crate

The root crate is optimized for application authors:

  • prelude re-exports the common types needed to define and run a spider
  • Spider describes crawl behavior
  • CrawlerBuilder assembles the runtime
  • Request, Response, and ParseOutput are the core runtime data types
  • middleware and pipelines can be enabled with feature flags and then added through the builder

§Installation

[dependencies]
spider-lib = "3.0.2"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

serde and serde_json are required when you use #[scraped_item].

§Quick start

use spider_lib::prelude::*;

#[scraped_item]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;
    type State = ();

    fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
        Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
    }

    async fn parse(
        &self,
        response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        for quote in html.select(&".quote".to_selector()?) {
            let text = quote
                .select(&".text".to_selector()?)
                .next()
                .map(|node| node.text().collect::<String>())
                .unwrap_or_default();

            let author = quote
                .select(&".author".to_selector()?)
                .next()
                .map(|node| node.text().collect::<String>())
                .unwrap_or_default();

            output.add_item(Quote { text, author });
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
    crawler.start_crawl().await
}

Spider::parse takes &self and a separate shared state parameter. That design keeps the spider itself immutable while still allowing concurrent parsing with user-defined shared state.
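The same split can be illustrated with a plain-Rust sketch that does not depend on spider-lib at all; the trait and type names here are hypothetical stand-ins chosen only to show why an immutable &self plus a separate state parameter allows lock-free concurrent parsing:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the Spider contract: the parser itself is
// borrowed immutably, while any shared data lives in an associated State.
trait Parser {
    type State;
    fn parse(&self, input: &str, state: &Self::State) -> usize;
}

struct CountingParser;

impl Parser for CountingParser {
    // Shared read-only state: keywords every concurrent parse may consult.
    type State = Vec<String>;

    fn parse(&self, input: &str, state: &Self::State) -> usize {
        state.iter().filter(|kw| input.contains(kw.as_str())).count()
    }
}

fn main() {
    let parser = Arc::new(CountingParser);
    let state = Arc::new(vec!["quote".to_string(), "author".to_string()]);

    // Because parse takes &self, the same parser value can be shared
    // across many tasks without a mutex.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let p = Arc::clone(&parser);
            let s = Arc::clone(&state);
            thread::spawn(move || p.parse("a quote by an author", &s))
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 2);
    }
}
```

If parse instead took &mut self, every concurrent task would need exclusive access to the spider; keeping mutable data in State pushes that decision to the user.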

§Typical next steps

After the minimal spider works, the next additions are usually:

  1. add one or more middleware with CrawlerBuilder::add_middleware
  2. add one or more pipelines with CrawlerBuilder::add_pipeline
  3. move repeated parse-time state into Spider::State
  4. enable optional features such as live-stats, pipeline-csv, or middleware-robots
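Assuming the feature names above map directly to Cargo features on the crate (a sketch, not verified against the crate's manifest), step 4 would look like this in Cargo.toml:

```toml
[dependencies]
spider-lib = { version = "3.0.2", features = ["live-stats", "pipeline-csv", "middleware-robots"] }
```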

If you find yourself needing transport-level customization, custom middleware contracts, or lower-level runtime control, move down to the crate-specific APIs in spider-core, spider-downloader, spider-middleware, or spider-pipeline.

Re-exports§

pub use prelude::*;

Modules§

prelude
Convenient re-exports for spider-lib applications.

Attribute Macros§

scraped_item
Attribute macro for defining a scraped item type.