
Crate spider_lib


§spider-lib

spider-lib is the easiest way to use this workspace as an application framework. It re-exports the crawler runtime, common middleware and pipelines, shared request and response types, and the #[scraped_item] macro behind one crate.

If you want the lower-level pieces individually, the workspace also exposes spider-core, spider-middleware, spider-pipeline, spider-downloader, spider-macro, and spider-util. Most users should start here.

§What you get from the facade crate

The root crate is optimized for application authors:

  • prelude re-exports the common types needed to define and run a spider
  • Spider describes crawl behavior
  • CrawlerBuilder assembles the runtime
  • Request, Response, and ParseOutput are the core runtime data types
  • middleware and pipelines can be enabled with feature flags and then added through the builder

§Installation

[dependencies]
spider-lib = "3.0.2"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"

serde and serde_json are required when you use #[scraped_item].

§Quick start

use spider_lib::prelude::*;

#[scraped_item]
struct Quote {
    text: String,
    author: String,
}

struct QuotesSpider;

#[async_trait]
impl Spider for QuotesSpider {
    type Item = Quote;
    type State = ();

    fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
        Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
    }

    async fn parse(
        &self,
        response: Response,
        _state: &Self::State,
    ) -> Result<ParseOutput<Self::Item>, SpiderError> {
        let html = response.to_html()?;
        let mut output = ParseOutput::new();

        for quote in html.select(&".quote".to_selector()?) {
            let text = quote
                .select(&".text".to_selector()?)
                .next()
                .map(|node| node.text().collect::<String>())
                .unwrap_or_default();

            let author = quote
                .select(&".author".to_selector()?)
                .next()
                .map(|node| node.text().collect::<String>())
                .unwrap_or_default();

            output.add_item(Quote { text, author });
        }

        Ok(output)
    }
}

#[tokio::main]
async fn main() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
    crawler.start_crawl().await
}

Spider::parse takes &self and a separate shared state parameter. That design keeps the spider itself immutable while still allowing concurrent parsing with user-defined shared state.
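The same split can be illustrated with a plain-Rust sketch that does not depend on spider-lib at all; the trait and type names here are hypothetical stand-ins chosen only to show why an immutable &self plus a separate state parameter allows lock-free concurrent parsing:

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for the Spider contract: the parser itself is
// borrowed immutably, while any shared data lives in an associated State.
trait Parser {
    type State;
    fn parse(&self, input: &str, state: &Self::State) -> usize;
}

struct CountingParser;

impl Parser for CountingParser {
    // Shared read-only state: keywords every concurrent parse may consult.
    type State = Vec<String>;

    fn parse(&self, input: &str, state: &Self::State) -> usize {
        state.iter().filter(|kw| input.contains(kw.as_str())).count()
    }
}

fn main() {
    let parser = Arc::new(CountingParser);
    let state = Arc::new(vec!["quote".to_string(), "author".to_string()]);

    // Because parse takes &self, the same parser value can be shared
    // across many tasks without a mutex.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let p = Arc::clone(&parser);
            let s = Arc::clone(&state);
            thread::spawn(move || p.parse("a quote by an author", &s))
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 2);
    }
}
```

If parse instead took &mut self, every concurrent task would need exclusive access to the spider; keeping mutable data in State pushes that decision to the user.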

§Typical next steps

After the minimal spider works, the next additions are usually:

  1. add one or more middleware with CrawlerBuilder::add_middleware
  2. add one or more pipelines with CrawlerBuilder::add_pipeline
  3. move repeated parse-time state into Spider::State
  4. enable optional features such as live-stats, pipeline-csv, or middleware-robots
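Assuming the feature names above map directly to Cargo features on the crate (a sketch, not verified against the crate's manifest), step 4 would look like this in Cargo.toml:

```toml
[dependencies]
spider-lib = { version = "3.0.2", features = ["live-stats", "pipeline-csv", "middleware-robots"] }
```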

If you find yourself needing transport-level customization, custom middleware contracts, or lower-level runtime control, move down to the crate-specific APIs in spider-core, spider-downloader, spider-middleware, or spider-pipeline.

Re-exports§

pub use prelude::*;

Modules§

prelude
Convenient re-exports for spider-lib applications.

Attribute Macros§

scraped_item
Attribute macro for defining a scraped item type.