Crate spider_core

§spider-core

The core engine of the spider-lib web scraping framework.

This crate provides the fundamental components for building web scrapers, including the main Crawler, Scheduler, Spider trait, and other essential infrastructure for managing the crawling process.

§Overview

The spider-core crate implements the central orchestration layer of the web scraping framework. It manages the flow of requests and responses, coordinates concurrent operations, and provides the foundation for middleware and pipeline systems.

§Key Components

  • Crawler: The main orchestrator that manages the crawling process
  • Scheduler: Handles request queuing and duplicate detection
  • Spider: Trait defining the interface for custom scraping logic
  • CrawlerBuilder: Fluent API for configuring and building crawlers
  • Middleware: Interceptors for processing requests and responses
  • Pipeline: Processors for scraped items
  • Stats: Collection and reporting of crawl statistics
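The Scheduler's role described above (request queuing plus duplicate detection) can be illustrated with a minimal standard-library sketch. This is not the spider_core::Scheduler API: the type SketchScheduler and its methods enqueue and next_request are hypothetical names chosen for illustration, and the sketch assumes that requests are identified by their URL string alone.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical stand-in for a scheduler; NOT the spider_core::Scheduler API.
/// Assumption: a request is identified only by its URL string.
#[derive(Default)]
struct SketchScheduler {
    /// Pending URLs in first-in, first-out order.
    queue: VecDeque<String>,
    /// Every URL ever accepted, used for duplicate detection.
    seen: HashSet<String>,
}

impl SketchScheduler {
    /// Queues `url` unless it has been seen before.
    /// Returns true if the URL was queued, false if it was a duplicate.
    fn enqueue(&mut self, url: &str) -> bool {
        // HashSet::insert returns false when the value was already present.
        if self.seen.insert(url.to_owned()) {
            self.queue.push_back(url.to_owned());
            true
        } else {
            false
        }
    }

    /// Pops the oldest pending URL, or None when the queue is empty.
    fn next_request(&mut self) -> Option<String> {
        self.queue.pop_front()
    }
}
```

In this sketch, calling enqueue twice with the same URL queues it only once, and next_request returns URLs in the order they were first accepted. A real crawler would typically also normalize URLs and share this state across concurrent tasks, which the sketch leaves out.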

§Usage

Most users will interact with the components re-exported from this crate through the main spider-lib facade. However, this crate can be used independently for fine-grained control over the crawling process.

use spider_core::{Crawler, CrawlerBuilder, Spider, Scheduler};
use spider_util::{request::Request, response::Response, error::SpiderError};
// ParseOutput (used in `parse` below) must also be in scope; its import path is not shown here.

#[derive(Default)]
struct MySpider;

#[spider_macro::scraped_item]
struct MyItem {
    title: String,
    url: String,
}

#[async_trait::async_trait]
impl Spider for MySpider {
    type Item = MyItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://example.com"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Custom parsing logic here
        todo!()
    }
}

async fn run_crawler() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(MySpider).build().await?;
    crawler.start_crawl().await
}

Re-exports§

pub use builder::CrawlerBuilder;
pub use crawler::Crawler;
pub use scheduler::Scheduler;
pub use spider::Spider;
pub use tokio;

Modules§

builder
Builder Module
crawler
Crawler Module
prelude
A “prelude” for users of the spider-core crate.
scheduler
Scheduler Module
spider
Spider Module
state
Module for tracking the operational state of the crawler.
stats
Statistics Module

Structs§

DashMap
DashMap is an implementation of a concurrent associative array/hashmap in Rust.
ReqwestClientDownloader
Concrete implementation of the Downloader trait backed by a reqwest client.

Traits§

Downloader
A trait for HTTP downloaders that fetch web pages and apply middleware.
SimpleHttpClient
A simple HTTP client trait for fetching web content.

Attribute Macros§

async_trait
scraped_item
A procedural macro to derive the ScrapedItem trait.