§spider-core
The core engine of the spider-lib web scraping framework.
This crate provides the fundamental components for building web scrapers,
including the main Crawler, Scheduler, Spider trait, and other
essential infrastructure for managing the crawling process.
§Overview
The spider-core crate implements the central orchestration layer of the
web scraping framework. It manages the flow of requests and responses,
coordinates concurrent operations, and provides the foundation for
middleware and pipeline systems.
§Key Components
- Crawler: The main orchestrator that manages the crawling process
- Scheduler: Handles request queuing and duplicate detection
- Spider: Trait defining the interface for custom scraping logic
- CrawlerBuilder: Fluent API for configuring and building crawlers
- Middleware: Interceptors for processing requests and responses
- Pipeline: Processors for scraped items
- Stats: Collection and reporting of crawl statistics
§Usage
Most users will interact with the components re-exported from this crate
through the main spider-lib facade. However, this crate can be used
independently for fine-grained control over the crawling process.
use spider_core::{Crawler, CrawlerBuilder, ParseOutput, Spider, Scheduler};
use spider_util::{request::Request, response::Response, error::SpiderError};

#[derive(Default)]
struct MySpider;

#[spider_macro::scraped_item]
struct MyItem {
    title: String,
    url: String,
}

#[async_trait::async_trait]
impl Spider for MySpider {
    type Item = MyItem;

    fn start_urls(&self) -> Vec<&'static str> {
        vec!["https://example.com"]
    }

    async fn parse(&mut self, response: Response) -> Result<ParseOutput<Self::Item>, SpiderError> {
        // Custom parsing logic here
        todo!()
    }
}

async fn run_crawler() -> Result<(), SpiderError> {
    let crawler = CrawlerBuilder::new(MySpider).build().await?;
    crawler.start_crawl().await
}
Re-exports§
pub use builder::CrawlerBuilder;
pub use crawler::Crawler;
pub use scheduler::Scheduler;
pub use spider::Spider;
pub use tokio;
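Because tokio is re-exported (see the list above), a binary crate can reach the runtime entry macro through spider-core without declaring its own tokio dependency. A minimal sketch, assuming the run_crawler function from the Usage example:

```rust
use spider_util::error::SpiderError;

// Sketch: the re-exported tokio provides the runtime attribute macro,
// so the entry point can be reached via the spider_core path.
#[spider_core::tokio::main]
async fn main() -> Result<(), SpiderError> {
    // run_crawler is the helper defined in the Usage example above.
    run_crawler().await
}
```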
Modules§
- builder
- Builder Module
- crawler
- Crawler Module
- prelude
- A “prelude” for users of the spider-core crate.
- scheduler
- Scheduler Module
- spider
- Spider Module
- state
- Module for tracking the operational state of the crawler.
- stats
- Statistics Module
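If the prelude module follows the usual Rust prelude convention, the individual imports in the Usage example collapse into a single glob. A sketch, assuming the prelude re-exports at least the items listed on this page:

```rust
// Sketch: assumes spider_core::prelude re-exports the core items
// (Crawler, CrawlerBuilder, Spider, Scheduler) documented above.
use spider_core::prelude::*;
```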
Structs§
- DashMap
- DashMap is an implementation of a concurrent associative array/hashmap in Rust.
- ReqwestClientDownloader
- Concrete implementation of Downloader using the reqwest client
Traits§
- Downloader
- A trait for HTTP downloaders that can fetch web pages and apply middleware
- SimpleHttpClient
- A simple HTTP client trait for fetching web content.
Attribute Macros§
- async_trait
- scraped_item
- A procedural macro to derive the ScrapedItem trait.