Crate scrapling_spider

Expand description

§scrapling-spider

The web crawling engine for the scrapling-rs framework. This crate provides the orchestration layer that ties together HTTP fetching, request scheduling, response caching, robots.txt compliance, and checkpoint-based pause/resume into a single crawl loop.

§Architecture

A crawl is driven by two main abstractions:

Spider – a trait you implement to define what to crawl. You supply start URLs, a parse function that extracts data and follow-up links from each response, and optional configuration knobs (concurrency, download delay, allowed domains, etc.).
CrawlerEngine – the runtime that executes the crawl loop. It owns a Scheduler (priority queue with deduplication), a SessionManager (named HTTP sessions), and optional managers for robots.txt, response caching, and checkpointing. You create an engine, hand it your Spider, and call CrawlerEngine::crawl.

§Module overview

Module	Purpose
`spider`	The `Spider` trait and `CrawlerEngine` crawl loop
`request`	`Request`, `Callback`, and `SpiderOutput` types
`result`	`CrawlResult`, `CrawlStats`, and `ItemList` output types
`scheduler`	Priority-queue scheduler with fingerprint deduplication
`session`	`Session` / `SessionManager` for named HTTP backends
`cache`	Filesystem-based HTTP response cache for dev mode
`checkpoint`	Pause/resume support via JSON snapshots on disk
`robotstxt`	Fetching and enforcing robots.txt rules per domain
`error`	`SpiderError` enum and `Result` alias
`logging`	Thread-safe log-level counter for crawl statistics

§Quick start

use scrapling_spider::{Spider, CrawlerEngine, Request, SpiderOutput};
use scrapling_fetch::Response;

struct MyScraper;

impl Spider for MyScraper {
    fn name(&self) -> &str { "my_scraper" }
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com".into()]
    }
    fn parse(&self, response: Response) -> Vec<SpiderOutput> {
        // Extract data or follow links here
        vec![]
    }
}

let spider = MyScraper;
let mut engine = CrawlerEngine::new(&spider, None, 0.0).unwrap();
let stats = engine.crawl().await.unwrap();
println!("Scraped {} items", stats.items_scraped);

Re-exports§

pub use error::Result;
pub use error::SpiderError;
pub use request::Callback;
pub use request::Request;
pub use request::SpiderOutput;
pub use result::CrawlResult;
pub use result::CrawlStats;
pub use result::ItemList;
pub use scheduler::Scheduler;
pub use session::Session;
pub use session::SessionManager;
pub use spider::CrawlerEngine;
pub use spider::Spider;

Modules§

cache: Filesystem-based HTTP response cache for development mode.
checkpoint: Pause/resume support via JSON checkpoint files.
error: Error types for the spider crate.
logging: Thread-safe log-level counter for crawl diagnostics.
request: Request and callback types for the spider crawl pipeline.
result: Crawl output types: statistics, scraped items, and the final result.
robotstxt: Robots.txt fetching, parsing, and enforcement.
scheduler: Priority-queue request scheduler with fingerprint-based deduplication.
session: Named HTTP session management for the crawler.
spider: The Spider trait and CrawlerEngine – the heart of scrapling-spider.

Crate scrapling_spider

Crate scrapling_spider Copy item path

§scrapling-spider

§Architecture

§Module overview

§Quick start

Re-exports§

Modules§

Crate scrapling_spider