scrapling-spider

The web crawling engine for the scrapling-rs framework. This crate provides the orchestration layer that ties together HTTP fetching, request scheduling, response caching, robots.txt compliance, and checkpoint-based pause/resume into a single crawl loop.

Architecture

A crawl is driven by two main abstractions:

[Spider] -- a trait you implement to define what to crawl. You supply start URLs, a parse function that extracts data and follow-up links from each response, and optional configuration knobs (concurrency, download delay, allowed domains, etc.).
[CrawlerEngine] -- the runtime that executes the crawl loop. It owns a [Scheduler] (priority queue with deduplication), a [SessionManager] (named HTTP sessions), and optional managers for robots.txt, response caching, and checkpointing. You create an engine, hand it your Spider, and call [CrawlerEngine::crawl].
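
To make the "priority queue with deduplication" idea concrete, here is a minimal, standard-library-only sketch. It is not the crate's actual [Scheduler]: the names `SketchScheduler`, `QueuedRequest`, `enqueue`, and `pop` are hypothetical, and the fingerprint is assumed to be a plain hash of the URL string, whereas a real fingerprint would likely also cover the method, body, or a normalized URL.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::{BinaryHeap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical queued request; NOT the crate's `Request` type.
#[derive(Debug, PartialEq, Eq)]
struct QueuedRequest {
    priority: i32,
    seq: u64, // insertion order, used to break priority ties FIFO
    url: String,
}

impl Ord for QueuedRequest {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: higher priority pops first;
        // among equal priorities, the earlier (lower) seq pops first.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}

impl PartialOrd for QueuedRequest {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Illustrative scheduler: a max-heap plus a set of seen fingerprints.
#[derive(Default)]
struct SketchScheduler {
    heap: BinaryHeap<QueuedRequest>,
    seen: HashSet<u64>,
    next_seq: u64,
}

impl SketchScheduler {
    /// Assumed fingerprint: a hash of the raw URL string only.
    fn fingerprint(url: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        url.hash(&mut hasher);
        hasher.finish()
    }

    /// Queues the URL unless its fingerprint was already seen.
    /// Returns `true` if the request was accepted.
    fn enqueue(&mut self, url: &str, priority: i32) -> bool {
        if !self.seen.insert(Self::fingerprint(url)) {
            return false; // duplicate, dropped
        }
        self.heap.push(QueuedRequest {
            priority,
            seq: self.next_seq,
            url: url.to_string(),
        });
        self.next_seq += 1;
        true
    }

    /// Pops the highest-priority pending URL, if any.
    fn pop(&mut self) -> Option<String> {
        self.heap.pop().map(|request| request.url)
    }
}
```

In this sketch a URL that has been seen once is rejected on every later `enqueue`, even after it has been popped, so follow-up links that point back to already-visited pages do not re-enter the crawl.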

Module overview

Module	Purpose
[`spider`]	The `Spider` trait and `CrawlerEngine` crawl loop
[`request`]	`Request`, `Callback`, and `SpiderOutput` types
[`result`]	`CrawlResult`, `CrawlStats`, and `ItemList` output types
[`scheduler`]	Priority-queue scheduler with fingerprint deduplication
[`session`]	`Session` / `SessionManager` for named HTTP backends
[`cache`]	Filesystem-based HTTP response cache for dev mode
[`checkpoint`]	Pause/resume support via JSON snapshots on disk
[`robotstxt`]	Fetching and enforcing robots.txt rules per domain
[`error`]	`SpiderError` enum and `Result` alias
[`logging`]	Thread-safe log-level counter for crawl statistics
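
As an illustration of what a thread-safe log-level counter can look like, the following sketch keeps one atomic counter per level. It is an assumption about the general technique, not the contents of the [`logging`] module: `Level`, `LevelCounter`, `record`, `count`, and `count_from_threads` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical log levels; the crate's own set may differ.
#[derive(Clone, Copy)]
enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

/// One atomic counter per level, so recording needs no lock.
#[derive(Default)]
struct LevelCounter {
    counts: [AtomicUsize; 4],
}

impl LevelCounter {
    /// Increments the counter for `level`.
    fn record(&self, level: Level) {
        self.counts[level as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Reads the current count for `level`.
    fn count(&self, level: Level) -> usize {
        self.counts[level as usize].load(Ordering::Relaxed)
    }
}

/// Records `per_thread` warnings from each of `threads` worker threads
/// and returns the total observed after all workers have finished.
fn count_from_threads(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(LevelCounter::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.record(Level::Warn);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counter.count(Level::Warn)
}
```

Relaxed ordering is enough here because each counter is an independent tally; joining the worker threads is what guarantees the final read sees every increment.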

Quick start

use scrapling_spider::{Spider, CrawlerEngine, Request, SpiderOutput};
use scrapling_fetch::Response;

struct MyScraper;

impl Spider for MyScraper {
    fn name(&self) -> &str { "my_scraper" }
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com".into()]
    }
    fn parse(&self, response: Response) -> Vec<SpiderOutput> {
        // Extract data or follow links here
        vec![]
    }
}

# async fn run() {
let spider = MyScraper;
let mut engine = CrawlerEngine::new(&spider, None, 0.0).unwrap();
let stats = engine.crawl().await.unwrap();
println!("Scraped {} items", stats.items_scraped);
# }

scrapling-spider 0.1.0
