scrapling-spider
The web crawling engine for the scrapling-rs framework. This crate provides the orchestration layer that ties together HTTP fetching, request scheduling, response caching, robots.txt compliance, and checkpoint-based pause/resume into a single crawl loop.
Architecture
A crawl is driven by two main abstractions:
- [Spider] -- a trait you implement to define what to crawl. You supply start URLs, a `parse` function that extracts data and follow-up links from each response, and optional configuration knobs (concurrency, download delay, allowed domains, etc.).
- [CrawlerEngine] -- the runtime that executes the crawl loop. It owns a [Scheduler] (priority queue with deduplication), a [SessionManager] (named HTTP sessions), and optional managers for robots.txt, response caching, and checkpointing. You create an engine, hand it your `Spider`, and call [CrawlerEngine::crawl].
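To make the split between the two abstractions concrete, here is a minimal, synchronous sketch of the same shape using only the standard library. It is not this crate's API: `Page`, `Output`, `Engine`, `ExampleSpider`, and the closure-based fetcher are hypothetical stand-ins, and the real engine is asynchronous and adds sessions, robots.txt handling, caching, and checkpointing on top of this loop.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical stand-in for a fetched response (not the crate's real type).
pub struct Page {
    pub url: String,
    pub body: String,
}

/// What a parse step can emit: a scraped item or a follow-up URL.
pub enum Output {
    Item(String),
    Follow(String),
}

/// The "what to crawl" side, reduced to its two essential methods.
pub trait Spider {
    fn start_urls(&self) -> Vec<String>;
    fn parse(&self, page: &Page) -> Vec<Output>;
}

/// The "how to crawl" side: a FIFO queue plus a set of already-queued URLs.
/// The fetcher is injected as a closure so the sketch needs no HTTP client.
pub struct Engine<F: Fn(&str) -> Option<String>> {
    fetch: F,
}

impl<F: Fn(&str) -> Option<String>> Engine<F> {
    pub fn new(fetch: F) -> Self {
        Self { fetch }
    }

    /// Runs the crawl loop until the queue is empty and returns all items.
    pub fn crawl<S: Spider>(&self, spider: &S) -> Vec<String> {
        let mut queue = VecDeque::new();
        let mut seen = HashSet::new();
        for url in spider.start_urls() {
            if seen.insert(url.clone()) {
                queue.push_back(url);
            }
        }

        let mut items = Vec::new();
        while let Some(url) = queue.pop_front() {
            // A failed fetch is skipped here; a real engine would record it.
            let Some(body) = (self.fetch)(&url) else { continue };
            let page = Page { url, body };
            for output in spider.parse(&page) {
                match output {
                    Output::Item(item) => items.push(item),
                    Output::Follow(next) => {
                        // Only queue URLs that have not been queued before.
                        if seen.insert(next.clone()) {
                            queue.push_back(next);
                        }
                    }
                }
            }
        }
        items
    }
}

/// Toy spider: emits one item per page and treats every
/// whitespace-separated token in the body as a link to follow.
pub struct ExampleSpider;

impl Spider for ExampleSpider {
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com/".to_string()]
    }

    fn parse(&self, page: &Page) -> Vec<Output> {
        let mut out = vec![Output::Item(format!("visited {}", page.url))];
        out.extend(page.body.split_whitespace().map(|l| Output::Follow(l.to_string())));
        out
    }
}
```

The spider never touches the queue and the engine never inspects page content, which is the separation the two bullets above describe.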
Module overview
| Module | Purpose |
|---|---|
| [spider] | The Spider trait and CrawlerEngine crawl loop |
| [request] | Request, Callback, and SpiderOutput types |
| [result] | CrawlResult, CrawlStats, and ItemList output types |
| [scheduler] | Priority-queue scheduler with fingerprint deduplication |
| [session] | Session / SessionManager for named HTTP backends |
| [cache] | Filesystem-based HTTP response cache for dev mode |
| [checkpoint] | Pause/resume support via JSON snapshots on disk |
| [robotstxt] | Fetching and enforcing robots.txt rules per domain |
| [error] | SpiderError enum and Result alias |
| [logging] | Thread-safe log-level counter for crawl statistics |
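As an illustration of what "priority-queue scheduler with fingerprint deduplication" can mean in practice, the sketch below combines a `BinaryHeap` with a `HashSet` of request fingerprints. It is an assumption-laden example, not the [scheduler] module itself: the `QueuedRequest` fields, the `enqueue`/`pop` method names, and the choice to fingerprint only the method and URL are all hypothetical.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::{BinaryHeap, HashSet};
use std::hash::{Hash, Hasher};

/// A queued request (hypothetical shape). `seq` records insertion order.
#[derive(Debug)]
pub struct QueuedRequest {
    pub priority: i32,
    seq: u64,
    pub method: String,
    pub url: String,
}

// Equality and ordering look only at (priority, seq); `seq` is unique
// per scheduler, so this stays consistent.
impl PartialEq for QueuedRequest {
    fn eq(&self, other: &Self) -> bool {
        self.priority == other.priority && self.seq == other.seq
    }
}
impl Eq for QueuedRequest {}

impl Ord for QueuedRequest {
    fn cmp(&self, other: &Self) -> Ordering {
        // BinaryHeap is a max-heap: higher priority pops first, and among
        // equal priorities the earlier (smaller seq) request pops first.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}
impl PartialOrd for QueuedRequest {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Fingerprint of a request: a hash of the upper-cased method and the URL.
/// Assumption: URLs are compared byte-for-byte, with no canonicalization.
fn fingerprint(method: &str, url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    method.to_ascii_uppercase().hash(&mut hasher);
    url.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
pub struct Scheduler {
    heap: BinaryHeap<QueuedRequest>,
    seen: HashSet<u64>,
    next_seq: u64,
}

impl Scheduler {
    pub fn new() -> Self {
        Self::default()
    }

    /// Queues a request; returns `false` if its fingerprint was seen before.
    pub fn enqueue(&mut self, method: &str, url: &str, priority: i32) -> bool {
        if !self.seen.insert(fingerprint(method, url)) {
            return false;
        }
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(QueuedRequest {
            priority,
            seq,
            method: method.to_string(),
            url: url.to_string(),
        });
        true
    }

    /// Removes and returns the highest-priority pending request, if any.
    pub fn pop(&mut self) -> Option<QueuedRequest> {
        self.heap.pop()
    }

    pub fn len(&self) -> usize {
        self.heap.len()
    }

    pub fn is_empty(&self) -> bool {
        self.heap.is_empty()
    }
}
```

Storing 64-bit fingerprints instead of full URLs keeps the seen-set small, at the cost of a tiny chance that two distinct requests collide and one is dropped. A production scheduler might canonicalize URLs first or use a wider hash; whether this crate does either is not stated here.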
Quick start
use ;
use Response;
;
# async