§scrapling-spider
The web crawling engine for the scrapling-rs framework. This crate provides the orchestration layer that ties together HTTP fetching, request scheduling, response caching, robots.txt compliance, and checkpoint-based pause/resume into a single crawl loop.
§Architecture
A crawl is driven by two main abstractions:
- Spider – a trait you implement to define what to crawl. You supply start URLs, a parse function that extracts data and follow-up links from each response, and optional configuration knobs (concurrency, download delay, allowed domains, etc.).
- CrawlerEngine – the runtime that executes the crawl loop. It owns a Scheduler (priority queue with deduplication), a SessionManager (named HTTP sessions), and optional managers for robots.txt, response caching, and checkpointing. You create an engine, hand it your Spider, and call CrawlerEngine::crawl.
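To make "priority queue with deduplication" concrete, here is a minimal standard-library sketch of that idea. It is not the crate's `Scheduler`: the names `SketchRequest` and `SketchScheduler`, the URL-only fingerprint, and the first-in-first-out tie-breaking among equal priorities are all assumptions made for illustration.

```rust
use std::cmp::Ordering;
use std::collections::hash_map::DefaultHasher;
use std::collections::{BinaryHeap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a crawl request (not the crate's `Request`).
#[derive(Debug, PartialEq, Eq)]
struct SketchRequest {
    url: String,
    priority: i32,
    seq: u64, // insertion order, used to break priority ties
}

impl Ord for SketchRequest {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher priority wins; among equal priorities the earlier request wins.
        self.priority
            .cmp(&other.priority)
            .then_with(|| other.seq.cmp(&self.seq))
    }
}

impl PartialOrd for SketchRequest {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Hypothetical scheduler: a max-heap plus a set of seen fingerprints.
#[derive(Default)]
struct SketchScheduler {
    queue: BinaryHeap<SketchRequest>,
    seen: HashSet<u64>,
    next_seq: u64,
}

impl SketchScheduler {
    /// Assumed fingerprint: a hash of the URL only.
    fn fingerprint(url: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        url.hash(&mut hasher);
        hasher.finish()
    }

    /// Returns `true` if the request was queued, `false` if it was a duplicate.
    fn enqueue(&mut self, url: &str, priority: i32) -> bool {
        if !self.seen.insert(Self::fingerprint(url)) {
            return false;
        }
        let seq = self.next_seq;
        self.next_seq += 1;
        self.queue.push(SketchRequest { url: url.to_string(), priority, seq });
        true
    }

    /// Pops the highest-priority pending request, if any.
    fn next_request(&mut self) -> Option<SketchRequest> {
        self.queue.pop()
    }
}
```

A real fingerprint would likely also account for the HTTP method, body, or normalized query string; hashing only the URL keeps the sketch short.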
§Module overview
| Module | Purpose |
|---|---|
| spider | The Spider trait and CrawlerEngine crawl loop |
| request | Request, Callback, and SpiderOutput types |
| result | CrawlResult, CrawlStats, and ItemList output types |
| scheduler | Priority-queue scheduler with fingerprint deduplication |
| session | Session / SessionManager for named HTTP backends |
| cache | Filesystem-based HTTP response cache for dev mode |
| checkpoint | Pause/resume support via JSON snapshots on disk |
| robotstxt | Fetching and enforcing robots.txt rules per domain |
| error | SpiderError enum and Result alias |
| logging | Thread-safe log-level counter for crawl statistics |
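As an illustration of what a "thread-safe log-level counter" can look like, the sketch below keeps one atomic counter per level so concurrent crawl tasks can record events without a lock. The `Level` enum, the `LevelCounter` type, and the `count_from_threads` helper are hypothetical names for this example and are not taken from the `logging` module.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical log levels; the discriminant doubles as an array index.
#[derive(Clone, Copy)]
enum Level {
    Debug = 0,
    Info = 1,
    Warning = 2,
    Error = 3,
}

/// Hypothetical counter: one atomic slot per level, no mutex required.
#[derive(Default)]
struct LevelCounter {
    counts: [AtomicU64; 4],
}

impl LevelCounter {
    /// Records one log event at the given level.
    fn record(&self, level: Level) {
        self.counts[level as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Reads the current count for the given level.
    fn count(&self, level: Level) -> u64 {
        self.counts[level as usize].load(Ordering::Relaxed)
    }
}

/// Spawns `threads` workers that each record `per_thread` warnings,
/// then returns the total observed after all workers have finished.
fn count_from_threads(threads: usize, per_thread: usize) -> u64 {
    let counter = Arc::new(LevelCounter::default());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.record(Level::Warning);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counter.count(Level::Warning)
}
```

Relaxed ordering is enough here because each slot is an independent tally, and joining the worker threads guarantees their increments are visible before the final read.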
§Quick start
    use scrapling_spider::{Spider, CrawlerEngine, Request, SpiderOutput};
    use scrapling_fetch::Response;

    struct MyScraper;

    impl Spider for MyScraper {
        fn name(&self) -> &str {
            "my_scraper"
        }

        fn start_urls(&self) -> Vec<String> {
            vec!["https://example.com".into()]
        }

        fn parse(&self, response: Response) -> Vec<SpiderOutput> {
            // Extract data or follow links here
            vec![]
        }
    }

    // `crawl` is awaited, so these calls must run inside an async context.
    async fn run() {
        let spider = MyScraper;
        let mut engine = CrawlerEngine::new(&spider, None, 0.0).unwrap();
        let stats = engine.crawl().await.unwrap();
        println!("Scraped {} items", stats.items_scraped);
    }

Re-exports§
- pub use error::Result;
- pub use error::SpiderError;
- pub use request::Callback;
- pub use request::Request;
- pub use request::SpiderOutput;
- pub use result::CrawlResult;
- pub use result::CrawlStats;
- pub use result::ItemList;
- pub use scheduler::Scheduler;
- pub use session::Session;
- pub use session::SessionManager;
- pub use spider::CrawlerEngine;
- pub use spider::Spider;
Modules§
- cache
- Filesystem-based HTTP response cache for development mode.
- checkpoint
- Pause/resume support via JSON checkpoint files.
- error
- Error types for the spider crate.
- logging
- Thread-safe log-level counter for crawl diagnostics.
- request
- Request and callback types for the spider crawl pipeline.
- result
- Crawl output types: statistics, scraped items, and the final result.
- robotstxt
- Robots.txt fetching, parsing, and enforcement.
- scheduler
- Priority-queue request scheduler with fingerprint-based deduplication.
- session
- Named HTTP session management for the crawler.
- spider
- The Spider trait and CrawlerEngine – the heart of scrapling-spider.
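The robots.txt enforcement mentioned for the robotstxt module can be pictured with a deliberately simplified sketch. It handles only `Disallow` prefixes in groups that apply to `User-agent: *`, and it ignores `Allow`, wildcards, and crawl delays. The function names `disallowed_prefixes` and `is_allowed` are hypothetical and do not describe the module's actual API.

```rust
/// Collects the `Disallow` path prefixes that apply to all user agents.
/// Simplified: no `Allow`, no wildcard patterns, no per-bot groups.
fn disallowed_prefixes(robots_txt: &str) -> Vec<String> {
    let mut applies = false; // does the current group include `User-agent: *`?
    let mut in_agent_run = false; // are we inside consecutive User-agent lines?
    let mut prefixes = Vec::new();

    for raw in robots_txt.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else {
            continue; // blank or malformed line
        };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim();

        match key.as_str() {
            "user-agent" => {
                // Consecutive User-agent lines belong to the same group.
                applies = (in_agent_run && applies) || value == "*";
                in_agent_run = true;
            }
            other => {
                in_agent_run = false;
                // An empty Disallow value means "nothing is disallowed".
                if other == "disallow" && applies && !value.is_empty() {
                    prefixes.push(value.to_string());
                }
            }
        }
    }
    prefixes
}

/// Returns `true` when `path` matches none of the disallowed prefixes.
fn is_allowed(prefixes: &[String], path: &str) -> bool {
    !prefixes.iter().any(|prefix| path.starts_with(prefix.as_str()))
}
```

A crawler would typically fetch and parse the file once per domain and then call a check like `is_allowed` before scheduling each request to that domain.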