Crate scrapling_spider

§scrapling-spider

The web crawling engine for the scrapling-rs framework. This crate provides the orchestration layer that ties together HTTP fetching, request scheduling, response caching, robots.txt compliance, and checkpoint-based pause/resume into a single crawl loop.

§Architecture

A crawl is driven by two main abstractions:

  • Spider – a trait you implement to define what to crawl. You supply start URLs, a parse function that extracts data and follow-up links from each response, and optional configuration knobs (concurrency, download delay, allowed domains, etc.).

  • CrawlerEngine – the runtime that executes the crawl loop. It owns a Scheduler (priority queue with deduplication), a SessionManager (named HTTP sessions), and optional managers for robots.txt, response caching, and checkpointing. You create an engine, hand it your Spider, and call CrawlerEngine::crawl.
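The "priority queue with deduplication" role of the Scheduler can be illustrated with a small standard-library sketch. This is not the crate's implementation: `SketchScheduler`, `fingerprint`, `enqueue`, and `dequeue` are hypothetical names, and hashing the method plus URL is only one plausible way to build a fingerprint. The real Scheduler may order and deduplicate requests differently.

```rust
use std::cmp::Reverse;
use std::collections::hash_map::DefaultHasher;
use std::collections::{BinaryHeap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical fingerprint: a hash of the HTTP method and the URL.
fn fingerprint(method: &str, url: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    method.hash(&mut hasher);
    url.hash(&mut hasher);
    hasher.finish()
}

/// Illustrative scheduler: a max-heap keyed by priority, plus a set of
/// fingerprints that have already been queued.
#[derive(Default)]
struct SketchScheduler {
    // (priority, insertion order reversed, url): higher priority pops first,
    // and equal priorities pop in insertion (FIFO) order.
    heap: BinaryHeap<(i32, Reverse<u64>, String)>,
    seen: HashSet<u64>,
    next_seq: u64,
}

impl SketchScheduler {
    /// Queues a GET request for `url`. Returns false if it was seen before.
    fn enqueue(&mut self, url: &str, priority: i32) -> bool {
        if !self.seen.insert(fingerprint("GET", url)) {
            return false; // duplicate fingerprint: drop the request
        }
        self.heap
            .push((priority, Reverse(self.next_seq), url.to_string()));
        self.next_seq += 1;
        true
    }

    /// Pops the highest-priority URL, if any.
    fn dequeue(&mut self) -> Option<String> {
        self.heap.pop().map(|(_, _, url)| url)
    }
}

fn main() {
    let mut scheduler = SketchScheduler::default();
    scheduler.enqueue("https://example.com/a", 0);
    scheduler.enqueue("https://example.com/b", 10);
    scheduler.enqueue("https://example.com/a", 5); // duplicate, ignored
    while let Some(url) = scheduler.dequeue() {
        println!("{url}");
    }
}
```

In this sketch a duplicate is rejected even when it arrives with a higher priority; whether the real Scheduler behaves that way is not stated in this documentation.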

§Module overview

Module      Purpose
spider      The Spider trait and CrawlerEngine crawl loop
request     Request, Callback, and SpiderOutput types
result      CrawlResult, CrawlStats, and ItemList output types
scheduler   Priority-queue scheduler with fingerprint deduplication
session     Session / SessionManager for named HTTP backends
cache       Filesystem-based HTTP response cache for dev mode
checkpoint  Pause/resume support via JSON snapshots on disk
robotstxt   Fetching and enforcing robots.txt rules per domain
error       SpiderError enum and Result alias
logging     Thread-safe log-level counter for crawl statistics
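As an illustration of what a thread-safe log-level counter, as listed for the logging module, might look like, the sketch below keeps one atomic counter per level. It is an assumption-laden example, not the module's actual code: `Level`, `LevelCounter`, `record`, and `get` are hypothetical names chosen for this sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical log levels; the discriminant doubles as an array index.
#[derive(Clone, Copy)]
enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

/// Illustrative counter: one atomic per level, so `record` needs only `&self`
/// and can be called from many threads without a lock.
#[derive(Default)]
struct LevelCounter {
    counts: [AtomicU64; 4],
}

impl LevelCounter {
    /// Increments the counter for `level`.
    fn record(&self, level: Level) {
        self.counts[level as usize].fetch_add(1, Ordering::Relaxed);
    }

    /// Reads the current count for `level`.
    fn get(&self, level: Level) -> u64 {
        self.counts[level as usize].load(Ordering::Relaxed)
    }
}

fn main() {
    let counter = Arc::new(LevelCounter::default());
    counter.record(Level::Debug);
    counter.record(Level::Info);

    // Four worker threads each record 100 warnings.
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..100 {
                    counter.record(Level::Warn);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    println!(
        "warnings: {}, errors: {}",
        counter.get(Level::Warn),
        counter.get(Level::Error)
    );
}
```

Relaxed ordering is enough here because each count is an independent statistic and the `join` calls synchronize the threads before the totals are read.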

§Quick start

use scrapling_spider::{Spider, CrawlerEngine, SpiderOutput};
use scrapling_fetch::Response;

struct MyScraper;

impl Spider for MyScraper {
    fn name(&self) -> &str { "my_scraper" }
    // URLs that seed the crawl.
    fn start_urls(&self) -> Vec<String> {
        vec!["https://example.com".into()]
    }
    // Called with each fetched response; returns scraped data and follow-up links.
    fn parse(&self, _response: Response) -> Vec<SpiderOutput> {
        // Extract data or follow links here
        vec![]
    }
}

// `crawl` is awaited, so the engine must be driven from an async context.
async fn run() {
    let spider = MyScraper;
    let mut engine = CrawlerEngine::new(&spider, None, 0.0).unwrap();
    let stats = engine.crawl().await.unwrap();
    println!("Scraped {} items", stats.items_scraped);
}

§Re-exports

pub use error::Result;
pub use error::SpiderError;
pub use request::Callback;
pub use request::Request;
pub use request::SpiderOutput;
pub use result::CrawlResult;
pub use result::CrawlStats;
pub use result::ItemList;
pub use scheduler::Scheduler;
pub use session::Session;
pub use session::SessionManager;
pub use spider::CrawlerEngine;
pub use spider::Spider;

§Modules

cache
Filesystem-based HTTP response cache for development mode.
checkpoint
Pause/resume support via JSON checkpoint files.
error
Error types for the spider crate.
logging
Thread-safe log-level counter for crawl diagnostics.
request
Request and callback types for the spider crawl pipeline.
result
Crawl output types: statistics, scraped items, and the final result.
robotstxt
Robots.txt fetching, parsing, and enforcement.
scheduler
Priority-queue request scheduler with fingerprint-based deduplication.
session
Named HTTP session management for the crawler.
spider
The Spider trait and CrawlerEngine – the heart of scrapling-spider.