scrapling_spider/lib.rs
//! # scrapling-spider
//!
//! The web crawling engine for the scrapling-rs framework. This crate provides the
//! orchestration layer that ties together HTTP fetching, request scheduling, response
//! caching, robots.txt compliance, and checkpoint-based pause/resume into a single
//! crawl loop.
//!
//! ## Architecture
//!
//! A crawl is driven by two main abstractions:
//!
//! - **[`Spider`]** -- a trait you implement to define *what* to crawl. You supply
//!   start URLs, a `parse` function that extracts data and follow-up links from each
//!   response, and optional configuration knobs (concurrency, download delay, allowed
//!   domains, etc.).
//!
//! - **[`CrawlerEngine`]** -- the runtime that executes the crawl loop. It owns a
//!   [`Scheduler`] (priority queue with deduplication), a [`SessionManager`] (named
//!   HTTP sessions), and optional managers for robots.txt, response caching, and
//!   checkpointing. You create an engine, hand it your `Spider`, and call
//!   [`CrawlerEngine::crawl`].
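//!
//! As a rough illustration of what "priority queue with deduplication" means, the
//! sketch below builds one from standard-library types only. `SketchScheduler`, its
//! methods, the use of the raw URL as the fingerprint, and the "higher number is
//! dequeued first" ordering are all assumptions made for this example; they are not
//! the API or the guaranteed behavior of the real [`Scheduler`].
//!
//! ```rust
//! use std::cmp::Reverse;
//! use std::collections::{BinaryHeap, HashSet};
//!
//! /// Hypothetical stand-in for illustration, not the crate's `Scheduler`.
//! pub struct SketchScheduler {
//!     // Max-heap keyed by (priority, Reverse(insertion sequence), url).
//!     queue: BinaryHeap<(i32, Reverse<u64>, String)>,
//!     // Fingerprints already accepted; here simply the URL string (an assumption).
//!     seen: HashSet<String>,
//!     next_seq: u64,
//! }
//!
//! impl SketchScheduler {
//!     pub fn new() -> Self {
//!         Self { queue: BinaryHeap::new(), seen: HashSet::new(), next_seq: 0 }
//!     }
//!
//!     /// Queues `url` unless it was seen before. Returns `false` for duplicates.
//!     pub fn enqueue(&mut self, url: &str, priority: i32) -> bool {
//!         if !self.seen.insert(url.to_owned()) {
//!             return false;
//!         }
//!         self.queue.push((priority, Reverse(self.next_seq), url.to_owned()));
//!         self.next_seq += 1;
//!         true
//!     }
//!
//!     /// Pops the highest-priority URL; ties come out in insertion order.
//!     pub fn dequeue(&mut self) -> Option<String> {
//!         self.queue.pop().map(|(_, _, url)| url)
//!     }
//! }
//! ```
//!
//! In this sketch, enqueuing the same URL a second time returns `false` and leaves
//! the queue unchanged, and `dequeue` yields higher-priority URLs before
//! lower-priority ones.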
//!
//! ## Module overview
//!
//! | Module | Purpose |
//! |--------|---------|
//! | [`spider`] | The `Spider` trait and `CrawlerEngine` crawl loop |
//! | [`request`] | `Request`, `Callback`, and `SpiderOutput` types |
//! | [`result`] | `CrawlResult`, `CrawlStats`, and `ItemList` output types |
//! | [`scheduler`] | Priority-queue scheduler with fingerprint deduplication |
//! | [`session`] | `Session` / `SessionManager` for named HTTP backends |
//! | [`cache`] | Filesystem-based HTTP response cache for dev mode |
//! | [`checkpoint`] | Pause/resume support via JSON snapshots on disk |
//! | [`robotstxt`] | Fetching and enforcing robots.txt rules per domain |
//! | [`error`] | `SpiderError` enum and `Result` alias |
//! | [`logging`] | Thread-safe log-level counter for crawl statistics |
//!
//! ## Quick start
//!
//! ```rust,ignore
//! use scrapling_spider::{Spider, CrawlerEngine, Request, SpiderOutput};
//! use scrapling_fetch::Response;
//!
//! struct MyScraper;
//!
//! impl Spider for MyScraper {
//!     fn name(&self) -> &str { "my_scraper" }
//!     fn start_urls(&self) -> Vec<String> {
//!         vec!["https://example.com".into()]
//!     }
//!     fn parse(&self, response: Response) -> Vec<SpiderOutput> {
//!         // Extract data or follow links here
//!         vec![]
//!     }
//! }
//!
//! # async fn run() {
//! let spider = MyScraper;
//! let mut engine = CrawlerEngine::new(&spider, None, 0.0).unwrap();
//! let stats = engine.crawl().await.unwrap();
//! println!("Scraped {} items", stats.items_scraped);
//! # }
//! ```

pub mod cache;
pub mod checkpoint;
pub mod error;
pub mod logging;
pub mod request;
pub mod result;
pub mod robotstxt;
pub mod scheduler;
pub mod session;
pub mod spider;

pub use error::{Result, SpiderError};
pub use request::{Callback, Request, SpiderOutput};
pub use result::{CrawlResult, CrawlStats, ItemList};
pub use scheduler::Scheduler;
pub use session::{Session, SessionManager};
pub use spider::{CrawlerEngine, Spider};