//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro behind one crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], and [`ParseOutput`] are the core runtime data types
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "3.0.2"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
//!
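//! Optional middleware and pipelines are gated behind Cargo feature flags.
//! A sketch of enabling a few of them from the same dependency line, using
//! the feature names this crate documents (`live-stats`, `pipeline-csv`,
//! `middleware-robots`); the inline glosses are informal descriptions, not
//! authoritative:
//!
//! ```toml
//! [dependencies]
//! spider-lib = { version = "3.0.2", features = [
//!     "live-stats",        # crawl statistics while running
//!     "pipeline-csv",      # CSV output pipeline
//!     "middleware-robots", # robots.txt handling middleware
//! ] }
//! ```
//!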
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         _state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         let html = response.to_html()?;
//!         let mut output = ParseOutput::new();
//!
//!         for quote in html.select(&".quote".to_selector()?) {
//!             let text = quote
//!                 .select(&".text".to_selector()?)
//!                 .next()
//!                 .map(|node| node.text().collect::<String>())
//!                 .unwrap_or_default();
//!
//!             let author = quote
//!                 .select(&".author".to_selector()?)
//!                 .next()
//!                 .map(|node| node.text().collect::<String>())
//!                 .unwrap_or_default();
//!
//!             output.add_item(Quote { text, author });
//!         }
//!
//!         Ok(output)
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! [`Spider::parse`] takes `&self` and a separate shared-state parameter.
//! That design keeps the spider itself immutable while still giving
//! concurrent `parse` calls access to user-defined shared state.
//!
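//! Per-crawl data that every call needs can live in that shared state. A
//! minimal sketch of the trait side, assuming only the trait items shown in
//! the quick start (`CountingSpider` and the atomic counter are illustrative;
//! how the state value is constructed and handed to the runtime depends on
//! the builder API):
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! struct CountingSpider;
//!
//! #[async_trait]
//! impl Spider for CountingSpider {
//!     type Item = Quote;
//!     type State = AtomicUsize; // shared by concurrent `parse` calls
//!
//!     // `start_requests` as in the quick start, omitted here.
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         // `parse` only receives `&self` and `&Self::State`, so any
//!         // mutation goes through interior mutability such as an atomic.
//!         state.fetch_add(1, Ordering::Relaxed);
//!         Ok(ParseOutput::new())
//!     }
//! }
//! ```
//!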
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
//!
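//! Wiring the first two of those steps together might look like the
//! following sketch (`RobotsMiddleware` and `CsvPipeline` are illustrative
//! stand-ins for whatever types the enabled features export, not confirmed
//! names):
//!
//! ```rust,ignore
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     .add_middleware(RobotsMiddleware::default())
//!     .add_pipeline(CsvPipeline::new("quotes.csv"))
//!     .build()
//!     .await?;
//!
//! crawler.start_crawl().await?;
//! ```
//!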
//! If you find yourself needing transport-level customization, custom
//! middleware contracts, or lower-level runtime control, move down to the
//! crate-specific APIs in `spider-core`, `spider-downloader`,
//! `spider-middleware`, or `spider-pipeline`.

/// The application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub mod prelude;

pub use prelude::*;

// Re-export the `#[scraped_item]` procedural macro.
pub use spider_macro::scraped_item;