//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro behind one crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], and [`ParseOutput`] are the core runtime data types
//! - [`Response::css`](spider_util::response::Response::css) provides built-in, Scrapy-style selectors
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "3.0.2"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
//!
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         _state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         let mut output = ParseOutput::new();
//!
//!         for quote in response.css(".quote")? {
//!             let text = quote
//!                 .css(".text::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             let author = quote
//!                 .css(".author::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             output.add_item(Quote { text, author });
//!         }
//!
//!         Ok(output)
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! The built-in selector API is the recommended path for HTML extraction:
//! `response.css(".card")?`, `node.css("a::attr(href)")?.get()`, and
//! `node.css(".title::text")?.get()`.
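//!
//! Combining those calls, a link-extraction pass is only a few lines. This is
//! a sketch against the selector calls shown above, not a complete program:
//!
//! ```rust,ignore
//! // Collect the title text and link target of every card on the page.
//! for card in response.css(".card")? {
//!     let title = card.css(".title::text")?.get().unwrap_or_default();
//!     let href = card.css("a::attr(href)")?.get().unwrap_or_default();
//!     println!("{title} -> {href}");
//! }
//! ```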
//!
//! [`Spider::parse`] takes `&self` and a separate shared state parameter.
//! That design keeps the spider itself immutable while still allowing
//! concurrent parsing with user-defined shared state.
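//!
//! Sketched with an atomic counter as the shared state (how the initial state
//! value reaches the runtime is not shown here, and `CountingSpider` is an
//! illustrative name, not part of this crate):
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! struct CountingSpider;
//!
//! #[async_trait]
//! impl Spider for CountingSpider {
//!     type Item = Quote;
//!     type State = AtomicUsize; // shared by every concurrent `parse` call
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(
//!         &self,
//!         _response: Response,
//!         state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         // `&self` stays immutable; mutable per-crawl data lives in the
//!         // state, whose interior mutability makes it safe to share.
//!         let seen = state.fetch_add(1, Ordering::Relaxed) + 1;
//!         println!("parsed {seen} pages so far");
//!         Ok(ParseOutput::new())
//!     }
//! }
//! ```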
//!
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
//!
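//! Steps 1 and 2 compose through the builder. The middleware and pipeline
//! type names below are illustrative placeholders, not documented API; the
//! concrete types depend on which feature flags you enable:
//!
//! ```rust,ignore
//! // Hypothetical types standing in for whatever `middleware-robots`
//! // and `pipeline-csv` actually export.
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     .add_middleware(RobotsMiddleware::default())
//!     .add_pipeline(CsvPipeline::new("quotes.csv"))
//!     .build()
//!     .await?;
//! crawler.start_crawl().await?;
//! ```
//!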
//! If you find yourself needing transport-level customization, custom
//! middleware contracts, or lower-level runtime control, move down to the
//! crate-specific APIs in `spider-core`, `spider-downloader`,
//! `spider-middleware`, or `spider-pipeline`.

extern crate self as spider_lib;

pub mod prelude;
/// Re-export the application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub use prelude::*;
pub use spider_core::route_by_rule;

// Re-export procedural macros
pub use spider_macro::scraped_item;