//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro from a single crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], and [`ParseOutput`] are the core runtime data types
//! - [`Response::css`](spider_util::response::Response::css) provides Scrapy-like builtin selectors
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "3.0.2"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
//!
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         _state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         let mut output = ParseOutput::new();
//!
//!         for quote in response.css(".quote")? {
//!             let text = quote
//!                 .css(".text::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             let author = quote
//!                 .css(".author::text")?
//!                 .get()
//!                 .unwrap_or_default();
//!
//!             output.add_item(Quote { text, author });
//!         }
//!
//!         Ok(output)
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! The built-in selector API is the recommended path for HTML extraction:
//! `response.css(".card")?`, `node.css("a::attr(href)")?.get()`, and
//! `node.css(".title::text")?.get()`.
//!
//! [`Spider::parse`] takes `&self` and a separate shared state parameter.
//! That design keeps the spider itself immutable while still allowing
//! concurrent parsing with user-defined shared state.
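//!
//! As a sketch of that design, shared state can carry cross-request data such
//! as an atomic counter. Only [`Spider::State`] and the `parse` signature come
//! from this crate; the `PageCount` type is a hypothetical user-defined state,
//! and the extraction body is elided:
//!
//! ```rust,ignore
//! use std::sync::atomic::{AtomicUsize, Ordering};
//!
//! /// Hypothetical user-defined shared state: counts parsed pages.
//! #[derive(Default)]
//! struct PageCount {
//!     pages: AtomicUsize,
//! }
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = PageCount;
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         // `&self` stays immutable; mutation goes through the shared state,
//!         // which is safe to touch from concurrent parse calls.
//!         state.pages.fetch_add(1, Ordering::Relaxed);
//!         // ... extraction as in the quick start ...
//!         Ok(ParseOutput::new())
//!     }
//!     // ... start_requests as in the quick start ...
//! }
//! ```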
//!
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
//!
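//! Steps 1 and 2 can be sketched on the builder like this. The builder methods
//! are [`CrawlerBuilder::add_middleware`] and [`CrawlerBuilder::add_pipeline`];
//! the `RetryMiddleware` and `CsvPipeline` names below are illustrative
//! placeholders for whatever middleware and pipeline types your enabled
//! features provide:
//!
//! ```rust,ignore
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     // Hypothetical middleware instance, enabled via a feature flag.
//!     .add_middleware(RetryMiddleware::default())
//!     // Hypothetical pipeline writing items to CSV (cf. the `pipeline-csv` feature).
//!     .add_pipeline(CsvPipeline::new("quotes.csv"))
//!     .build()
//!     .await?;
//! crawler.start_crawl().await?;
//! ```
//!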
//! If you find yourself needing transport-level customization, custom
//! middleware contracts, or lower-level runtime control, drop down to the
//! crate-specific APIs in `spider-core`, `spider-downloader`,
//! `spider-middleware`, or `spider-pipeline`.

extern crate self as spider_lib;

pub mod prelude;
/// Re-export the application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub use prelude::*;
pub use spider_core::route_by_rule;

// Re-export procedural macros
pub use spider_macro::scraped_item;