//! # spider-lib
//!
//! `spider-lib` is the easiest way to use this workspace as an application
//! framework. It re-exports the crawler runtime, common middleware and
//! pipelines, shared request and response types, and the `#[scraped_item]`
//! macro behind one crate.
//!
//! If you want the lower-level pieces individually, the workspace also exposes
//! `spider-core`, `spider-middleware`, `spider-pipeline`, `spider-downloader`,
//! `spider-macro`, and `spider-util`. Most users should start here.
//!
//! ## What you get from the facade crate
//!
//! The root crate is optimized for application authors:
//!
//! - [`prelude`] re-exports the common types needed to define and run a spider
//! - [`Spider`] describes crawl behavior
//! - [`CrawlerBuilder`] assembles the runtime
//! - [`Request`], [`Response`], and [`ParseOutput`] are the core runtime data types
//! - [`Response::css`](spider_util::response::Response::css) provides Scrapy-like builtin selectors
//! - middleware and pipelines can be enabled with feature flags and then added
//!   through the builder
//!
//! ## Installation
//!
//! ```toml
//! [dependencies]
//! spider-lib = "3.0.2"
//! serde = { version = "1.0", features = ["derive"] }
//! serde_json = "1.0"
//! ```
//!
//! `serde` and `serde_json` are required when you use [`scraped_item`].
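//!
//! Optional features such as `live-stats`, `pipeline-csv`, and
//! `middleware-robots` (see "Typical next steps" below) go on the same
//! dependency line, for example:
//!
//! ```toml
//! [dependencies]
//! spider-lib = { version = "3.0.2", features = ["live-stats", "pipeline-csv"] }
//! ```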
//!
//! ## Quick start
//!
//! ```rust,ignore
//! use spider_lib::prelude::*;
//!
//! #[scraped_item]
//! struct Quote {
//!     text: String,
//!     author: String,
//! }
//!
//! struct QuotesSpider;
//!
//! #[async_trait]
//! impl Spider for QuotesSpider {
//!     type Item = Quote;
//!     type State = ();
//!
//!     fn start_requests(&self) -> Result<StartRequests<'_>, SpiderError> {
//!         Ok(StartRequests::Urls(vec!["https://quotes.toscrape.com/"]))
//!     }
//!
//!     async fn parse(
//!         &self,
//!         response: Response,
//!         _state: &Self::State,
//!     ) -> Result<ParseOutput<Self::Item>, SpiderError> {
//!         let mut output = ParseOutput::new();
//!
//!         for quote in response.css(".quote")? {
//!             let text = quote.css(".text::text")?.get().unwrap_or_default();
//!             let author = quote.css(".author::text")?.get().unwrap_or_default();
//!             output.add_item(Quote { text, author });
//!         }
//!
//!         Ok(output)
//!     }
//! }
//!
//! #[tokio::main]
//! async fn main() -> Result<(), SpiderError> {
//!     let crawler = CrawlerBuilder::new(QuotesSpider).build().await?;
//!     crawler.start_crawl().await
//! }
//! ```
//!
//! The built-in selector API is the recommended path for HTML extraction:
//! `response.css(".card")?`, `node.css("a::attr(href)")?.get()`, and
//! `node.css(".title::text")?.get()`.
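//!
//! Link extraction composes the same calls; for example, collecting each
//! card's `href` (a sketch reusing only the selector methods shown above):
//!
//! ```rust,ignore
//! let mut links = Vec::new();
//! for card in response.css(".card")? {
//!     // `get()` yields the first match, if any.
//!     if let Some(href) = card.css("a::attr(href)")?.get() {
//!         links.push(href);
//!     }
//! }
//! ```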
//!
//! [`Spider::parse`] takes `&self` and a separate shared state parameter.
//! That design keeps the spider itself immutable while still allowing
//! concurrent parsing with user-defined shared state.
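//!
//! The same interior-mutability idea in plain `std`, independent of
//! `spider-lib`: several workers bump a shared counter through a
//! `&AtomicUsize`, just as `parse` can mutate shared state through
//! `&Self::State` without needing `&mut self`:
//!
//! ```rust
//! use std::sync::atomic::{AtomicUsize, Ordering};
//! use std::thread;
//!
//! let counter = AtomicUsize::new(0);
//! thread::scope(|s| {
//!     for _ in 0..4 {
//!         s.spawn(|| {
//!             // Increment via a shared reference; no `&mut` needed.
//!             counter.fetch_add(10, Ordering::Relaxed);
//!         });
//!     }
//! });
//! assert_eq!(counter.load(Ordering::Relaxed), 40);
//! ```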
//!
//! ## Typical next steps
//!
//! After the minimal spider works, the next additions are usually:
//!
//! 1. add one or more middleware with [`CrawlerBuilder::add_middleware`]
//! 2. add one or more pipelines with [`CrawlerBuilder::add_pipeline`]
//! 3. move repeated parse-time state into [`Spider::State`]
//! 4. enable optional features such as `live-stats`, `pipeline-csv`, or
//!    `middleware-robots`
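//!
//! Steps 1 and 2 are builder calls made before `build()`. A sketch follows;
//! the `RobotsMiddleware` and `CsvPipeline` names are placeholders, not
//! confirmed types from this workspace:
//!
//! ```rust,ignore
//! let crawler = CrawlerBuilder::new(QuotesSpider)
//!     .add_middleware(RobotsMiddleware::default())
//!     .add_pipeline(CsvPipeline::default())
//!     .build()
//!     .await?;
//! crawler.start_crawl().await?;
//! ```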
//!
//! If you find yourself needing transport-level customization, custom
//! middleware contracts, or lower-level runtime control, move down to the
//! crate-specific APIs in `spider-core`, `spider-downloader`,
//! `spider-middleware`, or `spider-pipeline`.
extern crate self as spider_lib;
/// Re-export the application-facing prelude.
///
/// Most examples and first integrations start with:
///
/// ```rust
/// use spider_lib::prelude::*;
/// ```
pub use spider_core::prelude;
pub use spider_middleware::route_by_rule;
// Re-export procedural macros
pub use spider_macro::scraped_item;