// spider/lib.rs

#![warn(missing_docs)]
#![allow(clippy::perf)]
//! Website crawling library that rapidly crawls all pages to
//! gather links via isolated contexts.
//!
//! Spider is a multi-threaded crawler that can be configured
//! to scrape web pages. It has the ability to gather
//! millions of pages within seconds.
//!
//! # How to use Spider
//!
//! There are a couple of ways to use Spider:
//!
//! - [`crawl`]: start concurrently crawling a site. Can be used to send each page (including URL
//!   and HTML) to a subscriber for processing, or just to gather links.
//!
//! - [`scrape`]: like `crawl`, but saves the raw HTML strings to parse after scraping is complete.
//!
//! [`crawl`]: website/struct.Website.html#method.crawl
//! [`scrape`]: website/struct.Website.html#method.scrape
//!
//! # Examples
//!
//! A simple crawl to index a website:
//!
//! ```no_run
//! use spider::tokio;
//! use spider::website::Website;
//!
//! #[tokio::main]
//! async fn main() {
//!     let mut website: Website = Website::new("https://spider.cloud");
//!
//!     website.crawl().await;
//!
//!     let links = website.get_links();
//!
//!     for link in links {
//!         println!("- {:?}", link.as_ref());
//!     }
//! }
//! ```
//!
//! Subscribe to crawl events:
//!
//! ```no_run
//! use spider::tokio;
//! use spider::website::Website;
//!
//! #[tokio::main]
//! async fn main() {
//!     let mut website: Website = Website::new("https://spider.cloud");
//!     let mut rx2 = website.subscribe(16).unwrap();
//!
//!     tokio::spawn(async move {
//!         while let Ok(res) = rx2.recv().await {
//!             println!("- {}", res.get_url());
//!         }
//!     });
//!
//!     website.crawl().await;
//! }
//! ```
//!
//! ## Feature flags
//!
//! - `ua_generator`: Enables auto-generating a random real User-Agent.
//! - `disk`: Enables SQLite hybrid disk storage to balance memory usage, without TLS.
//! - `disk_native_tls`: Enables SQLite hybrid disk storage to balance memory usage, with native TLS.
//! - `disk_aws`: Enables SQLite hybrid disk storage to balance memory usage, with aws_tls.
//! - `balance`: Enables balancing the CPU and memory to scale more efficiently.
//! - `regex`: Enables blacklisting paths with regex.
//! - `firewall`: Enables the spider_firewall crate to prevent crawling bad websites.
//! - `decentralized`: Enables decentralized processing of IO; requires starting [spider_worker](https://docs.rs/crate/spider_worker/latest) before crawls.
//! - `sync`: Enables subscribing to changes for async Page data processing.
//! - `control`: Enables the ability to pause, start, and shutdown crawls on demand.
//! - `full_resources`: Enables gathering all content that relates to the domain, like CSS, JS, etc.
//! - `serde`: Enables serde serialization support.
//! - `socks`: Enables socks5 proxy support.
//! - `glob`: Enables [url glob](https://everything.curl.dev/cmdline/globbing) support.
//! - `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage). Enabled by default.
//! - `sitemap`: Include sitemap pages in results.
//! - `time`: Enables duration tracking per page.
//! - `cache`: Enables caching HTTP requests to disk.
//! - `cache_mem`: Enables caching HTTP requests in memory.
//! - `cache_chrome_hybrid`: Enables hybrid request caching between chrome and HTTP.
//! - `cache_openai`: Enables caching OpenAI requests. This can drastically save costs when developing AI workflows.
//! - `chrome`: Enables chrome headless rendering; use the env var `CHROME_URL` to connect remotely.
//! - `chrome_headed`: Enables headful chrome rendering.
//! - `chrome_cpu`: Disables GPU usage for the chrome browser.
//! - `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
//! - `chrome_store_page`: Stores the page object to perform other actions, like taking screenshots conditionally.
//! - `chrome_screenshot`: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory. Use the env var `SCREENSHOT_DIRECTORY` to adjust the directory.
//! - `chrome_intercept`: Allows intercepting network requests to speed up processing.
//! - `chrome_headless_new`: Uses `headless=new` to launch the chrome instance.
//! - `cookies`: Enables storing and setting cookies to use for requests.
//! - `real_browser`: Enables the ability to bypass protected pages.
//! - `cron`: Enables the ability to start cron jobs for the website.
//! - `openai`: Enables OpenAI to generate dynamic browser-executable scripts. Make sure to use the env var `OPENAI_API_KEY`.
//! - `gemini`: Enables Gemini AI to generate dynamic browser-executable scripts. Make sure to use the env var `GEMINI_API_KEY`.
//! - `smart`: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by re-using the content.
//! - `encoding`: Enables handling content with different encodings, like Shift_JIS.
//! - `spoof`: Spoofs HTTP headers for requests.
//! - `headers`: Enables the extraction of header information on each retrieved page. Adds a `headers` field to the page struct.
//! - `decentralized_headers`: Enables the extraction of suppressed header information during decentralized processing of IO. This is needed if `headers` is set in both [spider](https://docs.rs/spider/latest/spider/) and [spider_worker](https://docs.rs/crate/spider_worker/latest).
//!
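//! Features are opt-in through Cargo. As a minimal sketch (the version number
//! and the chosen flag set here are illustrative assumptions, not the crate's
//! defaults; check the crate's Cargo.toml for the current list):
//!
//! ```toml
//! [dependencies]
//! spider = { version = "2", features = ["regex", "sitemap", "time"] }
//! ```
//!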
//! Additional learning resources include:
//!
//! - [Spider Repository Examples](https://github.com/spider-rs/spider/tree/main/examples)
pub extern crate bytes;
pub extern crate case_insensitive_string;
pub extern crate hashbrown;
extern crate log;
pub extern crate percent_encoding;
pub extern crate quick_xml;
pub extern crate reqwest;
pub extern crate smallvec;
pub extern crate spider_fingerprint;
pub extern crate tokio;
pub extern crate tokio_stream;
pub extern crate url;

#[cfg(feature = "cron")]
pub extern crate async_job;
#[cfg(feature = "openai")]
pub extern crate async_openai;
pub extern crate auto_encoder;
#[cfg(feature = "flexbuffers")]
pub extern crate flexbuffers;
#[cfg(feature = "gemini")]
pub extern crate gemini_rust;
#[cfg(feature = "cache_request")]
pub extern crate http_cache_reqwest;
#[cfg(feature = "cache_openai")]
pub extern crate moka;
#[cfg(feature = "cache_request")]
pub extern crate reqwest_middleware;
#[cfg(feature = "serde")]
pub extern crate serde;
#[cfg(feature = "ua_generator")]
pub extern crate ua_generator;
#[macro_use]
pub extern crate string_concat;
pub extern crate strum;
#[macro_use]
pub extern crate lazy_static;
#[cfg(feature = "firewall")]
pub extern crate spider_firewall;
#[cfg(feature = "agent")]
pub extern crate spider_agent;

/// Re-export agent types from the spider_agent crate.
#[cfg(feature = "agent")]
pub mod agent {
    //! Agent module re-exports from the spider_agent crate.
    //!
    //! This provides convenient access to the multimodal agent functionality.
    pub use spider_agent::{
        Agent, AgentBuilder, AgentConfig, AgentError, AgentMemory, AgentResult,
        FetchResult, HtmlCleaningMode, LimitType, Message, RetryConfig,
        UsageLimits, UsageSnapshot, UsageStats,
        // Custom tool types
        AuthConfig, CustomTool, CustomToolRegistry, CustomToolResult, HttpMethod,
    };

    #[cfg(feature = "agent_openai")]
    pub use spider_agent::OpenAIProvider;

    #[cfg(feature = "agent_chrome")]
    pub use spider_agent::BrowserContext;

    #[cfg(feature = "agent_webdriver")]
    pub use spider_agent::WebDriverContext;

    #[cfg(feature = "agent_fs")]
    pub use spider_agent::{TempStorage, TempFile};

    #[cfg(any(
        feature = "agent_search_serper",
        feature = "agent_search_brave",
        feature = "agent_search_bing",
        feature = "agent_search_tavily"
    ))]
    pub use spider_agent::{
        ResearchOptions, ResearchResult, SearchOptions, SearchProvider,
        SearchResult, SearchResults, TimeRange,
    };

    #[cfg(feature = "agent_search_serper")]
    pub use spider_agent::SerperProvider;

    #[cfg(feature = "agent_search_brave")]
    pub use spider_agent::BraveProvider;

    #[cfg(feature = "agent_search_bing")]
    pub use spider_agent::BingProvider;

    #[cfg(feature = "agent_search_tavily")]
    pub use spider_agent::TavilyProvider;
}

/// Client interface.
pub mod client;
/// Configuration structure for `Website`.
pub mod configuration;
/// Optional features to use.
pub mod features;
/// Customized internal packages.
pub mod packages;
/// A scraped page.
pub mod page;
/// Application utils.
pub mod utils;
/// A website to crawl.
pub mod website;

pub use case_insensitive_string::compact_str;
pub use case_insensitive_string::CaseInsensitiveString;
pub use client::{Client, ClientBuilder};

#[cfg(feature = "chrome")]
pub use chromiumoxide;

#[cfg(feature = "search")]
pub use features::search;
#[cfg(feature = "search")]
pub use features::search_providers;
#[cfg(feature = "regex")]
/// Blacklist checking whether a URL exists, using Regex.
pub mod black_list {
    use crate::compact_str::CompactString;
    /// Check if a link exists in the blacklist with regex.
    pub fn contains(blacklist_url: &regex::RegexSet, link: &CompactString) -> bool {
        blacklist_url.is_match(link)
    }
}

#[cfg(not(feature = "regex"))]
/// Blacklist checking whether a URL exists.
pub mod black_list {
    use crate::compact_str::CompactString;
    /// Check if a link exists in the blacklist.
    pub fn contains(blacklist_url: &[CompactString], link: &CompactString) -> bool {
        blacklist_url.contains(link)
    }
}

/// The selectors type. The values are held to make sure the relative domain can be crawled upon base redirects.
pub type RelativeSelectors = (
    // base domain
    compact_str::CompactString,
    smallvec::SmallVec<[compact_str::CompactString; 2]>,
    // redirected domain
    compact_str::CompactString,
);
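A shape-level sketch of how such a tuple can track a base redirect, with `String`/`Vec` standing in for `CompactString`/`SmallVec` (stand-ins and the redirect-handling flow here are assumptions for illustration, not the crate's internal logic):

```rust
// Stand-in for `RelativeSelectors`: (base domain, sub-selectors, redirected domain).
type RelativeSelectorsSketch = (String, Vec<String>, String);

fn main() {
    // Hypothetical crawl of spider.cloud; the redirected-domain slot starts empty.
    let mut selectors: RelativeSelectorsSketch =
        ("spider.cloud".into(), Vec::new(), String::new());

    // On a base redirect, record the new domain so relative links keep resolving.
    selectors.2 = "www.spider.cloud".into();

    // Links on either the original or the redirected host stay in scope.
    let host = "www.spider.cloud";
    assert!(host == selectors.0 || host == selectors.2);
    println!("redirect domain tracked: {}", selectors.2);
}
```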