Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather millions of pages within seconds.
§How to use Spider
There are a couple of ways to use Spider:

- crawl: start concurrently crawling a site. Can be used to send each page (including URL and HTML) to a subscriber for processing, or just to gather links.
- scrape: like crawl, but keeps the raw HTML strings so they can be parsed after scraping is complete (see the sketch at the end of the Examples section).
§Examples
A simple crawl to index a website:
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    website.crawl().await;

    let links = website.get_links();

    for link in links {
        println!("- {:?}", link.as_ref());
    }
}
Subscribe to crawl events:
use spider::tokio;
use spider::website::Website;
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    let mut rx2 = website.subscribe(16).unwrap();

    tokio::spawn(async move {
        let mut stdout = tokio::io::stdout();

        while let Ok(res) = rx2.recv().await {
            let _ = stdout
                .write_all(format!("- {}\n", res.get_url()).as_bytes())
                .await;
        }
    });

    website.crawl().await;
}
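Scrape a website and read the stored HTML afterwards. This is a minimal sketch: the get_pages and get_html accessors are assumed from the page module and should be checked against the current API.

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website: Website = Website::new("https://spider.cloud");
    // scrape keeps the raw HTML of each page in memory until the run completes
    website.scrape().await;

    // iterate over the stored pages and report the size of each HTML body
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("- {} ({} bytes)", page.get_url(), page.get_html().len());
        }
    }
}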
§Feature flags
- ua_generator: Enables auto generating a random real User-Agent.
- disk: Enables SQLite hybrid disk storage to balance memory usage, without TLS.
- disk_native_tls: Enables SQLite hybrid disk storage to balance memory usage, with native TLS.
- disk_aws: Enables SQLite hybrid disk storage to balance memory usage, with AWS TLS.
- balance: Enables balancing the CPU and memory to scale more efficiently.
- regex: Enables blacklisting paths with regex.
- jemalloc: Enables the jemalloc memory backend.
- decentralized: Enables decentralized processing of IO; requires spider_worker to be started before crawls.
- sync: Enables subscribing to changes for async Page data processing.
- control: Enables the ability to pause, start, and shut down crawls on demand.
- full_resources: Enables gathering all content that relates to the domain, like CSS, JS, and so on.
- serde: Enables serde serialization support.
- socks: Enables SOCKS5 proxy support.
- glob: Enables URL glob support.
- fs: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temporary storage). Enabled by default.
- sitemap: Includes sitemap pages in results.
- time: Enables duration tracking per page.
- cache: Enables caching HTTP requests to disk.
- cache_mem: Enables caching HTTP requests in memory.
- cache_chrome_hybrid: Enables hybrid request caching between chrome and HTTP.
- cache_openai: Enables caching OpenAI requests. This can drastically save costs when developing AI workflows.
- chrome: Enables chrome headless rendering; use the env var CHROME_URL to connect remotely.
- chrome_headed: Enables headful chrome rendering.
- chrome_cpu: Disables GPU usage for the chrome browser.
- chrome_stealth: Enables stealth mode to make it harder to be detected as a bot.
- chrome_store_page: Stores the page object to perform other actions, like taking screenshots conditionally.
- chrome_screenshot: Enables storing a screenshot of each page on crawl. Screenshots default to the ./storage/ directory; use the env var SCREENSHOT_DIRECTORY to adjust it.
- chrome_intercept: Allows intercepting network requests to speed up processing.
- chrome_headless_new: Uses headless=new to launch the chrome instance.
- cookies: Enables storing and setting cookies for requests.
- real_browser: Enables the ability to bypass protected pages.
- cron: Enables the ability to start cron jobs for the website.
- openai: Enables OpenAI to generate dynamic browser executable scripts. Make sure to set the env var OPENAI_API_KEY.
- smart: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed, avoiding multiple network requests by re-using the content.
- encoding: Enables handling content with different encodings, like Shift_JIS.
- spoof: Spoofs HTTP headers for requests.
- headers: Enables extraction of header information on each retrieved page. Adds a headers field to the page struct.
- decentralized_headers: Enables extraction of suppressed header information from the decentralized processing of IO. This is needed if headers is set in both spider and spider_worker.
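Features are opt-in at build time via Cargo. As a rough sketch (the version number and the particular feature selection below are illustrative assumptions, not a recommendation), enabling flags looks like:

[dependencies]
# Feature names come from the list above; pin the spider release you actually target.
spider = { version = "2", features = ["regex", "sitemap", "time"] }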
Re-exports§
pub extern crate auto_encoder;
pub extern crate bytes;
pub extern crate case_insensitive_string;
pub extern crate hashbrown;
pub extern crate lazy_static;
pub extern crate percent_encoding;
pub extern crate quick_xml;
pub extern crate reqwest;
pub extern crate smallvec;
pub extern crate string_concat;
pub extern crate strum;
pub extern crate tokio;
pub extern crate tokio_stream;
pub extern crate ua_generator;
pub extern crate url;
pub use case_insensitive_string::compact_str;
Modules§
- black_list - Blacklist checking whether a URL exists.
- configuration - Configuration structure for Website.
- features - Optional features to use.
- packages - Internal packages customized.
- page - A page scraped.
- utils - Application utils.
- website - A website to crawl.
Structs§
- CaseInsensitiveString - Case-insensitive string handling.
Type Aliases§
- Client - The asynchronous Client to make requests with.
- ClientBuilder - The asynchronous Client builder.
- RelativeSelectors - The selectors type. The values are held to make sure the relative domain can be crawled upon base redirects.