Spider
Multithreaded async crawler/indexer using isolates and IPC channels for communication with the ability to run decentralized.
Dependencies
On Linux
- OpenSSL 1.0.1, 1.0.2, 1.1.0, or 1.1.1
Example
This is a basic async example crawling a web page. Add spider to your `Cargo.toml`:
```toml
[dependencies]
spider = "1.60.7"
```
And then the code:
```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // ...
}
```
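For example, a complete crawl that prints every URL found might look like the following sketch (the target URL is only an example, and it assumes `get_links` returns the links collected during the crawl):

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // Example target site - any URL works here.
    let mut website = Website::new("https://choosealicense.com");

    // Crawl the site concurrently.
    website.crawl().await;

    // Print every link that was discovered.
    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```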
You can use the `Configuration` object to configure your crawler:
```rust
// ..
let mut website: Website = Website::new("https://choosealicense.com");

website.configuration.respect_robots_txt = true;
website.configuration.subdomains = true;
website.configuration.tld = false;
website.configuration.delay = 0; // Defaults to 0 ms due to concurrency handling
website.configuration.request_timeout = None; // Defaults to 15000 ms
website.configuration.http2_prior_knowledge = false; // Enable if you know the webserver supports http2
website.configuration.user_agent = Some(Box::new("myapp/version".into())); // Defaults to using a random agent
website.on_link_find_callback = Some(|s, html| { println!("link target: {}", s); (s, html) }); // Callback to run on each link find - useful for mutating the url, ex: convert the top level domain from `.fr` to `.es`.
website.configuration.blacklist_url.get_or_insert(Default::default()).push("https://choosealicense.com/licenses/".into());
website.configuration.proxies.get_or_insert(Default::default()).push("socks5://10.1.1.1:12345".into()); // Defaults to None - proxy list.
website.budget = Some(spider::hashbrown::HashMap::from([("*".into(), 300), ("/licenses".into(), 10)])); // Defaults to None - Requires the `budget` feature flag
website.cron_str = "1/5 * * * * *".into(); // Defaults to empty string - Requires the `cron` feature flag
website.cron_type = spider::website::CronType::Crawl; // Defaults to CronType::Crawl - Requires the `cron` feature flag

website.crawl().await;
```
The builder pattern is also available in v1.33.0 and up:
```rust
let mut website = Website::new("https://choosealicense.com");

website
    .with_respect_robots_txt(true)
    .with_subdomains(true)
    .with_tld(false)
    .with_delay(0)
    .with_request_timeout(None)
    .with_http2_prior_knowledge(false)
    .with_user_agent(Some("myapp/version"))
    // requires the `budget` feature flag
    .with_budget(Some(spider::hashbrown::HashMap::from([("*", 300), ("/licenses", 10)])))
    .with_external_domains(Some(["https://creativecommons.org/licenses/by/3.0/".to_string()].into_iter()))
    .with_headers(None)
    .with_blacklist_url(Some(Vec::from(["https://choosealicense.com/licenses/".into()])))
    .with_proxies(None)
    // requires the `cron` feature flag
    .with_cron("1/5 * * * * *", Default::default());
```
Features
We have a couple of optional feature flags: regex blacklisting, the jemalloc backend, globbing, fs temp storage, decentralization, serde, gathering full assets, and randomizing user agents.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["regex", "ua_generator"] }
```
- `ua_generator`: Enables auto generating a random real User-Agent.
- `regex`: Enables blacklisting paths with regex.
- `jemalloc`: Enables the jemalloc memory backend.
- `decentralized`: Enables decentralized processing of IO, requires the spider_worker startup before crawls.
- `sync`: Subscribe to changes for Page data processing async. [Enabled by default]
- `budget`: Allows setting a crawl budget per path with depth.
- `control`: Enables the ability to pause, start, and shutdown crawls on demand.
- `full_resources`: Enables gathering all content that relates to the domain like CSS, JS, etc.
- `serde`: Enables serde serialization support.
- `socks`: Enables socks5 proxy support.
- `glob`: Enables url glob support.
- `fs`: Enables storing resources to disk for parsing (may greatly increase performance at the cost of temp storage).
- `js`: Enables javascript parsing of links created with the alpha jsdom crate.
- `sitemap`: Include sitemap pages in results.
- `time`: Enables duration tracking per page.
- `chrome`: Enables chrome headless rendering, use the env var `CHROME_URL` to connect remotely.
- `chrome_screenshot`: Enables storing a screenshot of each page on crawl. Defaults the screenshots to the ./storage/ directory. Use the env variable `SCREENSHOT_DIRECTORY` to adjust the directory. To save the background set the env var `SCREENSHOT_OMIT_BACKGROUND` to false.
- `chrome_headed`: Enables chrome headful rendering [experimental].
- `chrome_cpu`: Disables gpu usage for the chrome browser.
- `chrome_stealth`: Enables stealth mode to make it harder to be detected as a bot.
- `cookies`: Enables cookie storing and setting to use for requests.
- `cron`: Enables the ability to start cron jobs for the website.
- `smart`: Enables smart mode. This runs requests as HTTP until JavaScript rendering is needed. This avoids sending multiple network requests by re-using the content.
Decentralization
Move processing to a worker. This drastically increases performance even if the worker is on the same machine, due to the efficient runtime split of IO work.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["decentralized"] }
```
```sh
# install the worker
cargo install spider_worker
# start the worker [set the worker on another machine in prod]
RUST_LOG=info SPIDER_WORKER_PORT=3030 spider_worker
# start rust project as normal with SPIDER_WORKER env variable
SPIDER_WORKER=http://127.0.0.1:3030 cargo run
```
The `SPIDER_WORKER` env variable takes a comma separated list of urls to set the workers. If the `scrape` feature flag is enabled, use the `SPIDER_WORKER_SCRAPER` env variable to determine the scraper worker.
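For example, splitting a crawl across two workers might look like this (the addresses and the `cargo run` invocation are illustrative):

```sh
# Two workers, comma separated - replace with the hosts running spider_worker.
SPIDER_WORKER=http://127.0.0.1:3030,http://127.0.0.1:3031 cargo run
```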
Subscribe to changes
Use the subscribe method to get a broadcast channel.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["sync"] }
```
```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // ...
}
```
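A subscription-driven crawl might look like the sketch below; it assumes `subscribe` takes a broadcast channel capacity and that each received page exposes `get_url`:

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");

    // Assumed: `subscribe` returns a broadcast receiver of pages; 16 is the channel capacity.
    let mut rx = website.subscribe(16).unwrap();

    // Handle pages as they are crawled.
    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    website.crawl().await;

    // The crawl is finished - stop the listener task.
    handle.abort();
}
```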
Regex Blacklisting
Allow regex for blacklisting routes.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["regex"] }
```
```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // ...
}
```
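A crawl that blacklists routes by pattern might look like this sketch; it assumes that with the `regex` flag enabled the blacklist entries are matched as regular expressions:

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");

    // With the `regex` feature the entry below is treated as a pattern, not a literal url.
    website
        .configuration
        .blacklist_url
        .get_or_insert(Default::default())
        .push("/licenses/".into());

    website.crawl().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```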
Pause, Resume, and Shutdown
If you are performing large workloads, you may need to control the crawler by enabling the `control` feature flag:
```toml
[dependencies]
spider = { version = "1.60.7", features = ["control"] }
```
```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // ...
}
```
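A rough sketch of controlling a crawl from another task follows; it assumes the `control` feature exposes `pause`, `resume`, and `shutdown` helpers under `spider::utils`, keyed by the crawl's start URL:

```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Assumed location of the control helpers when the `control` flag is enabled.
    use spider::utils::{pause, resume, shutdown};

    let url = "https://choosealicense.com";
    let mut website = Website::new(url);

    tokio::spawn(async move {
        // Pause the crawl, wait a few seconds, then resume it.
        pause(url).await;
        tokio::time::sleep(tokio::time::Duration::from_secs(5)).await;
        resume(url).await;
        // Shut the crawl down entirely if it is still running after another 10 seconds.
        tokio::time::sleep(tokio::time::Duration::from_secs(10)).await;
        shutdown(url).await;
    });

    website.crawl().await;
}
```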
Scrape/Gather HTML
```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // ...
}
```
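A sketch of gathering the raw HTML; it assumes `scrape` stores the visited pages and that `get_pages`/`get_html` expose them:

```rust
extern crate spider;

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");

    // `scrape` works like `crawl` but keeps the page bodies around.
    website.scrape().await;

    // Assumed: `get_pages` returns the stored pages and `get_html` their markup.
    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            println!("{}", page.get_html());
        }
    }
}
```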
Cron Jobs
Use cron jobs to run crawls continuously at any time.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["sync", "cron"] }
```
```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // ...
}
```
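A cron-driven crawl might look like the sketch below; it assumes `run_cron` starts the scheduler and returns a runner with a `stop` method, and that `subscribe` behaves as described above:

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");
    // Run a new crawl every 5 seconds.
    website.cron_str = "1/5 * * * * *".into();

    // Assumed: pages from each scheduled crawl are broadcast to subscribers.
    let mut rx = website.subscribe(16).unwrap();
    let handle = tokio::spawn(async move {
        while let Ok(page) = rx.recv().await {
            println!("{:?}", page.get_url());
        }
    });

    // Assumed: `run_cron` starts the schedule and hands back a runner to stop it.
    let runner = website.run_cron().await;

    // Let the schedule fire a few times, then stop everything.
    tokio::time::sleep(tokio::time::Duration::from_secs(15)).await;
    runner.stop().await;
    handle.abort();
}
```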
Chrome
Connecting to Chrome can be done using the env variable `CHROME_URL`; if no connection is found, a new browser is launched on the system. You do not need a Chrome installation if you are connecting remotely.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["chrome"] }
```
You can use `website.crawl_concurrent_raw` to perform a crawl without Chromium when needed. Use the feature flag `chrome_headed` to enable headful browser usage if you need to debug.
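With the flag enabled the crawl code itself does not change; pointing it at a remote browser is only an environment variable (the endpoint below is an example):

```sh
# Reuse an already running Chrome instance instead of launching one locally.
CHROME_URL=http://localhost:9222 cargo run
```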
Smart Mode
Intelligently run crawls using HTTP and JavaScript rendering only when needed: the best of both worlds for maintaining speed while extracting every page. This requires a Chrome connection or a browser installed on the system.
```toml
[dependencies]
spider = { version = "1.60.7", features = ["smart"] }
```
```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    // ...
}
```
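A sketch of a smart crawl; it assumes the `smart` feature exposes a `crawl_smart` method that only falls back to Chrome rendering when JavaScript is required:

```rust
extern crate spider;

use spider::website::Website;
use spider::tokio;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://choosealicense.com");

    // Assumed: `crawl_smart` issues plain HTTP requests first and re-uses the
    // response, rendering through Chrome only when JavaScript is needed.
    website.crawl_smart().await;

    for link in website.get_links() {
        println!("- {:?}", link.as_ref());
    }
}
```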
Blocking
If you need a blocking sync implementation, use a version prior to v1.12.0.