Website crawling library that rapidly crawls all pages to gather links via isolated contexts.
Spider is a multi-threaded crawler that can be configured to scrape web pages. It can gather tens of thousands of pages within seconds.
How to use Spider
There are a couple of ways to use Spider:
- Concurrent is the fastest way to start crawling a web page and typically the most efficient. `crawl` is used to crawl concurrently.
- Sequential lets you crawl the web pages one after another, respecting delay sequences. `crawl_sync` is used to crawl synchronously.
- Scrape crawls the page and holds onto the raw HTML string for parsing. `scrape` is used to gather the HTML.
Basic usage
First, you will need to add `spider` to your `Cargo.toml`.
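For example (the version number here is illustrative; pin the release you actually want):

```toml
[dependencies]
spider = "1"
```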
Next, construct a `Website` with the target URL and call `crawl`; you can also crawl sequentially with `crawl_sync`.
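A minimal sketch of that flow, using the method names from the crate's early 1.x releases (the target URL is a placeholder; check the version you pinned, as the API has changed across releases):

```rust
extern crate spider;

use spider::website::Website;

fn main() {
    // Create a crawler for the target site (placeholder URL).
    let mut website: Website = Website::new("https://choosealicense.com");

    // Crawl concurrently; swap in `website.crawl_sync()` to visit
    // pages one after another, respecting configured delays.
    website.crawl();

    // Visited pages are stored on the `Website` struct.
    for page in website.get_pages() {
        println!("- {}", page.get_url());
    }
}
```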
Re-exports
pub extern crate compact_str;
pub extern crate hashbrown;
pub extern crate tokio;
pub extern crate url;
Modules
- `black_list`: checking whether a URL exists in the black list.
- `configuration`: configuration structure for `Website`.
- `packages`: customized internal packages.
- `page`: a scraped page.
- `utils`: application utils.
- `website`: a website to crawl.