Crate voyager
With voyager you can easily extract structured data from websites.
Write your own crawler/scraper with voyager following a state machine model.
Example
```rust
/// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
```
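Since the scraper owns its `Selector`s, a `Default` implementation can build them up front with `Selector::parse` from the re-exported `scraper` crate. The following is a minimal sketch; the CSS selector strings and the `max_page` value are illustrative assumptions, not part of the crate's API:

```rust
use scraper::Selector;

impl Default for HackernewsScraper {
    fn default() -> Self {
        // The CSS selectors below are assumptions for illustration;
        // adjust them to the markup of the pages you actually target.
        Self {
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            max_page: 1,
        }
    }
}
```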
Implement the `voyager::Scraper` trait

A `Scraper` consists of two associated types:

- `Output`, the type the scraper eventually produces
- `State`, a state the scraper can drag along across several requests that eventually lead to an `Output`

and the `scrape` callback, which is invoked after each received response.

Based on the state attached to a response you can supply the crawler with new urls to visit, with or without a state attached.
Scraping is done with causal-agent/scraper.
```rust
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }
                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }
        Ok(None)
    }
}
```
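Inside the `Post` arm, individual fields are typically pulled out with the re-exported `scraper` crate's selection API, chaining `select`, `next`, and `text`. A hypothetical sketch of extracting one field (this snippet is an illustration, not part of the example above):

```rust
// Extract the author text using the scraper crate's selection API.
// `self.author_selector` is the Selector declared on HackernewsScraper.
let author: String = html
    .select(&self.author_selector)
    .next()
    .map(|el| el.text().collect())
    .unwrap_or_default();
```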
Setup and collect all the output
Configure the crawler via `CrawlerConfig`:
- Allow/Block list of URLs
- Delays between requests
- Whether to respect the `robots.txt` rules
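Putting those options together might look like the following sketch. `allow_domain_with_delay` appears in the full example below; `respect_robots_txt` is an assumed builder-method name, so check the `CrawlerConfig` docs for the exact API of your version:

```rust
use std::time::Duration;

// A sketch of a crawler configuration. `respect_robots_txt` is an
// assumed method name -- verify it against the `CrawlerConfig`
// documentation for the voyager version you use.
let config = CrawlerConfig::default()
    // only crawl this domain, waiting 500ms between requests
    .allow_domain_with_delay(
        "news.ycombinator.com",
        RequestDelay::Fixed(Duration::from_millis(500)),
    )
    // honor the site's robots.txt rules (assumed method name)
    .respect_robots_txt();
```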
Feed your config and an instance of your scraper to the `Collector` that drives the `Crawler` and forwards the responses to your `Scraper`.
```rust
// only fulfill requests to `news.ycombinator.com`
let config = CrawlerConfig::default().allow_domain_with_delay(
    "news.ycombinator.com",
    // add a delay between requests
    RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
);

let mut collector = Collector::new(HackernewsScraper::default(), config);

collector.crawler_mut().visit_with_state(
    "https://news.ycombinator.com/news",
    HackernewsState::Page(1),
);

while let Some(output) = collector.next().await {
    let post = output?;
    dbg!(post);
}
```
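Because the loop above polls the `Collector` as a stream with `.await`, it has to run inside an async runtime with a stream extension trait in scope. A minimal sketch of the surrounding scaffolding, assuming a tokio runtime and `futures::StreamExt` (both are assumptions about your setup, not requirements dictated by this crate):

```rust
use futures::StreamExt; // provides `next()` on the Collector stream

// Assumes a tokio runtime; any executor that can drive the
// Collector stream to completion works as well.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = CrawlerConfig::default();
    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
    Ok(())
}
```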
Re-exports
- `pub use crate::requests::RequestDelay;`
- `pub use crate::response::Response;`
- `pub use scraper;`
Modules
- `domain`
- `error`
- `requests`
- `response`
- `robots`
Structs
- `Collector`: controls the `Crawler` and forwards the responses to your `Scraper`
- `Crawler`: the crawler that is responsible for driving the requests to completion and providing the crawl response for the `Scraper`
- `CrawlerConfig`: configures a `Crawler`
- `Stats`: stats about sent requests and received responses
Traits
- `Scraper`: a trait that takes in successfully fetched responses, scrapes the valuable content from the response's HTML document, and provides the crawler with additional requests to visit to drive the scraper's state model to completion