voyager
With voyager you can easily extract structured data from websites.
Write your own crawler/scraper with Voyager, following a state machine model.
§Example
/// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}
§Implement the voyager::Scraper trait
A Scraper consists of two associated types:

- Output, the type the scraper eventually produces
- State, the type the scraper can drag along across several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response (see the sketch below).
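Reconstructed from the example impl further down, the trait looks roughly like this; the Sized bound and the concrete Result alias are assumptions, not voyager's exact definition:

// a sketch of the trait's shape; the Sized bound and the Result alias are assumptions
pub trait Scraper: Sized {
    type Output;
    type State;

    /// invoked after each received response
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>>;
}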
Based on the state attached to the response, you can supply the crawler with new URLs to visit, with or without a state attached to them.
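For example, from inside scrape you can queue follow-up requests either way. visit_with_state appears in the full example below; the plain visit call without a state is an assumed companion method on the Crawler:

// queue a follow-up request and tag it with a state variant
crawler.visit_with_state("https://news.ycombinator.com/item?id=1", HackernewsState::Post);
// queue a request without any state attached (assumed companion method)
crawler.visit("https://news.ycombinator.com/newest");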
Scraping is done with causal-agent/scraper.
impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }
                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
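The setup below calls HackernewsScraper::default(), so the scraper needs a Default impl that builds its selectors. A minimal sketch using Selector::parse from the scraper crate; the CSS selectors themselves are illustrative assumptions, not taken from the crate:

impl Default for HackernewsScraper {
    fn default() -> Self {
        // illustrative CSS selectors; adjust them to the actual page markup
        Self {
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse("td.title a").unwrap(),
            comment_selector: Selector::parse("div.comment").unwrap(),
            max_page: 1,
        }
    }
}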
§Setup and collect all the output
Configure the crawler via the CrawlerConfig (see the sketch after this list):

- Allow/block list of URLs
- Delays between requests
- Whether to respect the robots.txt rules
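A chaining sketch for the allow list and per-domain delays; it only reuses allow_domain_with_delay from the example further down and assumes the builder calls can be chained. The block list and the robots.txt switch are configured on CrawlerConfig as well, but their method names are not shown on this page:

// allow two domains, each with its own fixed delay between requests
let config = CrawlerConfig::default()
    .allow_domain_with_delay(
        "news.ycombinator.com",
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    )
    .allow_domain_with_delay(
        "old.reddit.com",
        RequestDelay::Fixed(std::time::Duration::from_millis(5_000)),
    );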
Feed your config and an instance of your scraper to the Collector that drives the Crawler and forwards the responses to your Scraper.
// only fulfill requests to `news.ycombinator.com`
let config = CrawlerConfig::default().allow_domain_with_delay(
    "news.ycombinator.com",
    // add a delay between requests
    RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
);

let mut collector = Collector::new(HackernewsScraper::default(), config);

collector.crawler_mut().visit_with_state(
    "https://news.ycombinator.com/news",
    HackernewsState::Page(1),
);

while let Some(output) = collector.next().await {
    let post = output?;
    dbg!(post);
}
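The loop above relies on the Collector being an async stream. A minimal sketch of the surrounding program, assuming a tokio runtime, futures::StreamExt for .next(), and the scraper types from the example above living in the same module; errors are matched explicitly instead of using ? to keep main's signature simple:

use futures::StreamExt; // brings `.next()` on the Collector stream into scope
use voyager::{Collector, CrawlerConfig, RequestDelay};

#[tokio::main]
async fn main() {
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );
    let mut collector = Collector::new(HackernewsScraper::default(), config);
    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        // each item wraps the scraper's Output in a Result
        match output {
            Ok(post) => {
                dbg!(post);
            }
            Err(err) => eprintln!("request failed: {err}"),
        }
    }
}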
§Re-exports

§Modules

§Structs
- Collector: controls the Crawler and forwards the successful requests to the Scraper, and reports the Scraper's Output back to the user.
- Crawler: the crawler that is responsible for driving the requests to completion and providing the crawl response for the Scraper.
- CrawlerConfig: configures a Collector and its Crawler.
- Stats: stats about sent requests and received responses.
§Enums

- RequestDelay: how to delay a request.
§Traits

- Scraper: a trait that takes in successfully fetched responses, scrapes the valuable content from the response's HTML document, and provides the crawler with additional requests to visit, driving the scraper's state model to completion.