Crate voyager


With voyager you can easily extract structured data from websites.

Write your own crawler/scraper with voyager following a state machine model.

Example

use voyager::scraper::Selector;
use url::Url;

/// Declare your scraper, with all the selectors etc.
struct HackernewsScraper {
    post_selector: Selector,
    author_selector: Selector,
    title_selector: Selector,
    comment_selector: Selector,
    max_page: usize,
}

/// The state model
#[derive(Debug)]
enum HackernewsState {
    Page(usize),
    Post,
}

/// The output the scraper should eventually produce
#[derive(Debug)]
struct Entry {
    author: String,
    url: Url,
    link: Option<String>,
    title: String,
}

Implement the voyager::Scraper trait

A Scraper consists of two associated types:

  • Output, the type the scraper eventually produces
  • State, the type the scraper can attach to requests and carry along across the several requests that eventually lead to an Output

and the scrape callback, which is invoked after each received response.

Based on the state attached to the response, you can supply the crawler with new URLs to visit, with or without a state attached to them.
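
For example, both variants are available on the crawler inside scrape (a minimal sketch; the state-less visit counterpart is assumed alongside visit_with_state, and the URLs are illustrative):

// visit a url without attaching any state
crawler.visit("https://news.ycombinator.com/newest");

// visit a url and attach a state that is handed back with its response
crawler.visit_with_state(
    "https://news.ycombinator.com/news?p=2",
    HackernewsState::Page(2),
);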

Scraping is done with causal-agent/scraper.
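
The selectors themselves can be built with scraper's Selector::parse, for instance in a Default impl so that HackernewsScraper::default() works in the setup below (a minimal sketch; the concrete CSS selectors and the max_page value are assumptions):

use voyager::scraper::Selector;

impl Default for HackernewsScraper {
    fn default() -> Self {
        Self {
            // NOTE: the CSS selectors below are illustrative assumptions
            post_selector: Selector::parse("tr.athing").unwrap(),
            author_selector: Selector::parse("a.hnuser").unwrap(),
            title_selector: Selector::parse(".titleline > a").unwrap(),
            comment_selector: Selector::parse(".comment").unwrap(),
            max_page: 1,
        }
    }
}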

impl Scraper for HackernewsScraper {
    type Output = Entry;
    type State = HackernewsState;

    /// do your scraping
    fn scrape(
        &mut self,
        response: Response<Self::State>,
        crawler: &mut Crawler<Self>,
    ) -> Result<Option<Self::Output>> {
        let html = response.html();

        if let Some(state) = response.state {
            match state {
                HackernewsState::Page(page) => {
                    // find all entries
                    for id in html
                        .select(&self.post_selector)
                        .filter_map(|el| el.value().attr("id"))
                    {
                        // submit an url to a post
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/item?id={}", id),
                            HackernewsState::Post,
                        );
                    }
                    if page < self.max_page {
                        // queue in next page
                        crawler.visit_with_state(
                            &format!("https://news.ycombinator.com/news?p={}", page + 1),
                            HackernewsState::Page(page + 1),
                        );
                    }
                }

                HackernewsState::Post => {
                    // scrape the entry
                    let entry = Entry {
                        // ...
                    };
                    return Ok(Some(entry));
                }
            }
        }

        Ok(None)
    }
}
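
The Post arm above elides the actual field extraction. A sketch of what it could look like with the selectors declared on the scraper (the response_url field name and the text handling are assumptions, not confirmed API details):

// inside the `HackernewsState::Post` arm
let author = html
    .select(&self.author_selector)
    .next()
    .map(|el| el.text().collect::<String>())
    .unwrap_or_default();

let title = html
    .select(&self.title_selector)
    .next()
    .map(|el| el.text().collect::<String>())
    .unwrap_or_default();

let entry = Entry {
    author,
    // assumed: `response_url` carries the final URL of this response
    url: response.response_url.clone(),
    link: None,
    title,
};
return Ok(Some(entry));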

Set up and collect all the output

Configure the crawler via CrawlerConfig:

  • Allow/block lists of URLs
  • Delays between requests
  • Whether to respect the robots.txt rules

Feed your config and an instance of your scraper to the Collector, which drives the Crawler and forwards the responses to your Scraper.

    // only fulfill requests to `news.ycombinator.com`
    let config = CrawlerConfig::default().allow_domain_with_delay(
        "news.ycombinator.com",
        // add a delay between requests
        RequestDelay::Fixed(std::time::Duration::from_millis(2_000)),
    );

    let mut collector = Collector::new(HackernewsScraper::default(), config);

    collector.crawler_mut().visit_with_state(
        "https://news.ycombinator.com/news",
        HackernewsState::Page(1),
    );

    while let Some(output) = collector.next().await {
        let post = output?;
        dbg!(post);
    }
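
The next() calls require an async context and a Stream implementation in scope; a sketch of the surrounding scaffolding (tokio as the runtime, futures' StreamExt, and an anyhow-style Result are assumptions):

    use futures::StreamExt;
    use voyager::{Collector, CrawlerConfig};

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let config = CrawlerConfig::default();
        let mut collector = Collector::new(HackernewsScraper::default(), config);

        collector.crawler_mut().visit_with_state(
            "https://news.ycombinator.com/news",
            HackernewsState::Page(1),
        );

        // `Collector` implements `Stream`, yielding each `Output` the scraper produces
        while let Some(output) = collector.next().await {
            let post = output?;
            dbg!(post);
        }
        Ok(())
    }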

Re-exports

pub use crate::response::Response;
pub use scraper;

Structs

Collector
Controls the Crawler, forwards the successful requests to the Scraper, and reports the Scraper’s Output back to the user.

Crawler
The crawler that is responsible for driving the requests to completion and providing the crawl response for the Scraper.

CrawlerConfig
Configures a Collector and its Crawler.

Stats about sent requests and received responses.

Enums

RequestDelay
How to delay a request.

Traits

Scraper
A trait that takes in successfully fetched responses, scrapes the valuable content from the response’s HTML document, and provides the crawler with additional requests to visit in order to drive the scraper’s state model to completion.