Crate scrappy_do

A crate to asynchronously crawl web pages. This crate was largely inspired by the Python library Scrapy, but updated to take advantage of Rust’s safety, performance, and expressiveness.

About

This crate is built around the idea of transforming HTTP responses into scraped Items through dynamically linked handlers. The caller defines the handlers and the crate takes care of the scheduling.

Basic Example

The following example shows the most basic handler chain possible:

// Needed for the yield keyword to work
#![feature(generators)]

use futures::stream::StreamExt; // Provides friendly methods for streams
use reqwest::{Client, Response};
use scrappy_do::{handle, wrap};
use url::Url;

// This is what we are trying to create from the web pages
#[derive(Debug)]
struct SomeItem;

// This tracks metadata we find useful during scraping.
#[derive(Debug)]
struct SomeContext;

#[handle(item = SomeItem)]
fn handler_foo(
    client: Client,
    response: Response,
    context: SomeContext,
) {

    // Process the response....

    // Return some scraped item
    yield SomeItem;
}

// Sets up multithreaded runtime
#[tokio::main]
async fn main() {
    // Build the HTTP client
    let client = reqwest::Client::builder()
        .build()
        .unwrap();

    // Build the spider
    let spider = scrappy_do::Spider::new(client.clone());

    let items = spider
        // A web requires an initial address, handler, and context in order to be created. All
        // other configuration is optional.
        .web()
        .start(
            client
                .get(Url::parse("http://somedomain.toscrape.com").unwrap())
                .build()
                .unwrap(),
        )
        .handler(wrap!(handler_foo))
        .context(SomeContext)
        .build()
        .crawl()
        .await;

    // The stream must be pinned to the stack to iterate over it
    tokio::pin!(items);

    // Process the items scraped.
    while let Some(item) = items.next().await {
        println!("{:?}", item);
    }
}

Chaining Example

This example shows a slightly more complicated chain in which one handler invokes another. The important thing to note is that the caller doesn’t change the initial chain invocation at all.

// Needed for the yield keyword to work
#![feature(generators)]

use futures::stream::StreamExt; // Provides friendly methods for streams
use reqwest::{Client, Response};
use scrappy_do::{handle, wrap};
use url::Url;

// This is what we are trying to create from the web pages
#[derive(Debug)]
struct SomeItem;

// This tracks metadata we find useful during scraping.
#[derive(Debug)]
struct SomeContext;

// The initial handler
#[handle(item = SomeItem)]
fn handler_foo(
    client: Client,
    response: Response,
    context: SomeContext,
) {

    // Process the response....

    let callback = scrappy_do::Callback::new(
        wrap!(handler_bar),
        client
            .get(response.url().clone())
            .build()
            .unwrap(),
        context,
    );

    // Chain the next callback
    yield callback;

    // Return some scraped item
    yield SomeItem;
}

// Called by handler_foo
#[handle(item = SomeItem)]
fn handler_bar(
    client: Client,
    response: Response,
    context: SomeContext,
) {

    // Process the response....

    // Return some scraped item
    yield SomeItem;
}

// Sets up multithreaded runtime
#[tokio::main]
async fn main() {
    // Build the HTTP client
    let client = reqwest::Client::builder()
        .build()
        .unwrap();

    // Build the spider
    let spider = scrappy_do::Spider::new(client.clone());

    let items = spider
        // A web requires an initial address, handler, and context in order to be created. All
        // other configuration is optional.
        .web()
        .start(
            client
                .get(Url::parse("http://somedomain.toscrape.com").unwrap())
                .build()
                .unwrap(),
        )
        .handler(wrap!(handler_foo))
        .context(SomeContext)
        .build()
        .crawl()
        .await;

    // The stream must be pinned to the stack to iterate over it
    tokio::pin!(items);

    // Process the items scraped.
    while let Some(item) = items.next().await {
        println!("{:?}", item);
    }
}

Ultimately the chain can be as complicated or as simple as the caller wants. It can even be never-ending if need be, and the Web will happily keep chugging along, processing handlers as they come.
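
For instance, a handler can re-queue itself. The following is a minimal sketch using the same API as the chaining example above; next_page_url is a hypothetical helper the caller would implement (for example, by parsing pagination links out of the page).

// A handler that chains back into itself, so the crawl continues for as
// long as new pages keep appearing. `next_page_url` is a hypothetical
// helper that returns the next page to fetch, if any.
#[handle(item = SomeItem)]
fn handler_recursive(
    client: Client,
    response: Response,
    context: SomeContext,
) {
    // Return an item scraped from the current page
    yield SomeItem;

    // Queue this same handler against the next page, if there is one
    if let Some(next) = next_page_url(&response) {
        yield scrappy_do::Callback::new(
            wrap!(handler_recursive),
            client.get(next).build().unwrap(),
            context,
        );
    }
}

The initial invocation in main stays exactly the same; only the handler passed to wrap! changes.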

Modules

Macros

wrap: Wraps a function in a concrete Handler struct.

Structs

Callback: Transforms a Request into Indeterminates.

Spider: Creates webs to be used to asynchronously and concurrently crawl a webpage. Internal state is lightweight, so it is unnecessary to wrap it in Arc or Rc. Multithreading can be achieved by using the multithreaded tokio runtime (see the sketch after this list).

Web: A Web defines how to process HTML pages.

WebBuilder: A WebBuilder can be used to create a Web with custom behavior.
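
Because the Spider’s internal state is lightweight, scaling the crawl out is mostly a matter of runtime configuration. As a minimal sketch (this is standard tokio configuration, not an API of this crate), the multithreaded runtime can be selected explicitly:

// Standard tokio configuration: explicitly pick the multi-threaded runtime
// and the number of worker threads. The crawl code itself is unchanged from
// the examples above.
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    // Build the client, spider, and web exactly as shown in the examples above.
}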

Enums

Indeterminate: Represents the current calculation state.

Traits

Handler: Converts HTTP responses into items or HTTP requests.

Attribute Macros

handle: Attribute to generate a handler function.