crusty-core 0.2.2

Library for creating blazing fast and configurable web crawlers

Crusty-core - build your own web crawler!

  • multi-threaded && async on top of tokio
  • highly customizable filtering at every step - by status code/headers received, by downloaded page content, and by extracted links
  • built on top of hyper (HTTP/2 and gzip/deflate baked in)
  • rich content extraction with select
  • observable with tracing and custom metrics exposed to the user (e.g. HTML parsing duration, bytes sent/received)
  • lots of options, almost everything is configurable
  • applicable both for focused and broad crawling
  • scales with ease when you want to crawl millions/billions of domains
  • it's fast, fast, fast!

Install

Simply add this to your Cargo.toml

[dependencies]
crusty-core = "~0.2.2"
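
The examples below also use tokio and anyhow; a rough sketch of the full dependency section (the tokio feature flags and versions here are assumptions - adjust them to your project):

[dependencies]
crusty-core = "~0.2.2"
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
anyhow = "1"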

Example - crawl single website, collect information about TITLE tags

If you are in a hurry:

use crusty_core::prelude::*;

#[derive(Debug, Clone, Default)]
pub struct JobState {
    sum_title_len: usize
}

#[derive(Debug, Clone, Default)]
pub struct TaskState {
    title: String
}

pub struct DataExtractor {}
impl TaskExpander<JobState, TaskState> for DataExtractor {
    fn expand(&self, ctx: &mut JobCtx<JobState, TaskState>, _: &Task, _: &HttpStatus, doc: &Document) {
        let title = doc.find(Name("title")).next().map(|v| v.text());
        if let Some(title) = title {
            ctx.task_state.lock().unwrap().title = title.clone();
            ctx.job_state.lock().unwrap().sum_title_len += title.len();
        }
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let crawler = Crawler::new_default()?;

    let settings = config::CrawlingSettings::default();
    let rules = CrawlingRules::default().with_task_expander(|| DataExtractor{} );

    let job = Job::new("https://bash.im", settings, rules, JobState::default())?;
    for r in crawler.iter(job) {
        println!("- {}, task state: {:?}", r, r.context.task_state);
        if let JobStatus::Finished(_) = r.status {
            println!("final job state: {:?}", r.context.job_state.lock().unwrap());
        }
    }
    Ok(())
}

If you want to configure things further or control your imports more precisely:

use crusty_core::{
    ParserProcessor, CrawlingRules, CrawlingRulesOptions, Crawler, TaskExpander,
    types::{
        Job, JobCtx, Task, HttpStatus, JobStatus,
        select::predicate::Name, select::document::Document
    },
    config,
};

#[derive(Debug, Clone, Default)]
pub struct JobState {
    sum_title_len: usize
}

#[derive(Debug, Clone, Default)]
pub struct TaskState {
    title: String
}

pub struct DataExtractor {}
impl TaskExpander<JobState, TaskState> for DataExtractor {
    fn expand(&self, ctx: &mut JobCtx<JobState, TaskState>, _: &Task, _: &HttpStatus, doc: &Document) {
        let title = doc.find(Name("title")).next().map(|v| v.text());
        if let Some(title) = title {
            ctx.task_state.lock().unwrap().title = title.clone();
            ctx.job_state.lock().unwrap().sum_title_len += title.len();
        }
    }
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let concurrency_profile = config::ConcurrencyProfile::default();
    let pp = ParserProcessor::spawn(concurrency_profile, 1024 * 1024 * 32);

    let networking_profile = config::NetworkingProfile::default().resolve()?;
    let crawler = Crawler::new(networking_profile, &pp);

    let settings = config::CrawlingSettings::default();
    let rules_opt = CrawlingRulesOptions::default();
    let rules = CrawlingRules::new(rules_opt).with_task_expander(|| DataExtractor{} );

    let job = Job::new("https://bash.im", settings, rules, JobState::default())?;
    for r in crawler.iter(job) {
        println!("- {}, task state: {:?}", r, r.context.task_state);
        if let JobStatus::Finished(_) = r.status {
            println!("final job state: {:?}", r.context.job_state.lock().unwrap());
        }
    }

    Ok(pp.join().await??)
}
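
crusty-core is instrumented with tracing (see the feature list above), but you only see that output if a subscriber is installed before the crawler runs. A minimal sketch - the tracing-subscriber dependency and the chosen level are assumptions on my side, not something crusty-core mandates:

use tracing::Level;

// Install a global subscriber that prints tracing events to stdout.
// Call this at the top of main(), before creating the crawler.
fn init_tracing() {
    tracing_subscriber::fmt()
        .with_max_level(Level::INFO) // raise to Level::DEBUG for more detail
        .init();
}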

Notes

Please see the examples for more complicated usage scenarios. This crawler is more verbose than some others, but it allows extensive customization at every step.
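
For example, the same expand hook can pull additional data out of the parsed document with the select-based API used above. Here is a sketch reusing the JobState/TaskState types from the examples and assuming TaskState gained an extra link_count: usize field (that field is purely illustrative, not part of crusty-core):

pub struct LinkCountingExtractor {}
impl TaskExpander<JobState, TaskState> for LinkCountingExtractor {
    fn expand(&self, ctx: &mut JobCtx<JobState, TaskState>, _: &Task, _: &HttpStatus, doc: &Document) {
        // title extraction, exactly as in the examples above
        if let Some(title) = doc.find(Name("title")).next().map(|v| v.text()) {
            ctx.job_state.lock().unwrap().sum_title_len += title.len();
            ctx.task_state.lock().unwrap().title = title;
        }
        // additionally count outgoing links; select's Node::attr returns Option<&str>
        let links = doc.find(Name("a")).filter_map(|n| n.attr("href")).count();
        // link_count is an illustrative field assumed to be added to TaskState
        ctx.task_state.lock().unwrap().link_count = links;
    }
}

Plugging it in works the same way as before, e.g. CrawlingRules::default().with_task_expander(|| LinkCountingExtractor{}).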

If you are interested in broad web crawling, take a look at crusty - built entirely on top of crusty-core, it tries to tackle some of the challenges of broad web crawling.