
Crate netiquette


Polite behavior for web crawlers.

A web crawler can use this crate's Limiter to honor robots.txt files on servers. This helps your spider be a good Internet citizen, avoiding both nuisance traffic and server-side rate limiting.

§Usage

use netiquette::Limiter;

// Identify your crawler to servers; this string is also matched
// against robots.txt rules. The value here is just an example.
const MY_USER_AGENT: &str = "my-crawler/1.0";

// Create a reqwest::Client to fetch web content.
let client = reqwest::Client::builder()
    .user_agent(MY_USER_AGENT)
    .build()
    .unwrap();

let limiter = Limiter::new(client.clone(), MY_USER_AGENT.to_string());
// `urls` is whatever collection of URLs your crawler has queued.
for url in urls {
    match limiter.acquire(&url).await {
        // Hold the permit while fetching; it is released when dropped.
        Ok(_permit) => handle_web_page(client.get(url).send().await),
        Err(err) => eprintln!("can't crawl {url} - {err}"),
    }
}

Of course, in a real spider, many tasks can fetch and process web pages concurrently. There can be thousands of HTTP requests in flight at a time. The purpose of Limiter is to slow down requests that would hit the same host concurrently or too frequently.
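A concurrent spider might look like the sketch below. This is an illustrative outline, not part of the crate's documented API: it assumes `urls` is a list of `String`s, that a Limiter can be shared across tasks behind an `Arc`, and that `MY_USER_AGENT` is a user-defined constant.

```rust
use std::sync::Arc;

use netiquette::Limiter;

const MY_USER_AGENT: &str = "my-crawler/1.0"; // hypothetical User-Agent

#[tokio::main]
async fn main() {
    let client = reqwest::Client::builder()
        .user_agent(MY_USER_AGENT)
        .build()
        .unwrap();

    // Share one Limiter across every task so per-host limits apply globally.
    let limiter = Arc::new(Limiter::new(client.clone(), MY_USER_AGENT.to_string()));

    let urls: Vec<String> = vec![/* ... queued URLs ... */];
    let mut tasks = Vec::new();
    for url in urls {
        let limiter = Arc::clone(&limiter);
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            // acquire() suspends only this task, not the runtime, so many
            // fetches can be pending while each host is politely throttled.
            match limiter.acquire(&url).await {
                Ok(_permit) => {
                    // Fetch and process the page while the permit is held.
                    let _response = client.get(url.as_str()).send().await;
                }
                Err(err) => eprintln!("can't crawl {url} - {err}"),
            }
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}
```

Because all tasks go through the same shared Limiter, requests to different hosts proceed in parallel while requests to the same host are serialized or delayed.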

Structs§

Error: The error type used by Limiter::acquire.
Limiter: Rate-limiter for web crawlers.
Permit: Permit to fetch a URL, returned by Limiter::acquire.
Url: A parsed URL record.

Type Aliases§

Result: A result alias where the error type is netiquette::Error.