Polite behavior for web crawlers.
A web crawler can use this crate’s Limiter to honor robots.txt files on servers. This
helps your spider be a good Internet citizen and avoid making a nuisance of itself and getting
rate-limited.
§Usage
    use netiquette::Limiter;

    // Create a reqwest::Client to fetch web content.
    let client = reqwest::Client::builder()
        .user_agent(MY_USER_AGENT)
        .build()
        .unwrap();

    let limiter = Limiter::new(client.clone(), MY_USER_AGENT.to_string());
    for url in urls {
        match limiter.acquire(&url).await {
            Ok(_permit) => handle_web_page(client.get(url).send().await),
            Err(err) => eprintln!("can't crawl {url} - {err}"),
        }
    }

Of course, in a real spider, many tasks can fetch and process web pages concurrently. There can
be thousands of HTTP requests in flight at a time. The purpose of Limiter is to slow down
requests that would hit the same host concurrently or too frequently.
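To illustrate the concurrent case, a spider might share one Limiter across spawned tasks. This is a sketch only: the Arc-based sharing, the tokio/reqwest setup, and the page-processing stub are assumptions for illustration, not APIs defined by this crate.

```rust
// Sketch: sharing one Limiter across tokio tasks via Arc (assumes
// Limiter is Send + Sync; the user agent and URL list are placeholders).
use std::sync::Arc;

use netiquette::Limiter;

async fn crawl(urls: Vec<String>) {
    let client = reqwest::Client::builder()
        .user_agent("my-crawler/0.1")
        .build()
        .unwrap();
    // One shared limiter coordinates every task that hits the same host.
    let limiter = Arc::new(Limiter::new(client.clone(), "my-crawler/0.1".to_string()));

    let mut tasks = Vec::new();
    for url in urls {
        let limiter = Arc::clone(&limiter);
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            // acquire() suspends only this task, not the whole runtime,
            // until fetching `url` would be polite.
            match limiter.acquire(&url).await {
                Ok(_permit) => {
                    let _response = client.get(&url).send().await;
                    // ... process the page while holding the permit ...
                }
                Err(err) => eprintln!("can't crawl {url} - {err}"),
            }
        }));
    }
    for task in tasks {
        let _ = task.await;
    }
}
```

Cloning the reqwest::Client is cheap (it is a handle to a shared connection pool), so each task can own a copy while the limiter itself is shared through the Arc.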
Structs§
- Error: The error type used by Limiter::acquire.
- Limiter: Rate-limiter for web crawlers.
- Permit: Permit to fetch a URL, returned by Limiter::acquire.
- Url: A parsed URL record.
Type Aliases§
- Result: A result alias where the error type is netiquette::Error.