Struct RobotsChecker 

pub struct RobotsChecker { /* private fields */ }

Main robots.txt checker with caching and fetching.

This is the primary entry point for checking URLs against robots.txt. It combines fetching, caching, parsing, and matching into a single API.

§Example

use halldyll_robots::{RobotsChecker, RobotsConfig};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a checker with the default configuration.
    let config = RobotsConfig::default();
    let checker = RobotsChecker::new(config);

    // Check one URL; robots.txt is fetched on a cache miss.
    let url = Url::parse("https://example.com/some/path")?;
    let decision = checker.is_allowed(&url).await;

    if decision.allowed {
        println!("URL is allowed");
    } else {
        println!("URL is blocked: {:?}", decision.reason);
    }

    Ok(())
}

Implementations§

impl RobotsChecker

pub fn new(config: RobotsConfig) -> Self

Create a new robots.txt checker with the given configuration.

pub fn with_persistence(config: RobotsConfig, persist_dir: &str) -> Self

Create a checker with file-based cache persistence.

The cache will be saved to and loaded from the specified directory.

pub async fn is_allowed(&self, url: &Url) -> Decision

Check if a URL is allowed by robots.txt.

This is the main method for checking crawl permissions. It will:

  1. Extract the origin (scheme + authority) from the URL
  2. Check the cache for an existing policy
  3. Fetch robots.txt if not cached
  4. Match the URL path against the rules

§Returns

A Decision with allowed status and the reason for the decision.
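
The four steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crate's implementation: `Policy`, `split_origin`, and `check` are hypothetical names, the fetcher is a plain closure standing in for the HTTP request, and matching is a simple prefix test.

```rust
use std::collections::HashMap;

/// Hypothetical parsed policy: a list of disallowed path prefixes.
#[derive(Clone, Debug, Default)]
struct Policy {
    disallow: Vec<String>,
}

/// Step 1: split an absolute URL into (origin, path).
/// The origin is scheme + authority, e.g. "https://example.com".
fn split_origin(url: &str) -> Option<(String, String)> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    // The authority ends at the first '/', '?' or '#', or at the end of the string.
    let authority_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if authority_len == 0 {
        return None;
    }
    let origin = url[..scheme_end + authority_len].to_string();
    let tail = &rest[authority_len..];
    let path = if tail.starts_with('/') {
        tail.to_string()
    } else {
        format!("/{}", tail)
    };
    Some((origin, path))
}

/// Steps 2 to 4: look up the cached policy, fetch on a miss, then match the path.
/// Returns None when the input is not an absolute URL.
fn check<F>(cache: &mut HashMap<String, Policy>, url: &str, mut fetch: F) -> Option<bool>
where
    F: FnMut(&str) -> Policy,
{
    let (origin, path) = split_origin(url)?;
    let policy = cache
        .entry(origin.clone())
        .or_insert_with(|| fetch(&origin));
    let blocked = policy
        .disallow
        .iter()
        .any(|prefix| path.starts_with(prefix.as_str()));
    Some(!blocked)
}
```

Keying the cache by origin means two URLs on the same host share one fetch, which is why the second lookup in a crawl is cheap.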

pub async fn is_path_allowed(&self, origin: &Url, path: &str) -> Decision

Check if a path is allowed for a given origin.

Use this when you already have the origin URL and want to check multiple paths without re-fetching robots.txt.
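
For illustration, here is one way a policy that is already in memory can be matched against several paths. The sketch assumes the longest-match rule from RFC 9309 (the most specific matching rule wins, and Allow wins a tie); this page does not state which matching strategy the crate uses. `Rule` and `path_allowed` are hypothetical names, and the `*` and `$` wildcards are left out.

```rust
/// Hypothetical rule: a path prefix plus whether it allows or disallows.
struct Rule {
    prefix: &'static str,
    allow: bool,
}

/// Returns true when `path` is allowed under `rules`.
/// The longest matching prefix wins; on equal length Allow wins;
/// a path that matches no rule is allowed.
fn path_allowed(rules: &[Rule], path: &str) -> bool {
    rules
        .iter()
        // An empty pattern matches nothing.
        .filter(|r| !r.prefix.is_empty() && path.starts_with(r.prefix))
        // Tuples compare by length first, then false < true, so Allow wins ties.
        .max_by_key(|r| (r.prefix.len(), r.allow))
        .map(|r| r.allow)
        .unwrap_or(true)
}
```

Because the rules are borrowed, the same slice can be reused for any number of paths on one origin without another fetch.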

pub async fn get_policy(&self, url: &Url) -> RobotsPolicy

Get the robots.txt policy for a URL (cached or fetched).

This fetches and parses the robots.txt if not already cached, and returns the parsed policy. Uses conditional GET (If-None-Match, If-Modified-Since) for cache revalidation to save bandwidth.
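
As a rough sketch of what conditional revalidation involves: the validators from the previous response are sent back as `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` answer means the cached policy can be kept. The types and function names below are hypothetical and only model the header logic; they are not the crate's internals.

```rust
/// Validators remembered from the previous robots.txt response (hypothetical type).
#[derive(Default)]
struct Validators {
    etag: Option<String>,
    last_modified: Option<String>,
}

/// Builds the conditional request headers for a revalidation.
/// With no stored validators the request is an ordinary, unconditional GET.
fn conditional_headers(v: &Validators) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &v.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(date) = &v.last_modified {
        headers.push(("If-Modified-Since", date.clone()));
    }
    headers
}

/// What to do with the cached policy once the server has answered.
#[derive(Debug, PartialEq)]
enum Revalidation {
    /// 304: the body was not resent, so the cached policy is still current.
    KeepCached,
    /// 200: a new body arrived and must be parsed.
    Replace,
}

/// Maps a status code to a revalidation outcome; other codes are left to the caller.
fn on_status(status: u16) -> Option<Revalidation> {
    match status {
        304 => Some(Revalidation::KeepCached),
        200 => Some(Revalidation::Replace),
        _ => None,
    }
}
```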

pub async fn get_crawl_delay(&self, url: &Url) -> Option<Duration>

Get crawl delay for a URL.

Returns the effective delay between requests, using crawl-delay if set, otherwise request-rate, otherwise None.
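
The fallback order can be written as a short `Option` chain. In this sketch `Rate` is a hypothetical stand-in for the crate's `RequestRate` (whose fields are not shown on this page), and converting a rate to a delay as `window / requests` is an assumption.

```rust
use std::time::Duration;

/// Hypothetical shape of a request-rate directive: `requests` per `window`.
struct Rate {
    requests: u32,
    window: Duration,
}

/// crawl-delay if set, otherwise a delay derived from request-rate, otherwise None.
fn effective_delay(crawl_delay: Option<Duration>, rate: Option<&Rate>) -> Option<Duration> {
    crawl_delay.or_else(|| {
        rate
            // A rate of zero requests carries no usable delay.
            .filter(|r| r.requests > 0)
            .map(|r| r.window / r.requests)
    })
}
```

For example, a rate of 2 requests per 10 seconds yields a 5 second delay when no crawl-delay is present.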

pub async fn get_request_rate(&self, url: &Url) -> Option<RequestRate>

Get request rate for a URL.

Returns the request-rate directive if specified in robots.txt.

pub async fn get_sitemaps(&self, url: &Url) -> Vec<String>

Get sitemaps listed in robots.txt.

Returns all Sitemap directives found in the robots.txt.
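
To show what collecting Sitemap directives amounts to, here is a minimal line-based sketch. `sitemaps` is a hypothetical helper, not the crate's parser; it matches the key case-insensitively and, as a simplification, treats everything after `#` as a comment.

```rust
/// Collects the values of all `Sitemap:` lines in a robots.txt body.
fn sitemaps(robots_txt: &str) -> Vec<String> {
    robots_txt
        .lines()
        .filter_map(|line| {
            // Drop a trailing comment, then split "key: value" at the first colon.
            let line = line.split('#').next().unwrap_or("");
            let (key, value) = line.split_once(':')?;
            if key.trim().eq_ignore_ascii_case("sitemap") {
                let value = value.trim();
                // Skip directives with an empty value.
                (!value.is_empty()).then(|| value.to_string())
            } else {
                None
            }
        })
        .collect()
}
```

Sitemap lines are independent of user-agent groups, so the sketch scans the whole file rather than a single group.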

pub async fn get_effective_rules(&self, url: &Url) -> EffectiveRules

Get effective rules for a URL.

Returns the merged rules that apply to this user-agent.

pub async fn diagnose(&self, url: &Url) -> RobotsDiagnostics

Get detailed diagnostics for a URL check.

Useful for debugging why a URL was allowed or blocked.

pub fn cache_stats(&self) -> CacheStatsSnapshot

Get cache statistics.

pub fn fetch_stats(&self) -> FetchStatsSnapshot

Get fetch statistics.

pub fn clear_cache(&self)

Clear all cached policies.

pub fn evict_expired(&self) -> usize

Evict expired cache entries.

Returns the number of entries evicted.
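
The eviction-and-count pattern can be illustrated with a plain `HashMap`. This is a sketch only: `Entry` and the free function below are hypothetical, and it takes `&mut` for simplicity, whereas the method above takes `&self` and so presumably relies on some form of interior mutability.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry that only records when it expires.
struct Entry {
    expires_at: Instant,
}

/// Removes every entry whose expiry is at or before `now`
/// and returns how many entries were removed.
fn evict_expired(cache: &mut HashMap<String, Entry>, now: Instant) -> usize {
    let before = cache.len();
    cache.retain(|_, entry| entry.expires_at > now);
    before - cache.len()
}
```

Passing `now` in as a parameter keeps the function deterministic, which makes the expiry boundary easy to test.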

pub async fn save_cache(&self) -> Result<usize>

Save cache to disk (if persistence enabled).

pub async fn load_cache(&self) -> Result<usize>

Load cache from disk (if persistence enabled).

pub fn cached_domains(&self) -> Vec<String>

Get list of cached domains.

pub fn config(&self) -> &RobotsConfig

Get the configuration.

pub async fn prefetch(&self, urls: &[Url])

Prefetch robots.txt for a list of URLs.

This is useful for warming the cache before crawling.
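
Since robots.txt is per origin, warming the cache for a URL list needs at most one fetch per distinct origin. The sketch below shows that reduction with the standard library; `origin_of` and `robots_urls` are hypothetical helpers, and whether the crate deduplicates this way is an assumption. Hosts are compared as written, without case normalization.

```rust
use std::collections::BTreeSet;

/// Returns the scheme + authority part of an absolute URL, if any.
fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    // The authority ends at the first '/', '?' or '#', or at the end of the string.
    let authority_len = url[scheme_end..]
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(url.len() - scheme_end);
    (authority_len > 0).then(|| &url[..scheme_end + authority_len])
}

/// Returns the robots.txt URLs to fetch, one per distinct origin, in sorted order.
/// Inputs that are not absolute URLs are skipped.
fn robots_urls(urls: &[&str]) -> Vec<String> {
    let origins: BTreeSet<&str> = urls.iter().copied().filter_map(origin_of).collect();
    origins
        .into_iter()
        .map(|origin| format!("{}/robots.txt", origin))
        .collect()
}
```

Note that `http://example.com` and `https://example.com` are different origins and therefore get separate robots.txt fetches.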

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.