halldyll_robots::checker

Struct RobotsChecker

pub struct RobotsChecker { /* private fields */ }

Main robots.txt checker with caching and fetching.

This is the primary entry point for checking URLs against robots.txt. It combines fetching, caching, parsing, and matching into a single API.

§Example

use halldyll_robots::{RobotsChecker, RobotsConfig};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a checker with the default configuration.
    let config = RobotsConfig::default();
    let checker = RobotsChecker::new(config);

    // Check a single URL; robots.txt is fetched if it is not cached yet.
    let url = Url::parse("https://example.com/some/path")?;
    let decision = checker.is_allowed(&url).await;

    // The decision carries both the verdict and the reason behind it.
    if decision.allowed {
        println!("URL is allowed");
    } else {
        println!("URL is blocked: {:?}", decision.reason);
    }

    Ok(())
}

Implementations§

impl RobotsChecker

pub fn new(config: RobotsConfig) -> Self

Create a new robots.txt checker with the given configuration.

pub fn with_persistence(config: RobotsConfig, persist_dir: &str) -> Self

Create a checker with file-based cache persistence.

The cache will be saved to and loaded from the specified directory.

pub async fn is_allowed(&self, url: &Url) -> Decision

Check if a URL is allowed by robots.txt.

This is the main method for checking crawl permissions. It will:

1. Extract the origin (scheme + authority) from the URL
2. Check the cache for an existing policy
3. Fetch robots.txt if not cached
4. Match the URL path against the rules

§Returns

A Decision with allowed status and the reason for the decision.
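
The four steps listed for is_allowed can be sketched with the standard library alone. This is a simplified illustration, not the crate's implementation: the Policy struct, the origin_of helper, the fetch closure, and plain prefix matching are all assumptions made for the example, and the free function is_allowed below only mirrors the method's name.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified policy: a list of disallowed path prefixes.
#[derive(Clone, Debug, Default)]
struct Policy {
    disallow: Vec<String>,
}

/// Step 1: extract the origin (scheme + authority) from an absolute URL string.
fn origin_of(url: &str) -> Option<&str> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    // The authority ends at the first '/', '?' or '#' after the scheme.
    let authority_len = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    if authority_len == 0 {
        return None;
    }
    Some(&url[..scheme_end + authority_len])
}

/// Steps 2 to 4: consult the cache, fetch on a miss, then match the path.
/// Returns None when no origin can be extracted from `url`.
fn is_allowed<F>(cache: &mut HashMap<String, Policy>, fetch: F, url: &str) -> Option<bool>
where
    F: FnOnce(&str) -> Policy,
{
    let origin = origin_of(url)?;
    let path = &url[origin.len()..];
    let path = if path.is_empty() { "/" } else { path };

    // `fetch` runs only when the origin has no cached policy yet.
    let policy = cache
        .entry(origin.to_string())
        .or_insert_with(|| fetch(origin));

    // Assumed matching rule: a non-empty disallowed prefix blocks the path.
    let blocked = policy
        .disallow
        .iter()
        .any(|prefix| !prefix.is_empty() && path.starts_with(prefix.as_str()));
    Some(!blocked)
}
```

Keying the cache by origin rather than by full URL means every path on the same host reuses a single fetched policy.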

pub async fn is_path_allowed(&self, origin: &Url, path: &str) -> Decision

Check if a path is allowed for a given origin.

Use this when you already have the origin URL and want to check multiple paths without re-fetching robots.txt.

pub async fn get_policy(&self, url: &Url) -> RobotsPolicy

Get the robots.txt policy for a URL (cached or fetched).

This fetches and parses the robots.txt if not already cached, and returns the parsed policy. Uses conditional GET (If-None-Match, If-Modified-Since) for cache revalidation to save bandwidth.
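
As a rough illustration of conditional-GET revalidation, the sketch below builds the If-None-Match and If-Modified-Since headers from validators remembered from an earlier response, then interprets the status code of the reply. The Validators struct, the Revalidation enum, and both function names are hypothetical; how the crate stores validators or handles other status codes is not documented here.

```rust
/// Hypothetical validators remembered from a previous robots.txt response.
#[derive(Clone, Debug, Default)]
struct Validators {
    etag: Option<String>,
    last_modified: Option<String>,
}

/// Build the conditional request headers used for revalidation.
/// Each header is sent only if the matching validator is known.
fn conditional_headers(v: &Validators) -> Vec<(&'static str, String)> {
    let mut headers = Vec::new();
    if let Some(etag) = &v.etag {
        headers.push(("If-None-Match", etag.clone()));
    }
    if let Some(date) = &v.last_modified {
        headers.push(("If-Modified-Since", date.clone()));
    }
    headers
}

/// What to do with the cached copy once the server has answered.
#[derive(Debug, PartialEq)]
enum Revalidation {
    /// 304 Not Modified: no body was sent, the cached policy is still current.
    UseCached,
    /// 200 OK: a new body was sent and should replace the cached policy.
    ReplaceCached,
    /// Any other status: left to the caller's error handling in this sketch.
    Other(u16),
}

fn interpret_status(status: u16) -> Revalidation {
    match status {
        304 => Revalidation::UseCached,
        200 => Revalidation::ReplaceCached,
        other => Revalidation::Other(other),
    }
}
```

The bandwidth saving comes from the 304 case: the server confirms the cached copy without resending the file.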

pub async fn get_crawl_delay(&self, url: &Url) -> Option<Duration>

Get crawl delay for a URL.

Returns the effective delay between requests: the crawl-delay value if set, otherwise a delay derived from request-rate, otherwise None.
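
The fallback order can be sketched as below. The Rate struct is a hypothetical stand-in for a parsed request-rate directive (the fields of the crate's RequestRate are not shown on this page), and dividing the window by the request count is an assumed conversion, not necessarily the one the crate uses.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a request-rate directive:
/// at most `requests` requests per `seconds` seconds.
#[derive(Clone, Copy, Debug)]
struct Rate {
    requests: u32,
    seconds: u32,
}

/// Prefer an explicit crawl delay, fall back to a delay derived from the
/// request rate, and return None when neither directive is present.
fn effective_delay(crawl_delay: Option<Duration>, rate: Option<Rate>) -> Option<Duration> {
    crawl_delay.or_else(|| {
        rate
            // A rate of zero requests has no meaningful interval; ignore it.
            .filter(|r| r.requests > 0)
            // Spread the allowed requests evenly over the window.
            .map(|r| Duration::from_secs_f64(f64::from(r.seconds) / f64::from(r.requests)))
    })
}
```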

pub async fn get_request_rate(&self, url: &Url) -> Option<RequestRate>

Get request rate for a URL.

Returns the request-rate directive if specified in robots.txt.

pub async fn get_sitemaps(&self, url: &Url) -> Vec<String>

Get sitemaps listed in robots.txt.

Returns all Sitemap directives found in the robots.txt.

pub async fn get_effective_rules(&self, url: &Url) -> EffectiveRules

Get effective rules for a URL.

Returns the merged rules that apply to this user-agent.

pub async fn diagnose(&self, url: &Url) -> RobotsDiagnostics

Get detailed diagnostics for a URL check.

Useful for debugging why a URL was allowed or blocked.

pub fn cache_stats(&self) -> CacheStatsSnapshot

Get cache statistics.

pub fn fetch_stats(&self) -> FetchStatsSnapshot

Get fetch statistics.

pub fn clear_cache(&self)

Clear all cached policies.

pub fn evict_expired(&self) -> usize

Evict expired cache entries.

Returns the number of entries evicted.
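
A minimal sketch of this kind of eviction, assuming a cache that maps each origin to an entry with an expiry instant. The Entry struct and the explicit now parameter are illustrative assumptions; the crate's internal cache layout is private.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Hypothetical cache entry: the stored robots.txt text plus its expiry time.
struct Entry {
    policy_text: String,
    expires_at: Instant,
}

/// Remove every entry whose expiry time is not after `now` and
/// return how many entries were removed.
fn evict_expired(cache: &mut HashMap<String, Entry>, now: Instant) -> usize {
    let before = cache.len();
    // `retain` keeps only the entries for which the closure returns true.
    cache.retain(|_, entry| entry.expires_at > now);
    before - cache.len()
}
```

Passing now as a parameter instead of calling Instant::now() inside the function keeps the sketch deterministic and easy to test.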

pub async fn save_cache(&self) -> Result<usize>

Save the cache to disk, if persistence is enabled.

pub async fn load_cache(&self) -> Result<usize>

Load the cache from disk, if persistence is enabled.

pub fn cached_domains(&self) -> Vec<String>

Get the list of cached domains.

pub fn config(&self) -> &RobotsConfig

Get the configuration.

pub async fn prefetch(&self, urls: &[Url])

Prefetch robots.txt for a list of URLs.

This is useful for warming the cache before crawling.

Auto Trait Implementations§

impl Freeze for RobotsChecker

impl !RefUnwindSafe for RobotsChecker

impl Send for RobotsChecker

impl Sync for RobotsChecker

impl Unpin for RobotsChecker

impl !UnwindSafe for RobotsChecker

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.