halldyll-robots 0.1.0

robots.txt parser and compliance checker for the halldyll scraper

An RFC 9309-compliant robots.txt parser and checker.

Features

  • RFC 9309 Compliance: Full support for the robots.txt standard
  • Unavailable vs Unreachable: Handling per RFC 9309 (4xx "unavailable" = allow, 5xx "unreachable" = deny)
  • Safe Mode: Optional stricter handling of 401/403 as deny
  • Conditional GET: ETag/Last-Modified support for bandwidth savings
  • Request-rate: Support for this non-standard but common directive
  • Caching: In-memory cache with optional file persistence
  • Pattern Matching: Wildcards (*), end anchors ($), percent-encoding
  • UTF-8 BOM: Automatic stripping of BOM prefix
  • Observability: Detailed logging and statistics with min/max/avg metrics
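The Unavailable vs Unreachable and Safe Mode rules above can be sketched as a small decision function. This is a hypothetical helper written for illustration, not part of the crate's API:

```rust
// Hypothetical helper illustrating the RFC 9309 fetch-status rules: a 4xx
// "unavailable" robots.txt permits crawling, a 5xx "unreachable" one
// disallows it, and safe mode additionally treats 401/403 as deny.
fn allowed_after_fetch_failure(status: u16, safe_mode: bool) -> bool {
    match status {
        401 | 403 if safe_mode => false, // Safe Mode: auth errors deny crawling
        400..=499 => true,               // unavailable: crawler may proceed
        500..=599 => false,              // unreachable: assume complete disallow
        _ => true,                       // not a failure status; no restriction here
    }
}
```

Note that a plain 403 allows crawling by default (the RFC's "unavailable" rule) and only denies when safe mode is enabled.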

Example

use halldyll_robots::{RobotsChecker, RobotsConfig};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RobotsConfig::default();
    let checker = RobotsChecker::new(config);
    
    let url = Url::parse("https://example.com/some/path")?;
    let decision = checker.is_allowed(&url).await;
    
    if decision.allowed {
        println!("URL is allowed");
    } else {
        println!("URL is blocked: {:?}", decision.reason);
    }
    
    Ok(())
}
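To give a feel for the wildcard and end-anchor matching listed under Features, here is a minimal standalone sketch. It is not the crate's internal implementation, and it skips the percent-encoding normalization the crate also performs:

```rust
// Minimal sketch of RFC 9309 path matching, where `*` matches any character
// sequence and a trailing `$` anchors the pattern to the end of the path.
fn path_matches(pattern: &str, path: &str) -> bool {
    // Split off an end anchor, if present.
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(p) => (p, true),
        None => (pattern, false),
    };

    // Recursive byte-wise matcher (robots.txt paths are ASCII after
    // percent-encoding, so byte comparison is adequate for a sketch).
    fn matches(pat: &[u8], path: &[u8], anchored: bool) -> bool {
        match pat.split_first() {
            // Pattern exhausted: a prefix match succeeds unless anchored.
            None => !anchored || path.is_empty(),
            // `*`: try every possible split of the remaining path.
            Some((b'*', rest)) => (0..=path.len()).any(|i| matches(rest, &path[i..], anchored)),
            // Literal byte: must equal the next path byte exactly.
            Some((c, rest)) => path
                .split_first()
                .map_or(false, |(p, more)| p == c && matches(rest, more, anchored)),
        }
    }
    matches(pattern.as_bytes(), path.as_bytes(), anchored)
}
```

Under these rules, /fish* matches /fishheads/yummy.html, while /*.php$ matches /index.php but not /index.php?parameters.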