pub struct RobotsChecker { /* private fields */ }
Main robots.txt checker with caching and fetching.
This is the primary entry point for checking URLs against robots.txt. It combines fetching, caching, parsing, and matching into a single API.
§Example
use halldyll_robots::{RobotsChecker, RobotsConfig};
use url::Url;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = RobotsConfig::default();
    let checker = RobotsChecker::new(config);

    let url = Url::parse("https://example.com/some/path")?;
    let decision = checker.is_allowed(&url).await;

    if decision.allowed {
        println!("URL is allowed");
    } else {
        println!("URL is blocked: {:?}", decision.reason);
    }
    Ok(())
}

§Implementations
impl RobotsChecker
pub fn new(config: RobotsConfig) -> Self
Create a new robots.txt checker with the given configuration.
pub fn with_persistence(config: RobotsConfig, persist_dir: &str) -> Self
Create a checker with file-based cache persistence.
The cache will be saved to and loaded from the specified directory.
pub async fn is_allowed(&self, url: &Url) -> Decision
Check if a URL is allowed by robots.txt.
This is the main method for checking crawl permissions. It will:
- Extract the origin (scheme + authority) from the URL
- Check the cache for an existing policy
- Fetch robots.txt if not cached
- Match the URL path against the rules
§Returns
A Decision with allowed status and the reason for the decision.
pub async fn is_path_allowed(&self, origin: &Url, path: &str) -> Decision
Check if a path is allowed for a given origin.
Use this when you already have the origin URL and want to check multiple paths without re-fetching robots.txt.
pub async fn get_policy(&self, url: &Url) -> RobotsPolicy
Get the robots.txt policy for a URL (cached or fetched).
This fetches and parses the robots.txt if not already cached, and returns the parsed policy. Uses conditional GET (If-None-Match, If-Modified-Since) for cache revalidation to save bandwidth.
pub async fn get_crawl_delay(&self, url: &Url) -> Option<Duration>
Get crawl delay for a URL.
Returns the effective delay between requests: the Crawl-delay directive if set, otherwise a delay derived from Request-rate, otherwise None.
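The precedence described above can be mirrored by a small pure helper. This is a sketch, not the crate's implementation; in particular the `(requests, seconds)` tuple is an assumed stand-in for the crate's RequestRate type:

```rust
use std::time::Duration;

/// Mirror of the documented precedence: an explicit crawl-delay wins;
/// otherwise a delay is derived from request-rate (seconds per request);
/// otherwise there is no effective delay.
fn effective_delay(
    crawl_delay: Option<Duration>,
    request_rate: Option<(u32, u32)>, // (requests, seconds) -- assumed shape
) -> Option<Duration> {
    crawl_delay.or_else(|| {
        request_rate.map(|(requests, seconds)| {
            Duration::from_secs_f64(f64::from(seconds) / f64::from(requests))
        })
    })
}

fn main() {
    // Crawl-delay: 5 takes precedence over Request-rate: 1/10.
    assert_eq!(
        effective_delay(Some(Duration::from_secs(5)), Some((1, 10))),
        Some(Duration::from_secs(5))
    );
    // Request-rate: 2/10 alone means one request every 5 seconds.
    assert_eq!(
        effective_delay(None, Some((2, 10))),
        Some(Duration::from_secs(5))
    );
    // Neither directive present: no effective delay.
    assert_eq!(effective_delay(None, None), None);
}
```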
pub async fn get_request_rate(&self, url: &Url) -> Option<RequestRate>
Get request rate for a URL.
Returns the request-rate directive if specified in robots.txt.
pub async fn get_sitemaps(&self, url: &Url) -> Vec<String>
Get sitemaps listed in robots.txt.
Returns all Sitemap directives found in the robots.txt.
pub async fn get_effective_rules(&self, url: &Url) -> EffectiveRules
Get effective rules for a URL.
Returns the merged rules that apply to this user-agent.
pub async fn diagnose(&self, url: &Url) -> RobotsDiagnostics
Get detailed diagnostics for a URL check.
Useful for debugging why a URL was allowed or blocked.
pub fn cache_stats(&self) -> CacheStatsSnapshot
Get cache statistics.
pub fn fetch_stats(&self) -> FetchStatsSnapshot
Get fetch statistics.
pub fn clear_cache(&self)
Clear all cached policies.
pub fn evict_expired(&self) -> usize
Evict expired cache entries.
Returns the number of entries evicted.
pub async fn save_cache(&self) -> Result<usize>
Save cache to disk (if persistence enabled).
pub async fn load_cache(&self) -> Result<usize>
Load cache from disk (if persistence enabled).
pub fn cached_domains(&self) -> Vec<String>
Get list of cached domains.
pub fn config(&self) -> &RobotsConfig
Get the configuration.