pub struct RobotsTxtManager { /* private fields */ }
Fetches, parses, and caches robots.txt rules per domain.
You normally do not construct this directly; the CrawlerEngine
creates one when robots_txt_obey is enabled. The manager keeps an
internal cache so that each domain’s robots.txt is fetched at most once
per crawl run.
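The "fetched at most once" behavior can be illustrated with a small per-domain cache. This is only a sketch under stated assumptions: `RulesCache`, `rules_for`, and the `fetch` closure are hypothetical names invented for illustration and say nothing about the manager's private fields or how it performs network requests.

```rust
use std::collections::HashMap;

/// Hypothetical per-domain cache (not the crate's actual internals).
struct RulesCache {
    /// Disallow prefixes keyed by domain.
    by_domain: HashMap<String, Vec<String>>,
    /// Number of times the fetch closure actually ran.
    fetches: usize,
}

impl RulesCache {
    fn new() -> Self {
        Self { by_domain: HashMap::new(), fetches: 0 }
    }

    /// Returns the cached rules for `domain`, calling `fetch` only on a miss.
    fn rules_for(
        &mut self,
        domain: &str,
        fetch: impl FnOnce(&str) -> Vec<String>,
    ) -> &[String] {
        if !self.by_domain.contains_key(domain) {
            self.fetches += 1;
            let rules = fetch(domain);
            self.by_domain.insert(domain.to_string(), rules);
        }
        &self.by_domain[domain]
    }
}
```

Calling `rules_for` twice with the same domain runs the closure once; the second call returns the stored rules and ignores its closure.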
Implementations
impl RobotsTxtManager
pub fn new() -> Self
Creates a new robots.txt manager with an empty cache. Rules are fetched on demand the first time a URL on a new domain is checked.
pub async fn can_fetch( &mut self, url: &str, sid: &str, session_manager: &SessionManager, ) -> bool
Returns true if the URL is allowed by the domain’s robots.txt rules.
The method extracts the domain from the URL, fetches and parses
robots.txt if it hasn’t been cached yet, and checks whether the URL’s
path matches any Disallow directive. URLs that cannot be parsed (for
example, malformed input) are treated as allowed.
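The check described above can be sketched as a prefix match of the URL path against a list of Disallow values. This is a simplified illustration, not the crate's implementation: `path_of` and `is_allowed` are hypothetical helpers, the URL is split by hand instead of with a parsing crate, and the sketch ignores Allow directives, wildcards, and user-agent groups.

```rust
/// Extracts the path component of a URL without a URL-parsing crate.
/// Returns None when the input has no "scheme://host" prefix.
fn path_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    if rest.is_empty() {
        return None;
    }
    // Everything from the first '/' after the host is the path; default to "/".
    Some(match rest.find('/') {
        Some(idx) => &rest[idx..],
        None => "/",
    })
}

/// Hypothetical check: a URL is allowed unless its path starts with a
/// non-empty Disallow prefix. Unparseable URLs are treated as allowed.
fn is_allowed(url: &str, disallow: &[&str]) -> bool {
    match path_of(url) {
        Some(path) => !disallow
            .iter()
            .any(|rule| !rule.is_empty() && path.starts_with(*rule)),
        None => true,
    }
}
```

An empty Disallow value is skipped because, by robots.txt convention, `Disallow:` with no path blocks nothing.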
pub async fn get_crawl_delay( &mut self, url: &str, sid: &str, session_manager: &SessionManager, ) -> Option<f64>
Returns the crawl-delay specified in the domain’s robots.txt, if any.
The delay is in seconds and comes from the Crawl-delay directive under
User-agent: *. Returns None if no delay is specified or if the
domain’s robots.txt could not be fetched.
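To make the Crawl-delay lookup concrete, here is a minimal sketch of reading the directive from the `User-agent: *` group of a robots.txt body. The function name `crawl_delay_for_all` is hypothetical. The grouping logic is deliberately simplified (only the most recent User-agent line decides the group), and rejecting negative or non-finite values is an assumption of this sketch, not documented behavior.

```rust
/// Hypothetical parser: returns the Crawl-delay (in seconds) from the
/// `User-agent: *` group, or None if it is absent or not a usable number.
fn crawl_delay_for_all(robots_txt: &str) -> Option<f64> {
    let mut in_star_group = false;
    for line in robots_txt.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        // Skip lines that are not "key: value" pairs.
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim();
        match key.as_str() {
            // Simplification: the latest User-agent line decides the group.
            "user-agent" => in_star_group = value == "*",
            "crawl-delay" if in_star_group => {
                return value
                    .parse::<f64>()
                    .ok()
                    .filter(|d| d.is_finite() && *d >= 0.0);
            }
            _ => {}
        }
    }
    None
}
```

A caller could feed the returned value to `std::time::Duration::from_secs_f64` to pause between requests to the same domain.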
pub async fn prefetch( &mut self, urls: &[String], sid: &str, session_manager: &SessionManager, )
Pre-fetches robots.txt for all unique domains in the given URLs.
The engine calls this before the crawl loop starts so that the first batch of requests is not delayed by on-demand robots.txt lookups. Each domain is fetched only once, even if it appears in several of the given URLs.
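The deduplication step can be sketched as collecting each distinct host once, in first-seen order. The helpers `host_of` and `unique_domains` are hypothetical and only illustrate the idea. The host is extracted by hand, so ports and userinfo stay part of the key, and lowercasing the host is an assumption of this sketch.

```rust
use std::collections::HashSet;

/// Extracts the host part of a URL; returns None for inputs without
/// a "scheme://host" prefix.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    // The host ends at the first path, query, or fragment delimiter.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if host.is_empty() { None } else { Some(host) }
}

/// Returns each distinct host once, preserving first-seen order.
/// URLs that cannot be split are skipped.
fn unique_domains(urls: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.iter()
        .filter_map(|u| host_of(u))
        .map(|h| h.to_ascii_lowercase())
        // `insert` returns false for hosts that were already recorded.
        .filter(|h| seen.insert(h.clone()))
        .collect()
}
```

A prefetch routine could then issue one robots.txt request per entry of the returned list before the crawl loop begins.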
Auto Trait Implementations
impl Freeze for RobotsTxtManager
impl RefUnwindSafe for RobotsTxtManager
impl Send for RobotsTxtManager
impl Sync for RobotsTxtManager
impl Unpin for RobotsTxtManager
impl UnsafeUnpin for RobotsTxtManager
impl UnwindSafe for RobotsTxtManager