Struct RobotsTxtManager 

pub struct RobotsTxtManager { /* private fields */ }

Fetches, parses, and caches robots.txt rules per domain.

You normally do not construct this yourself; the CrawlerEngine creates one when robots_txt_obey is enabled. The manager keeps an internal per-domain cache, so each domain’s robots.txt is fetched at most once per crawl run.
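The at-most-once behavior can be pictured as a map keyed by domain. The following is a minimal sketch under stated assumptions, not the crate's implementation: `Rules`, `RulesCache`, `get_or_fetch`, and the `fetch` closure are hypothetical names, and the real manager performs its fetches asynchronously through a `SessionManager`.

```rust
use std::collections::HashMap;

/// Hypothetical parsed rules for one domain (illustrative only).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Rules {
    pub disallow: Vec<String>,
}

/// Hypothetical per-domain cache; not the crate's actual type.
#[derive(Default)]
pub struct RulesCache {
    by_domain: HashMap<String, Rules>,
}

impl RulesCache {
    /// Returns the cached rules for `domain`, calling `fetch` only on a cache miss.
    pub fn get_or_fetch<F>(&mut self, domain: &str, fetch: F) -> &Rules
    where
        F: FnOnce(&str) -> Rules,
    {
        self.by_domain
            .entry(domain.to_string())
            .or_insert_with(|| fetch(domain))
    }
}
```

Because the lookup goes through the map's entry API, a second call for the same domain returns the stored rules without invoking `fetch` again.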

Implementations

impl RobotsTxtManager

pub fn new() -> Self

Creates a new robots.txt manager with an empty cache. Rules will be fetched on demand the first time a URL on a new domain is checked.

pub async fn can_fetch( &mut self, url: &str, sid: &str, session_manager: &SessionManager, ) -> bool

Returns true if the URL is allowed by the domain’s robots.txt rules.

The method extracts the domain from the URL, fetches and parses that domain’s robots.txt if it is not already cached, and checks the URL’s path against the Disallow directives. URLs that cannot be parsed (for example, because they are malformed) are treated as allowed.
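The check described above can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not the crate's matcher: it uses plain prefix matching, ignores Allow directives and wildcards, and the helper names `split_url`, `path_allowed`, and `allowed` are hypothetical.

```rust
/// Splits an absolute URL into (host, path). Returns None when the URL has
/// no "://" separator or an empty host. Simplified: no userinfo or port handling.
pub fn split_url(url: &str) -> Option<(&str, &str)> {
    let rest = url.split_once("://")?.1;
    let (host, path) = match rest.find('/') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, "/"),
    };
    if host.is_empty() {
        None
    } else {
        Some((host, path))
    }
}

/// Returns true if `path` is not covered by any non-empty Disallow prefix.
/// An empty Disallow value conventionally means "allow everything".
pub fn path_allowed(path: &str, disallow: &[&str]) -> bool {
    !disallow
        .iter()
        .any(|rule| !rule.is_empty() && path.starts_with(*rule))
}

/// Mirrors the documented fallback: URLs that cannot be parsed count as allowed.
pub fn allowed(url: &str, disallow: &[&str]) -> bool {
    match split_url(url) {
        Some((_host, path)) => path_allowed(path, disallow),
        None => true,
    }
}
```

With a rule list of `["/private/"]`, a URL under `/private/` is rejected, while `/public` and an unparseable string are both accepted.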

pub async fn get_crawl_delay( &mut self, url: &str, sid: &str, session_manager: &SessionManager, ) -> Option<f64>

Returns the crawl-delay specified in the domain’s robots.txt, if any. The delay is in seconds and comes from the Crawl-delay directive under User-agent: *. Returns None if no delay is specified or if the domain’s robots.txt could not be fetched.
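As a rough illustration of reading Crawl-delay from the User-agent: * group, the sketch below scans a robots.txt body line by line. It is an assumption about how such parsing could work, not the crate's parser; the function name `crawl_delay_for_any_agent` is hypothetical, and negative or non-finite values are simply skipped.

```rust
/// Extracts the Crawl-delay (in seconds) from the `User-agent: *` group, if any.
pub fn crawl_delay_for_any_agent(robots_txt: &str) -> Option<f64> {
    let mut in_star_group = false;
    // Consecutive User-agent lines belong to the same group.
    let mut last_was_agent = false;

    for raw in robots_txt.lines() {
        // Drop trailing comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        let (key, value) = match line.split_once(':') {
            Some(kv) => kv,
            None => continue, // blank or malformed line
        };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim();

        match key.as_str() {
            "user-agent" => {
                if !last_was_agent {
                    in_star_group = false; // a new group starts here
                }
                if value == "*" {
                    in_star_group = true;
                }
                last_was_agent = true;
            }
            "crawl-delay" if in_star_group => {
                if let Ok(secs) = value.parse::<f64>() {
                    if secs.is_finite() && secs >= 0.0 {
                        return Some(secs);
                    }
                }
                last_was_agent = false;
            }
            _ => last_was_agent = false,
        }
    }
    None
}
```

A delay declared only for a named agent is ignored, which matches the documented behavior of returning None when the wildcard group specifies no delay.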

pub async fn prefetch( &mut self, urls: &[String], sid: &str, session_manager: &SessionManager, )

Pre-fetches robots.txt for all unique domains in the given URLs.

The engine calls this before the crawl loop starts so that the first batch of requests is not delayed by on-demand robots.txt lookups. Duplicate domains are deduplicated internally.
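The deduplication step might look like the following sketch, which collects each distinct host once before any fetching would happen. The helper name `unique_domains` is hypothetical, host extraction is deliberately simplified (no port or userinfo handling), and the actual manager may normalize domains differently.

```rust
use std::collections::BTreeSet;

/// Collects the distinct hosts found in `urls`, lowercased and in sorted order.
/// Entries without a "://" separator or with an empty host are skipped.
pub fn unique_domains(urls: &[String]) -> Vec<String> {
    let mut seen = BTreeSet::new();
    for url in urls {
        if let Some((_scheme, rest)) = url.split_once("://") {
            // The host is everything up to the first '/' after the scheme.
            let host = rest.split('/').next().unwrap_or("");
            if !host.is_empty() {
                seen.insert(host.to_ascii_lowercase());
            }
        }
    }
    seen.into_iter().collect()
}
```

A set is used so that a seed list with thousands of URLs on a handful of sites results in only a handful of robots.txt requests.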

Trait Implementations

impl Default for RobotsTxtManager

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.