pub struct RobotFileParser {
    pub disallow_all: bool,
    pub allow_all: bool,
    pub last_checked: i64,
    pub disallow_paths_regex: RegexSet,
    pub disallow_paths: HashSet<String>,
    pub disallow_agents_regex: RegexSet,
    pub wild_card_agent: bool,
    pub disallow_agents: HashSet<String>,
    /* private fields */
}
robots.txt file parser
Fields§
disallow_all: bool
Disallow links regardless of robots.txt.
allow_all: bool
Allow links regardless of robots.txt.
last_checked: i64
Time the robots.txt file was last checked.
disallow_paths_regex: RegexSet
Available on crate feature regex only. Disallow list of regex paths to ignore.
disallow_paths: HashSet<String>
Available on crate feature regex only. Disallow list of paths to ignore.
disallow_agents_regex: RegexSet
Available on crate feature regex only. Disallow list of regex agents to ignore.
wild_card_agent: bool
Available on crate feature regex only. Wild card agent provided.
disallow_agents: HashSet<String>
Available on crate feature regex only. Disallow list of agents to ignore.
Implementations§
impl RobotFileParser
pub fn new() -> Box<RobotFileParser>
Available on crate feature regex only.
Establishes a new robots.txt parser for a website domain.
pub fn mtime(&self) -> i64
Returns the time the robots.txt file was last fetched.
This is useful for long-running web spiders that need to check for new robots.txt files periodically.
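As a rough illustration of the periodic check described above, a spider could compare the stored timestamp against the current time. This is only a sketch: `unix_now` and `robots_is_stale` are hypothetical helpers, not part of the crate, and the assumption that the timestamp is in Unix seconds is not confirmed by this page.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current Unix time in whole seconds (assumed unit for this sketch).
fn unix_now() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// Returns true when a robots.txt fetched at `last_checked` is at least
/// `max_age_secs` old as of `now`. A non-positive `last_checked` is
/// treated as "never fetched".
fn robots_is_stale(last_checked: i64, now: i64, max_age_secs: i64) -> bool {
    last_checked <= 0 || now.saturating_sub(last_checked) >= max_age_secs
}
```

Under these assumptions, a caller would pass the value returned by mtime() as `last_checked` and `unix_now()` as `now`, then re-read the robots.txt file when the helper returns true.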
pub fn modified(&mut self)
Sets the time the robots.txt file was last fetched to the current time.
pub fn get_entries(&self) -> &Vec<Entry>
Get the entries inserted.
pub fn get_base_entry(&self) -> &Entry
Get the base entry inserted.
pub async fn read(&mut self, client: &Client, url: &str)
Reads the robots.txt URL and feeds it to the parser.
pub async fn from_response(&mut self, response: Response)
Reads the HTTP response and feeds it to the parser.
pub fn parse_str(&mut self, text: &str)
Parse a robots.txt string directly, splitting lines via memchr
to avoid allocating an intermediate Vec<&str>.
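The idea of splitting lines without collecting them first can be sketched with the standard library alone. This is not the crate's implementation: `for_each_line` is a hypothetical helper, and a plain byte search via `position` stands in for memchr.

```rust
/// Visits each line of `text` in order, splitting on '\n' and trimming a
/// trailing '\r'. No intermediate Vec<&str> is built; every line passed to
/// `visit` borrows directly from `text`.
fn for_each_line<'a>(text: &'a str, mut visit: impl FnMut(&'a str)) {
    let bytes = text.as_bytes();
    let mut start = 0;
    while start < bytes.len() {
        // Find the next newline byte; std's position() stands in for memchr.
        let end = bytes[start..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(bytes.len(), |i| start + i);
        // '\n' is ASCII, so both indices fall on valid char boundaries.
        let line = text[start..end].trim_end_matches('\r');
        visit(line);
        start = end + 1;
    }
}
```

Because each line is handed to the callback as soon as it is found, the only memory in use is the original string itself.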
pub fn parse<T: AsRef<str>>(&mut self, lines: &[T])
Parse the input lines from a robots.txt file.
A user-agent: line does not need to be preceded by one or more blank lines.
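To make the blank-line leniency concrete, here is a simplified sketch of grouping robots.txt lines into per-agent groups. The `Group` type and `group_lines` function are hypothetical and much simpler than the crate's Entry handling; in particular, this sketch ignores blank lines entirely and starts a new group whenever a user-agent line follows at least one rule.

```rust
/// One group of rules: the agents it names and its raw (directive, value) pairs.
#[derive(Debug, Default, PartialEq)]
struct Group {
    agents: Vec<String>,
    rules: Vec<(String, String)>,
}

/// Groups lines by user-agent. Lines without a ':' (including blank lines)
/// are skipped, and rules that appear before any user-agent line are dropped.
fn group_lines<T: AsRef<str>>(lines: &[T]) -> Vec<Group> {
    let mut groups: Vec<Group> = Vec::new();
    let mut current = Group::default();
    for raw in lines {
        // Strip comments and surrounding whitespace.
        let line = raw.as_ref().split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else { continue };
        let key = key.trim().to_ascii_lowercase();
        let value = value.trim().to_string();
        if key == "user-agent" {
            // A user-agent line after rules begins a new group,
            // whether or not a blank line came before it.
            if !current.rules.is_empty() {
                groups.push(std::mem::take(&mut current));
            }
            current.agents.push(value);
        } else if !current.agents.is_empty() {
            current.rules.push((key, value));
        }
    }
    if !current.agents.is_empty() {
        groups.push(current);
    }
    groups
}
```

With this approach, two groups written back to back with no separating blank line are still split correctly at the second user-agent line.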
pub fn set_disallow_list(&mut self, path: &str)
Available on crate feature regex only.
Include the disallow paths in the regex set. This does nothing without the ‘regex’ feature.
pub fn set_disallow_agents_list(&mut self, agent: &str)
Available on crate feature regex only.
Include the disallow agents in the regex set. This does nothing without the ‘regex’ feature.
pub fn build_disallow_list(&mut self)
Available on crate feature regex only.
Build the regex disallow list. This does nothing without the ‘regex’ feature.
pub fn can_fetch<T: AsRef<str>>(&self, useragent: T, url: &str) -> bool
Using the parsed robots.txt, decide whether useragent can fetch url.
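The kind of decision this method makes can be illustrated with a first-match prefix check, as used by classic robots.txt parsers. This is a sketch under assumptions: `Rule` and `path_allowed` are hypothetical names, the crate's actual matching (including its regex feature and user-agent selection) may differ, and empty rule values are not given special treatment here.

```rust
/// A single rule: a path prefix and whether it allows or disallows access.
struct Rule {
    prefix: &'static str,
    allow: bool,
}

/// Returns the verdict of the first rule whose prefix matches `path`.
/// A path matched by no rule is allowed.
fn path_allowed(rules: &[Rule], path: &str) -> bool {
    rules
        .iter()
        .find(|r| path.starts_with(r.prefix))
        .map_or(true, |r| r.allow)
}
```

Rule order matters in this scheme: a more specific allow rule must come before a broader disallow rule to take effect.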
pub fn entry_allowed<T: AsRef<str>>(&self, useragent: &T, url_str: &str) -> bool
Available on crate feature regex only.
Does the entry apply to the robots.txt?
pub fn get_crawl_delay(
    &self,
    useragent: &Option<Box<CompactString>>,
) -> Option<Duration>
Returns the crawl delay for this user agent as a Duration, or None if no crawl delay is defined.
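For context, a crawl delay usually originates from a `Crawl-delay` line in robots.txt. The following sketch shows one way such a line could be turned into a Duration; `parse_crawl_delay` is a hypothetical helper, not the crate's code, and it assumes whole seconds only (fractional values are rejected).

```rust
use std::time::Duration;

/// Parses a line such as "Crawl-delay: 10" into a Duration of whole seconds.
/// Returns None for other directives or for values that are not a
/// non-negative integer.
fn parse_crawl_delay(line: &str) -> Option<Duration> {
    let (key, value) = line.split_once(':')?;
    if !key.trim().eq_ignore_ascii_case("crawl-delay") {
        return None;
    }
    let secs: u64 = value.trim().parse().ok()?;
    Some(Duration::from_secs(secs))
}
```

Returning an Option mirrors the method above: an absent or malformed directive yields None, so the caller can fall back to its own default delay.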
pub fn get_req_rate<T: AsRef<str>>(&self, useragent: T) -> Option<RequestRate>
Returns the request rate for this user agent as a RequestRate, or None if no request rate is defined.
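A request rate is conventionally written in robots.txt as `requests/seconds`, for example `Request-rate: 1/5`. The sketch below parses such a value; `Rate` and `parse_request_rate` are hypothetical stand-ins, and the field layout of the crate's RequestRate type is not shown on this page, so it is not assumed here.

```rust
/// Hypothetical stand-in for a request rate: `requests` per `seconds`.
#[derive(Debug, PartialEq)]
struct Rate {
    requests: u32,
    seconds: u32,
}

/// Parses a value such as "1/5" into a Rate. Returns None when the value
/// lacks a '/' or either side is not a non-negative integer.
fn parse_request_rate(value: &str) -> Option<Rate> {
    let (requests, seconds) = value.trim().split_once('/')?;
    let requests = requests.trim().parse::<u32>().ok()?;
    let seconds = seconds.trim().parse::<u32>().ok()?;
    Some(Rate { requests, seconds })
}
```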
Trait Implementations§
impl Clone for RobotFileParser
fn clone(&self) -> RobotFileParser
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
Auto Trait Implementations§
impl Freeze for RobotFileParser
impl RefUnwindSafe for RobotFileParser
impl Send for RobotFileParser
impl Sync for RobotFileParser
impl Unpin for RobotFileParser
impl UnsafeUnpin for RobotFileParser
impl UnwindSafe for RobotFileParser
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.