§robotxt
Also check out other spire-rs projects here.
The implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC specification).
§Features
- parser to enable robotxt::{Robots}. Enabled by default.
- builder to enable robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
- optimal to optimize overlapping and global rules, potentially improving matching speed at the cost of longer parsing times.
- serde to enable the serde::{Deserialize, Serialize} implementation, allowing the caching of related rules (see the sketch below).
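For instance, with the serde feature enabled, a parsed Robots matcher can be cached and restored without re-parsing. The sketch below uses serde_json purely for illustration; serde_json is not a dependency of this crate, and the serialized representation is left unspecified here.

use robotxt::Robots;

fn main() -> Result<(), serde_json::Error> {
    let txt = "User-Agent: foobot\nDisallow: /private/";
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");

    // With the serde feature enabled, the parsed rules can be stored
    // (e.g. on disk or in a cache) and restored without re-parsing.
    // serde_json is assumed here only for illustration.
    let cached = serde_json::to_string(&r)?;
    let restored: Robots = serde_json::from_str(&cached)?;
    assert!(!restored.is_relative_allowed("/private/file.txt"));
    Ok(())
}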
§Examples
- parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#;

    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
- build a new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_2.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
    Ok(())
}
§Links
- Request for Comments: 9309 on RFC-Editor.org
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is a Robots.txt file on Moz.com
§Notes
- The parser is based on Smerity/texting_robots.
- The Host directive is not supported.
Re-exports§
pub use url;
Structs§
- GroupBuilder (builder) - The single formatted user-agent group.
- Robots (parser) - The set of directives related to the specific user-agent in the provided robots.txt file.
- RobotsBuilder (builder) - The set of formatted user-agent groups that can be written in the robots.txt-compliant format.
Enums§
- AccessResult (parser) - The result of the robots.txt retrieval attempt.
- Error - Unrecoverable failure during robots.txt building or parsing.
Constants§
- ALL_UAS (parser) - All user agents group, used as a default for user-agents that don’t have an explicitly defined matching group.
- BYTE_LIMIT - Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). See How Google interprets Robots.txt. A usage sketch of both constants follows this list.
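As a rough sketch of how these constants might be used together, the snippet below truncates a fetched response body to BYTE_LIMIT before parsing and matches against the wildcard ALL_UAS group. It assumes BYTE_LIMIT is a usize and ALL_UAS a &str; the fetching step is elided.

use robotxt::{Robots, ALL_UAS, BYTE_LIMIT};

fn main() {
    // A response body obtained elsewhere (fetching is out of scope here).
    let body: Vec<u8> = b"User-Agent: *\nDisallow: /private/\n".to_vec();

    // Keep at most BYTE_LIMIT bytes, mirroring the documented size limit.
    let limit = body.len().min(BYTE_LIMIT);
    let body = &body[..limit];

    // ALL_UAS names the wildcard group, which also serves as the fallback
    // for user-agents without a dedicated group.
    let r = Robots::from_bytes(body, ALL_UAS);
    assert!(!r.is_relative_allowed("/private/file.txt"));
}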
Functions§
- create_url - Returns the expected path to the robots.txt file as the url::Url (sketched below).
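A minimal sketch of create_url, assuming it takes a borrowed url::Url and returns Result<url::Url, robotxt::Error>; check the function's own page for the exact signature.

use robotxt::url::Url;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let page = Url::parse("https://example.com/articles/2024/post.html?ref=rss")?;

    // Assumed signature: create_url(&Url) -> Result<Url, robotxt::Error>.
    // The path and query are expected to be replaced with /robots.txt.
    let robots = robotxt::create_url(&page)?;
    assert_eq!(robots.as_str(), "https://example.com/robots.txt");
    Ok(())
}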