§robotxt
Also check out other spire-rs projects here.
The implementation of the robots.txt (or URL exclusion) protocol in the Rust
programming language, with support for the crawl-delay, sitemap, and universal
`*` match extensions (according to the RFC 9309 specification).
§Features
- `builder` to enable `robotxt::{RobotsBuilder, GroupBuilder}`. This feature is enabled by default.
- `parser` to enable `robotxt::{Robots}`. This feature is enabled by default.
- `optimal` to enable overlapping rule eviction and global rule optimizations (this may result in longer parsing times but potentially faster matching).
- `serde` to enable a custom `serde::{Deserialize, Serialize}` implementation, allowing for the caching of related rules (see the sketch after this list).
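As a minimal sketch of that caching use case, assuming the `serde` feature is enabled and `serde_json` is pulled in separately (the serialization format shown here is an illustrative choice, not something the crate prescribes):

```rust
use robotxt::Robots;

fn main() -> Result<(), serde_json::Error> {
    let txt = "User-Agent: foobot\nDisallow: /private/".as_bytes();
    let robots = Robots::from_bytes(txt, "foobot");

    // Serialize the parsed rules so they can be cached (on disk, in Redis, etc.)
    // and restored later without re-fetching or re-parsing the robots.txt file.
    let cached = serde_json::to_string(&robots)?;
    let restored: Robots = serde_json::from_str(&cached)?;

    assert_eq!(
        robots.is_relative_allowed("/private/page.html"),
        restored.is_relative_allowed("/private/page.html")
    );
    Ok(())
}
```

Any serde-compatible format works equally well; JSON is used here only to keep the example short.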
§Examples
- parse the most specific `user-agent` in the provided `robots.txt` file:
```rust
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#.as_bytes();

    let r = Robots::from_bytes(txt, "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
```

- build the new `robots.txt` file in a declarative manner:
```rust
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_2.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt);
    Ok(())
}
```

§Links
- Request for Comments: 9309 on RFC-Editor.com
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is Robots.txt file on Moz.com
§Notes
- The parser is based on Smerity/texting_robots.
- The `Host` directive is not supported.
Re-exports§
pub use url;
Structs§
- `GroupBuilder` (`builder`): The single formatted `user-agent` group.
- `Robots` (`parser`): The set of directives related to the specific `user-agent` in the provided `robots.txt` file.
- `RobotsBuilder` (`builder`): The set of formatted `user-agent` groups that can be written in the `robots.txt`-compliant format.
Enums§
- `AccessResult` (`parser`): The result of the `robots.txt` retrieval attempt.
- `Error`: Unrecoverable failure during `robots.txt` building or parsing.
Constants§
- `ALL_UAS` (`parser`): All user agents group, used as a default for user-agents that don’t have an explicitly defined matching group.
- `BYTE_LIMIT`: Google currently enforces a `robots.txt` file size limit of 500 kibibytes (KiB); see How Google interprets Robots.txt. A truncation sketch follows this list.
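Since parsers are expected to cap the amount of input they consider, a hedged sketch of applying such a limit before parsing might look as follows (the truncation strategy and the hard-coded 500 KiB value are assumptions for illustration; the crate exposes its own constant for the limit):

```rust
use robotxt::Robots;

/// Truncates an over-sized robots.txt body before parsing, mirroring the
/// 500 KiB (512,000 byte) limit that Google enforces. The constant is written
/// out here only to keep the sketch self-contained.
fn parse_with_limit(body: &[u8], user_agent: &str) -> Robots {
    const LIMIT: usize = 500 * 1024;
    let end = body.len().min(LIMIT);
    Robots::from_bytes(&body[..end], user_agent)
}

fn main() {
    let body = b"User-Agent: *\nDisallow: /private/";
    let robots = parse_with_limit(body, "foobot");
    // "foobot" has no explicit group, so it falls back to the `*` group.
    assert!(!robots.is_relative_allowed("/private/index.html"));
}
```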
Functions§
- `create_url`: Returns the expected path to the `robots.txt` file as the `url::Url` (see the sketch below).
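As an illustration of what that location is, the sketch below computes it by hand with the re-exported `url` crate; it is not the crate's own helper, whose exact signature is not reproduced here:

```rust
use robotxt::url::{ParseError, Url};

fn main() -> Result<(), ParseError> {
    // The robots.txt file lives at the root of the authority, regardless of
    // the path or query of the page being checked.
    let page = Url::parse("https://example.com/example/nested/page.html?q=1")?;
    let robots = page.join("/robots.txt")?;
    assert_eq!(robots.as_str(), "https://example.com/robots.txt");
    Ok(())
}
```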