§robotxt
Also check out other spire-rs projects here.
The implementation of the robots.txt (or URL exclusion) protocol in the Rust
programming language, with support for the crawl-delay, sitemap, and universal
* match extensions (according to the RFC specification).
§Features
- parser to enable robotxt::{Robots}. Enabled by default.
- builder to enable robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
- optimal to optimize overlapping and global rules, potentially improving matching speed at the cost of longer parsing times.
- serde to enable serde::{Deserialize, Serialize} implementations, allowing the caching of related rules (see the sketch after this list).
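For example, with the serde feature a parsed rule set can be cached and restored later. Below is a minimal sketch; it assumes the default parser feature is also enabled and that serde_json is added as a separate dependency (it is not part of this crate):

use robotxt::Robots;

fn main() -> Result<(), serde_json::Error> {
    let txt = "User-Agent: foobot\nDisallow: /private/";
    let robots = Robots::from_bytes(txt.as_bytes(), "foobot");

    // Serialize the parsed rules once, e.g. to store them in a cache.
    let cached = serde_json::to_string(&robots)?;

    // Restore the rules later without re-fetching or re-parsing robots.txt.
    let restored: Robots = serde_json::from_str(&cached)?;
    assert!(!restored.is_relative_allowed("/private/secret.txt"));
    Ok(())
}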
§Examples
- parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;
fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#;
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
- build the new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;
fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .footer("Robots.txt: End");
    println!("{}", txt.to_string());
    Ok(())
}
§Links
- Request for Comments: 9309 on RFC-Editor.com
- Introduction to Robots.txt on Google.com
- How Google interprets Robots.txt on Google.com
- What is a Robots.txt file on Moz.com
§Notes
- The parser is based on Smerity/texting_robots.
- The Host directive is not supported.
Re-exports§
- pub use url;
Structs§
- GroupBuilder builder
- The single formatted user-agent group.
- Robots parser
- The set of directives related to the specific user-agent in the provided robots.txt file.
- RobotsBuilder builder
- The set of formatted user-agent groups that can be written in the robots.txt-compliant format.
Enums§
- AccessResult parser
- The result of the robots.txt retrieval attempt.
- Error
- Unrecoverable failure during robots.txt building or parsing.
Constants§
- ALL_UAS parser
- All user agents group, used as a default for user-agents that don’t have an explicitly defined matching group (see the sketch after this list).
- BYTE_LIMIT 
- Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). See How Google interprets Robots.txt.
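A minimal sketch of that fallback, using only the documented parser calls from the examples above (the user-agent names and paths here are illustrative):

use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: *
      Disallow: /private/

      User-Agent: foobot
      Disallow: /foobot-only/
    "#;

    // There is no explicit `barbot` group, so the `*` group (ALL_UAS) applies.
    let r = Robots::from_bytes(txt.as_bytes(), "barbot");
    assert!(!r.is_relative_allowed("/private/file.txt"));
    assert!(r.is_relative_allowed("/foobot-only/file.txt"));
}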
Functions§
- create_url 
- Returns the expected path to the robots.txt file as the url::Url (see the sketch below).
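A hedged usage sketch: it assumes that create_url accepts a borrowed url::Url for any address on the target host and returns Result<url::Url, robotxt::Error>, with the result pointing at /robots.txt on that host; consult the item documentation for the authoritative signature.

use robotxt::url::Url; // `url` is re-exported by this crate

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Any address on the host should do; the path is illustrative.
    let page = Url::parse("https://example.com/some/deep/page.html")?;

    // Assumed to resolve to "https://example.com/robots.txt".
    let robots = robotxt::create_url(&page)?;
    println!("{robots}");
    Ok(())
}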