robotxt 0.6.0

An implementation of the robots.txt (or URL exclusion) protocol with support for the crawl-delay, sitemap, and universal match extensions.
robotxt-0.6.0 has been yanked.

robotxt


Also check out the other spire-rs projects.

An implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC specification).

Features

  • builder: enables robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
  • parser: enables robotxt::{Robots}. Enabled by default.
  • optimal: enables overlapping rule eviction and global rule optimizations. Parsing may take longer, but matching is potentially faster.
  • serde: enables a custom serde::{Deserialize, Serialize} implementation, so that parsed rules can be cached.

Examples

  • parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#.as_bytes();

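    // Parse only the rules of the group that best matches the "foobot" user-agent.
    let r = Robots::from_bytes(txt, "foobot");
    // Allowed: `Allow: /example/` is more specific than `Disallow: *`.
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    // Blocked: `Disallow: /example/nope.txt` is the longest matching rule.
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    // Blocked: only the catch-all `Disallow: *` matches this path.
    assert!(!r.is_relative_allowed("/invalid/path.txt"));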
}
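
The assertions in the parsing example follow the usual robots.txt precedence: among all rules that match a path, the longest pattern wins, and an Allow wins a tie. The sketch below illustrates that idea using only the standard library. It is not how robotxt is implemented, and the names `Rule`, `pattern_matches` and `is_allowed` are hypothetical helpers introduced for this illustration only.

```rust
/// A single rule from one user-agent group (hypothetical type, not part of robotxt).
#[derive(Debug, Clone, Copy)]
enum Rule<'a> {
    Allow(&'a str),
    Disallow(&'a str),
}

/// Returns true if `pattern` matches the beginning of `path`.
/// `*` matches any run of characters; a trailing `$` anchors the end of the path.
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(stripped) => (stripped, true),
        None => (pattern, false),
    };

    // The literal before the first `*` must sit at the very start of the path.
    let mut literals = pattern.split('*');
    let first = literals.next().unwrap_or("");
    let mut rest = match path.strip_prefix(first) {
        Some(rest) => rest,
        None => return false,
    };

    // Every following literal must appear, in order, in what remains.
    let literals: Vec<&str> = literals.collect();
    for (i, literal) in literals.iter().enumerate() {
        let is_last = i + 1 == literals.len();
        if anchored && is_last {
            // With `$`, the final literal has to end the path.
            return rest.ends_with(literal);
        }
        match rest.find(literal) {
            Some(idx) => rest = &rest[idx + literal.len()..],
            None => return false,
        }
    }

    // Without wildcards, `$` requires the whole path to be consumed.
    !anchored || rest.is_empty()
}

/// Resolves `path` against `rules`: the longest matching pattern wins,
/// an Allow beats a Disallow of equal length, and no match means allowed.
fn is_allowed(rules: &[Rule<'_>], path: &str) -> bool {
    let mut best: Option<(usize, bool)> = None; // (pattern length, allowed)
    for rule in rules {
        let (pattern, allowed) = match *rule {
            Rule::Allow(p) => (p, true),
            Rule::Disallow(p) => (p, false),
        };
        // An empty pattern carries no restriction, so it is skipped.
        if pattern.is_empty() || !pattern_matches(pattern, path) {
            continue;
        }
        // Tuples compare by length first; on a tie, `true` (Allow) > `false`.
        let candidate = (pattern.len(), allowed);
        if best.map_or(true, |current| candidate > current) {
            best = Some(candidate);
        }
    }
    best.map_or(true, |(_, allowed)| allowed)
}
```

Given the three rules from the example (`Disallow: *`, `Allow: /example/`, `Disallow: /example/nope.txt`), `is_allowed` reaches the same three verdicts as the assertions: /example/yeah.txt is allowed, while /example/nope.txt and /invalid/path.txt are not.
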
  • build a new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_2.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
    Ok(())
}

Links

Notes