robotxt 0.4.2

An implementation of the robots.txt (or URL exclusion) protocol with support for the crawl-delay, sitemap, and universal match extensions.

robotxt

Also check out the other xwde projects.

An implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC specification).

Features

  • builder: enables robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
  • parser: enables robotxt::{Robots}. Enabled by default.
  • optimal: enables eviction of overlapping rules and global rule optimizations (slower parsing, but potentially faster matching).

Examples

  • parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#.as_bytes();
    
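    // Keep only the rules of the group that best matches "foobot".
    let r = Robots::from_bytes(txt, "foobot");
    // Allowed by the more specific `Allow: /example/` rule.
    assert!(r.is_allowed("/example/yeah.txt"));
    // Blocked by the even more specific `Disallow: /example/nope.txt` rule.
    assert!(!r.is_allowed("/example/nope.txt"));
    // Blocked by the catch-all `Disallow: *` rule.
    assert!(!r.is_allowed("/invalid/path.txt"));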
}
  • build a new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;

fn main() {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into().unwrap())
        .sitemap("https://example.com/sitemap_2.xml".try_into().unwrap())
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
}
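
The assertions in the parsing example above depend on rule precedence: among all rules that match a path, the longest (most specific) pattern wins, and Allow wins a tie. The following is a minimal, standard-library-only sketch of that idea, including the universal * match and a trailing $ anchor. It is not the crate's implementation; `Rule`, `pattern_matches`, and `path_allowed` are hypothetical names introduced only for illustration.

```rust
/// One rule of a user-agent group. Hypothetical type, not part of robotxt.
#[derive(Clone, Copy)]
enum Rule<'a> {
    Allow(&'a str),
    Disallow(&'a str),
}

/// Returns true if `pattern` matches `path` from its start.
/// `*` matches any run of characters; a trailing `$` anchors the end of the path.
fn pattern_matches(pattern: &str, path: &str) -> bool {
    let (pattern, anchored) = match pattern.strip_suffix('$') {
        Some(stripped) => (stripped, true),
        None => (pattern, false),
    };

    // Literal pieces between wildcards; the first one must sit at the start.
    let mut pieces = pattern.split('*');
    let first = pieces.next().unwrap_or("");
    let mut rest = match path.strip_prefix(first) {
        Some(rest) => rest,
        None => return false,
    };

    let pieces: Vec<&str> = pieces.collect();
    for (i, &piece) in pieces.iter().enumerate() {
        if anchored && i + 1 == pieces.len() {
            // The final piece of an anchored pattern must end the path.
            return rest.ends_with(piece);
        }
        // Otherwise take the earliest occurrence and continue after it.
        match rest.find(piece) {
            Some(at) => rest = &rest[at + piece.len()..],
            None => return false,
        }
    }

    // Without wildcards, an anchored pattern must consume the whole path.
    !anchored || rest.is_empty()
}

/// Longest matching pattern wins; Allow wins ties; no match means allowed.
fn path_allowed(rules: &[Rule<'_>], path: &str) -> bool {
    let mut best: Option<(usize, bool)> = None;
    for rule in rules {
        let (pattern, allow) = match *rule {
            Rule::Allow(p) => (p, true),
            Rule::Disallow(p) => (p, false),
        };
        // An empty pattern restricts nothing, so it is skipped.
        if pattern.is_empty() || !pattern_matches(pattern, path) {
            continue;
        }
        let len = pattern.len();
        best = match best {
            // Keep the current winner if it is longer, or equally long and Allow.
            Some((l, a)) if l > len || (l == len && a) => Some((l, a)),
            _ => Some((len, allow)),
        };
    }
    best.map_or(true, |(_, allow)| allow)
}

fn main() {
    // Same rules as the `foobot` group in the parsing example.
    let rules = [
        Rule::Disallow("*"),
        Rule::Allow("/example/"),
        Rule::Disallow("/example/nope.txt"),
    ];
    for path in ["/example/yeah.txt", "/example/nope.txt", "/invalid/path.txt"] {
        println!("{path}: allowed = {}", path_allowed(&rules, path));
    }
}
```

Under these assumptions the sketch reaches the same three verdicts as the assertions in the parsing example. Pattern length is used here as a simple stand-in for specificity; a real parser may normalize or percent-decode paths before comparing them.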

Links

Notes