Crate robotxt

The implementation of the robots.txt (or URL exclusion) protocol, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC 9309 specification).

Also check out other xwde projects.

Examples

  • parse the directives that apply to the given user-agent in the provided robots.txt file (see Robots):
use robotxt::Robots;

let txt = r#"
    User-Agent: foobot
    Disallow: *
    Allow: /example/
    Disallow: /example/nope.txt
"#.as_bytes();

let r = Robots::from_bytes(txt, "foobot");
assert!(r.is_allowed("/example/yeah.txt"));
assert!(!r.is_allowed("/example/nope.txt"));
assert!(!r.is_allowed("/invalid/path.txt"));
  • build a new robots.txt file from the provided directives (see Factory; a round-trip sketch follows this example):
use url::Url;
use robotxt::Factory;

let txt = Factory::default()
    .header("Robots.txt Header")
    .group(["foobot"], |u| {
        u.crawl_delay(5)
            .header("Rules for Foobot: Start")
            .allow("/example/yeah.txt")
            .disallow("/example/nope.txt")
            .footer("Rules for Foobot: End")
    })
    .group(["barbot", "nombot"], |u| {
        u.crawl_delay(2)
            .disallow("/example/yeah.txt")
            .disallow("/example/nope.txt")
    })
    .sitemap("https://example.com/sitemap_1.xml").unwrap()
    .sitemap("https://example.com/sitemap_2.xml").unwrap()
    .footer("Robots.txt Footer");

println!("{}", txt);
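
Since the builder output is plain robots.txt text, the two APIs compose. A minimal round-trip sketch, assuming that Factory implements Display (as the to_string() usage above suggests) and that its output parses back through Robots::from_bytes under the same matching rules:

use robotxt::{Factory, Robots};

let txt = Factory::default()
    .group(["foobot"], |u| {
        u.allow("/example/yeah.txt")
            .disallow("/example/")
    })
    .to_string();

// The longest matching rule wins, so the explicit Allow overrides
// the broader Disallow for that exact path.
let r = Robots::from_bytes(txt.as_bytes(), "foobot");
assert!(r.is_allowed("/example/yeah.txt"));
assert!(!r.is_allowed("/example/nope.txt"));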

Notes

Structs

  • The Factory struct represents a set of formatted user-agent groups that can be written to a generic writer in the robots.txt-compliant format (see the sketch after this list).
  • The Robots struct represents the set of directives related to a specific user-agent in the provided robots.txt file.
  • The Section struct represents a single user-agent group.
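
Because the Factory output is produced through its Display implementation (an assumption based on the to_string() call in the example above), it can be sent to any std::io::Write sink with the write! macro. A minimal sketch, where the write_robots helper and its file destination are illustrative only:

use std::io::Write;

use robotxt::Factory;

// Hypothetical helper: renders a small Factory into any io::Write sink.
fn write_robots(mut out: impl Write) -> std::io::Result<()> {
    let txt = Factory::default()
        .group(["foobot"], |u| u.disallow("/private/"));
    // write! forwards the Display output to the underlying writer.
    write!(out, "{txt}")
}

// e.g. write_robots(std::fs::File::create("robots.txt")?)?;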

Enums

Constants