Crate robotxt

§robotxt

Also check out other spire-rs projects here.

An implementation of the robots.txt (or URL exclusion) protocol in the Rust programming language, with support for the crawl-delay, sitemap, and universal * match extensions (according to the RFC specification).

§Features

  • parser to enable robotxt::{Robots}. Enabled by default.
  • builder to enable robotxt::{RobotsBuilder, GroupBuilder}. Enabled by default.
  • optimal to optimize overlapping and global rules, potentially improving matching speed at the cost of longer parsing times.
  • serde to enable serde::{Deserialize, Serialize} implementations, allowing the caching of related rules (see the sketch after this list).
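
A minimal sketch of the caching use case behind the serde feature, assuming Robots implements Serialize and Deserialize as the feature description states, and assuming serde_json is added as a separate dependency for the example:

use robotxt::Robots;

fn main() -> Result<(), serde_json::Error> {
    let txt = "User-Agent: foobot\nDisallow: /private/";
    let robots = Robots::from_bytes(txt.as_bytes(), "foobot");

    // Serialize the parsed rules, e.g. to store them in a cache.
    let cached = serde_json::to_string(&robots)?;

    // Later: restore the rules without re-fetching robots.txt.
    let restored: Robots = serde_json::from_str(&cached)?;
    assert!(!restored.is_relative_allowed("/private/secret.txt"));
    Ok(())
}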

§Examples

  • parse the most specific user-agent in the provided robots.txt file:
use robotxt::Robots;

fn main() {
    let txt = r#"
      User-Agent: foobot
      Disallow: *
      Allow: /example/
      Disallow: /example/nope.txt
    "#;

    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(r.is_relative_allowed("/example/yeah.txt"));
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(!r.is_relative_allowed("/invalid/path.txt"));
}
  • build a new robots.txt file in a declarative manner:
use robotxt::RobotsBuilder;

fn main() -> Result<(), url::ParseError> {
    let txt = RobotsBuilder::default()
        .header("Robots.txt: Start")
        .group(["foobot"], |u| {
            u.crawl_delay(5)
                .header("Rules for Foobot: Start")
                .allow("/example/yeah.txt")
                .disallow("/example/nope.txt")
                .footer("Rules for Foobot: End")
        })
        .group(["barbot", "nombot"], |u| {
            u.crawl_delay(2)
                .disallow("/example/yeah.txt")
                .disallow("/example/nope.txt")
        })
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .sitemap("https://example.com/sitemap_1.xml".try_into()?)
        .footer("Robots.txt: End");

    println!("{}", txt.to_string());
    Ok(())
}
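
Since both features are enabled by default, the generated file can be fed straight back into the parser; a small round-trip sketch reusing only the calls shown above:

use robotxt::{Robots, RobotsBuilder};

fn main() {
    let txt = RobotsBuilder::default()
        .group(["foobot"], |u| u.disallow("/example/nope.txt"))
        .to_string();

    // Parse the freshly built file for the same user-agent.
    let r = Robots::from_bytes(txt.as_bytes(), "foobot");
    assert!(!r.is_relative_allowed("/example/nope.txt"));
    assert!(r.is_relative_allowed("/example/yeah.txt"));
}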

§Notes

§Re-exports

pub use url;

§Structs

GroupBuilder (builder)
The single formatted user-agent group.
Robots (parser)
The set of directives related to a specific user-agent in the provided robots.txt file.
RobotsBuilder (builder)
The set of formatted user-agent groups that can be written in the robots.txt compliant format.

§Enums

AccessResult (parser)
The result of the robots.txt retrieval attempt.
Error
Unrecoverable failure during robots.txt building or parsing.

§Constants

ALL_UAS (parser)
The group matching all user-agents, used as a default for user-agents that don’t have an explicitly defined matching group.
BYTE_LIMIT
Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). See How Google interprets Robots.txt.
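
A hedged sketch combining the two constants, assuming ALL_UAS is the wildcard user-agent string and BYTE_LIMIT is a usize byte count; the parse_capped helper is hypothetical:

use robotxt::{Robots, ALL_UAS, BYTE_LIMIT};

// Parses the rules that apply to every crawler, truncating oversized
// input to the documented size limit first.
fn parse_capped(body: &[u8]) -> Robots {
    let capped = &body[..body.len().min(BYTE_LIMIT)];
    Robots::from_bytes(capped, ALL_UAS)
}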

§Functions

create_url
Returns the expected path to the robots.txt file as a url::Url.
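
A minimal sketch, assuming create_url takes a borrowed url::Url and returns a robotxt::Result<url::Url>:

use robotxt::{create_url, url::Url};

fn main() -> robotxt::Result<()> {
    let page = Url::parse("https://example.com/foo/bar?query=1").unwrap();
    let robots_url = create_url(&page)?;
    println!("{robots_url}"); // typically https://example.com/robots.txt
    Ok(())
}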

§Type Aliases

Result
A specialized Result type for robotxt operations.