Fast, zero-copy parsing and matching for robots.txt files.
fast-robots parses the standardized User-agent, Allow, and
Disallow records used by crawlers, then evaluates paths using the RFC 9309
matching rules: exact user-agent groups are preferred over *, the longest
matching rule wins, and Allow wins ties.
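For example, a group that names a crawler exactly takes the place of the * group for that crawler; a minimal sketch using the same parse and is_allowed calls as the examples below:
use fast_robots::RobotsTxt;
// FooBot matches its own group, so the global Disallow never applies to it.
let robots = RobotsTxt::parse(
    "User-agent: *\n\
     Disallow: /\n\
     User-agent: FooBot\n\
     Allow: /\n",
);
assert!(robots.is_allowed("FooBot", "/anything"));
// Crawlers without an exact group fall back to *.
assert!(!robots.is_allowed("OtherBot", "/anything"));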
Parsed values borrow from the original input, so parsing avoids copying rule
strings, user agents, and extension metadata. Keep the input string or byte
buffer alive for as long as the returned RobotsTxt is used.
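For example, when the text comes from an owned buffer, keep the buffer in scope for the lifetime of the parsed value; a minimal sketch, assuming parse accepts any &str:
use fast_robots::RobotsTxt;
// `robots` borrows from `body`, so `body` must outlive it.
let body = String::from("User-agent: *\nDisallow: /private/\n");
let robots = RobotsTxt::parse(&body);
assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
// Dropping `body` while `robots` is still in use would not compile.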
§Quick Start
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse(
    "User-agent: *\n\
     Disallow: /private/\n\
     Allow: /private/public/\n",
);
assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));
§Fallible Byte Parsing
Use the byte APIs when reading directly from files or HTTP responses. They
reject invalid UTF-8 and inputs larger than DEFAULT_MAX_BYTES by default.
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse_bytes(b"User-agent: *\nDisallow: /tmp\n")?;
assert!(!robots.is_allowed("ExampleBot", "/tmp/cache"));
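Invalid UTF-8 is reported as an error rather than repaired; a minimal check that relies only on the documented rejection behavior:
use fast_robots::RobotsTxt;
// 0xFF can never appear in well-formed UTF-8, so parsing must fail.
assert!(RobotsTxt::parse_bytes(b"\xFF\xFE").is_err());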
§Diagnostics
The parser is tolerant by default: it skips malformed lines it can recover from and keeps parsing. Use diagnostics when you want validator-style warnings alongside the parsed rules.
use fast_robots::{ParseWarningKind, RobotsTxt};
let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(matches!(
    report.warnings[1].kind,
    ParseWarningKind::MissingSeparator { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));
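Warnings carry one-based line numbers, so they can be surfaced like compiler diagnostics; a sketch in which the field name line and the Debug impl on the kind are assumptions (check the ParseWarning docs):
use fast_robots::RobotsTxt;
let report = RobotsTxt::parse_with_diagnostics("Missing separator\n");
for warning in &report.warnings {
    // Assumption: `line` holds the one-based line number of the offending input line.
    eprintln!("robots.txt line {}: {:?}", warning.line, warning.kind);
}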
§Extension Metadata
With the default extensions feature, non-core directives such as Sitemap
and Crawl-delay are preserved as metadata. Extension metadata never changes
RobotsTxt::is_allowed decisions.
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse(
    "Sitemap: https://example.com/sitemap.xml\n\
     User-agent: SlowBot\n\
     Crawl-delay: 5\n\
     Disallow: /slow/\n",
);
assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["SlowBot"]);
assert!(!robots.is_allowed("SlowBot", "/slow/page.html"));Structs§
Structs§
- CleanParam (extensions) - A Clean-param directive value.
- CrawlDelay (extensions) - A Crawl-delay directive and the group agents active when it appeared.
- Directive (extensions) - A non-core directive preserved as a raw key/value pair.
- Extensions (extensions) - Feature-gated metadata for common non-standard directives.
- Group - A robots.txt group containing one or more user agents and their rules.
- ParseOptions - Options shared by fallible parsing APIs.
- ParseReport - Parsed rules plus any diagnostics collected during parsing.
- ParseWarning - A recoverable parse issue with its one-based line number.
- RobotsMatcher - Precompiled matcher for repeated access checks against one RobotsTxt; see the sketch after this list.
- RobotsTxt - Parsed robots.txt data.
- Rule - A single Allow or Disallow rule.
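When one crawler checks many paths against the same file, RobotsMatcher above is the precompiled entry point. A hypothetical sketch: the constructor name matcher and the single-argument is_allowed below are assumptions for illustration, not the crate's confirmed API:
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
// Hypothetical API: resolve the group for one agent once, then reuse it per path.
let matcher = robots.matcher("ExampleBot");
assert!(!matcher.is_allowed("/private/page.html"));
assert!(matcher.is_allowed("/public/page.html"));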
Enums§
- ParseError - Errors returned by fallible parsing APIs.
- ParseWarningKind - Recoverable parse warning categories.
- RuleKind - Access-control directive kind.
Constants§
- DEFAULT_MAX_BYTES - Default maximum accepted input size for fallible parsing APIs.