fast-robots

A zero-copy robots.txt parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny argh CLI.

Motivation

robots.txt is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.

The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like Sitemap and Crawl-delay, and use memchr where delimiter scanning actually matters.

Features

  • Zero-copy parsing: parsed agents, rules, and extension values borrow from the original input.
  • SIMD-backed scanning: line splitting, comments, directive separators, and wildcard matching use memchr/memmem primitives.
  • RFC 9309 core:
    • User-agent
    • Allow
    • Disallow
    • # comments
    • * wildcard matching
    • $ end-anchor matching
  • Correct access semantics:
    • matching groups are merged
    • * fallback group is used only when no exact user-agent group matches
    • longest matching rule wins
    • Allow wins ties
    • empty Disallow: does not block anything
    • /robots.txt is implicitly allowed
  • Feature-gated extensions: Sitemap, Crawl-delay, Host, Clean-param, and unknown directives are collected behind the extensions feature.
  • CLI included: inspect parsed files and check whether a path is allowed from the terminal.
  • Small dependency surface: runtime dependencies are currently memchr and argh.

Installation

Add this to your Cargo.toml:

[dependencies]
fast-robots = "0.1.2"

The extensions feature is enabled by default. To build without it, disable default features:

[dependencies]
fast-robots = { version = "0.1.2", default-features = false }

Usage

use fast_robots::RobotsTxt;

let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;

let robots = RobotsTxt::parse(input);

assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));

For many checks against the same parsed file, build a reusable matcher once:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));

RobotsTxt::is_allowed() is still the lowest-overhead choice for one-off checks. RobotsTxt::matcher() allocates an index and precomputes rule metadata, which is intended for repeated checks against the same robots.txt.
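
For example, a crawler that screens many candidate URLs against one file can hold onto the matcher and query it in a loop. A minimal sketch of that pattern; the agent name and path list are illustrative:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

// Reuse the same matcher for every candidate path instead of re-walking the groups.
let paths = ["/private/a.html", "/public/a.html", "/private/b.html"];
let allowed: Vec<&str> = paths
    .into_iter()
    .filter(|path| matcher.is_allowed("ExampleBot", path))
    .collect();

assert_eq!(allowed, ["/public/a.html"]);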

Fallible Parsing

RobotsTxt::parse(&str) is intentionally tolerant and infallible. Malformed lines are ignored because crawlers are expected to make use of whatever rules they can still parse.
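
For example, a line without a key/value separator is simply skipped and the rules around it still apply. A small sketch of this tolerant behavior:

use fast_robots::RobotsTxt;

// The middle line has no separator, so it is ignored; the surrounding rules still apply.
let robots = RobotsTxt::parse("User-agent: *\nthis line has no separator\nDisallow: /private/\n");

assert!(!robots.is_allowed("ExampleBot", "/private/page.html"));
assert!(robots.is_allowed("ExampleBot", "/public/page.html"));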

Use the fallible byte APIs when reading untrusted files directly:

use fast_robots::{ParseOptions, RobotsTxt};

let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));

let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())

Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
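
For instance, handing parse_bytes_with_options an input larger than max_bytes should return an error rather than a parsed document. A sketch; the sizes here are arbitrary:

use fast_robots::{ParseOptions, RobotsTxt};

// An input above the configured limit is rejected instead of being parsed.
let oversized = vec![b'#'; 2048];
let result = RobotsTxt::parse_bytes_with_options(
    &oversized,
    ParseOptions {
        max_bytes: Some(1024),
    },
);

assert!(result.is_err());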

Diagnostics

Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:

use fast_robots::{ParseWarningKind, RobotsTxt};

let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);

assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));

Extensions

With the default extensions feature, non-core records are preserved as metadata:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);

assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));

Extensions are metadata only. They do not affect is_allowed().
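
A quick way to see this: a group that contains only a Crawl-delay still allows every path, while the delay value is preserved as metadata. A sketch based on the behavior described above:

use fast_robots::RobotsTxt;

// Crawl-delay is recorded as metadata but never blocks a path by itself.
let robots = RobotsTxt::parse("User-agent: *\nCrawl-delay: 10\n");

assert!(robots.is_allowed("ExampleBot", "/anything.html"));
assert_eq!(robots.extensions.crawl_delays[0].value, "10");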

CLI

Parse a file:

cargo run -- parse robots.txt

Check a path:

cargo run -- check robots.txt --agent Googlebot --path /private/page.html

Exit codes for check:

  • 0: allowed
  • 1: disallowed
  • 2: file read error

How it works

  1. Line scan: the parser walks the input with memchr(b'\n', ...) and strips optional \r.
  2. Comment scan: memchr(b'#', ...) removes inline comments.
  3. Directive split: memchr(b':', ...) separates key/value records.
  4. Core parse: user-agent, allow, and disallow are matched ASCII-case-insensitively.
  5. Extension collection: when enabled, non-core records are stored without changing group boundaries.
  6. Access check: matching groups are evaluated using longest-match semantics, with Allow preferred on equal specificity. RobotsTxt::matcher() can pre-index groups and rule metadata for repeated checks.
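
A small example of step 6, using the semantics listed under Features; the agent name and paths are illustrative:

use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /\nAllow: /docs/public/\n");

// The longer Allow rule beats the shorter Disallow for this path.
assert!(robots.is_allowed("ExampleBot", "/docs/public/index.html"));
// Everything else is still blocked by Disallow: /.
assert!(!robots.is_allowed("ExampleBot", "/docs/internal.html"));
// /robots.txt itself stays reachable even under Disallow: /.
assert!(robots.is_allowed("ExampleBot", "/robots.txt"));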

Why not nom?

nom is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:

  • which bytes are scanned with SIMD-backed routines
  • how malformed lines recover
  • when groups start and end
  • which records are access-control rules versus metadata
  • how much allocation happens

Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.

Extension Semantics

fast-robots treats extensions conservatively:

  • Sitemap: global metadata; can appear anywhere.
  • Crawl-delay: stored with the current group agents when present.
  • Host: stored as Yandex-style metadata.
  • Clean-param: stored as Yandex-style metadata.
  • unknown directives: stored as Directive { key, value }.

These records never terminate a group or otherwise interfere with RFC 9309 parsing.
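
For example, a Sitemap record placed between two Disallow lines should leave both rules attached to the same group. A sketch of that expectation:

use fast_robots::RobotsTxt;

// The Sitemap line between the two Disallow rules does not close the group,
// so both rules still apply to the * agent.
let robots = RobotsTxt::parse(
    "User-agent: *\nDisallow: /a/\nSitemap: https://example.com/sitemap.xml\nDisallow: /b/\n",
);

assert!(!robots.is_allowed("ExampleBot", "/a/page.html"));
assert!(!robots.is_allowed("ExampleBot", "/b/page.html"));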

Building

cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features

Benchmarks

Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in BENCHMARK.md.

Current benchmark groups:

  • parse: workloads tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB; goal: parser throughput
  • match: workloads many rules, wildcard-heavy; goal: is_allowed() and precompiled matcher throughput after parsing once
  • parse_match: workloads tiny, common, many rules, 500 KiB; goal: end-to-end parse plus access decision

The parse_match group compares fast-robots against robotstxt, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.

Run all benchmarks:

cargo bench

Run only this crate's benchmark target:

cargo bench --bench robots

Quick local sanity check with a smaller sample size:

cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2

Caveats

  • Not an authorization system: robots.txt is a crawler cooperation protocol, not access control.
  • UTF-8 required: parse_bytes methods validate UTF-8 and return a ParseError for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
  • No URI percent-normalization yet: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
  • Extensions vary by crawler: Google ignores Crawl-delay; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
  • SIMD is delegated: memchr selects optimized implementations where supported and falls back safely elsewhere.

Choosing Strictness

  • Core + extensions: fast-robots = "0.1" (most applications that want sitemaps and metadata)
  • Core only: fast-robots = { version = "0.1", default-features = false } (strict RFC access checks with less metadata)

Security

Please see SECURITY.md for vulnerability reporting.

License

Licensed under either of:

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.