# fast-robots

A zero-copy robots.txt parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny `argh` CLI.
## Motivation
robots.txt is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and an obvious hot path.

The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like `Sitemap` and `Crawl-delay`, and use `memchr` where delimiter scanning actually matters.
## Features

- Zero-copy parsing: parsed agents, rules, and extension values borrow from the original input.
- SIMD-backed scanning: line splitting, comments, directive separators, and wildcard matching use `memchr`/`memmem` primitives.
- RFC 9309 core: `User-agent`, `Allow`, `Disallow`, `#` comments, `*` wildcard matching, `$` end-anchor matching.
- Correct access semantics (see the example after this list):
  - matching groups are merged
  - the `*` fallback group is used only when no exact user-agent group matches
  - the longest matching rule wins
  - `Allow` wins ties
  - an empty `Disallow:` does not block anything
  - `/robots.txt` is implicitly allowed
- Feature-gated extensions: `Sitemap`, `Crawl-delay`, `Host`, `Clean-param`, and unknown directives are collected behind the `extensions` feature.
- CLI included: inspect parsed files and check whether a path is allowed from the terminal.
- Small dependency surface: runtime dependencies are currently `memchr` and `argh`.
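A small worked example of those precedence rules (a sketch using the `parse`/`is_allowed` API shown under Usage below; the user-agent-then-path argument order is an assumption carried through the examples):

```rust
use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(
    "User-agent: *\n\
     Disallow: /shop\n\
     Allow: /shop/sale\n\
     Disallow:\n",
);

// Longest matching rule wins: /shop/sale/item matches both `Disallow: /shop`
// and the longer `Allow: /shop/sale`, so it is allowed.
assert!(robots.is_allowed("anybot", "/shop/sale/item"));
// The shorter Disallow still blocks the rest of /shop.
assert!(!robots.is_allowed("anybot", "/shop/cart"));
// An empty `Disallow:` blocks nothing.
assert!(robots.is_allowed("anybot", "/anything-else"));
// /robots.txt itself is implicitly allowed.
assert!(robots.is_allowed("anybot", "/robots.txt"));
```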
## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
fast-robots = "0.1.0"
```

The `extensions` feature is enabled by default. To opt out and keep only the RFC 9309 core:

```toml
[dependencies]
fast-robots = { version = "0.1.0", default-features = false }
```
## Usage

```rust
use fast_robots::RobotsTxt;

let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;

let robots = RobotsTxt::parse(input);
assert!(!robots.is_allowed("anybot", "/private/page.html"));
assert!(robots.is_allowed("anybot", "/private/public/page.html"));
```
For many checks against the same parsed file, build a reusable matcher once:
```rust
use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(input); // same `input` as above
let matcher = robots.matcher();
assert!(!matcher.is_allowed("anybot", "/private/page.html"));
assert!(matcher.is_allowed("anybot", "/private/public/page.html"));
```
`RobotsTxt::is_allowed()` is still the lowest-overhead choice for one-off checks. `RobotsTxt::matcher()` allocates an index and precomputes rule metadata, which is intended for repeated checks against the same robots.txt.
## Fallible Parsing

`RobotsTxt::parse(&str)` is intentionally tolerant and infallible. Malformed lines are ignored, matching the expectation that crawlers use whichever rules they can successfully parse.
Use the fallible byte APIs when reading untrusted files directly:
```rust
use fast_robots::{ParseOptions, RobotsTxt};

let bytes = b"User-agent: *\nDisallow: /private\n";

let robots = RobotsTxt::parse_bytes(bytes)?;
assert!(!robots.is_allowed("anybot", "/private/page.html"));

// `ParseOptions` here stands in for the crate's options type (e.g. an input size limit).
let robots = RobotsTxt::parse_bytes_with_options(bytes, ParseOptions::default())?;
assert!(!robots.is_allowed("anybot", "/private/page.html"));
# Ok::<(), fast_robots::ParseError>(())
```
Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
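A sketch of the split between the two entry points, using the `parse`/`parse_bytes` calls shown above:

```rust
use fast_robots::RobotsTxt;

// Tolerant entry point: the malformed first line is skipped, the rest still applies.
let robots = RobotsTxt::parse("this line has no separator\nUser-agent: *\nDisallow: /private\n");
assert!(!robots.is_allowed("anybot", "/private/page.html"));

// Fallible entry point: invalid UTF-8 is a hard error rather than a skipped line.
assert!(RobotsTxt::parse_bytes(b"User-agent: *\nDisallow: /\xFF\n").is_err());
```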
## Diagnostics
Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:
```rust
use fast_robots::RobotsTxt;

// Field names on the report are illustrative.
let input = "Disalow /oops\nUser-agent: *\nDisallow: /private\n";
let report = RobotsTxt::parse_with_diagnostics(input);
assert_eq!(report.diagnostics.len(), 1);
assert!(report.diagnostics[0].line == 1);
assert!(!report.robots.is_allowed("anybot", "/private/page.html"));
```
## Extensions

With the default `extensions` feature, non-core records are preserved as metadata:
```rust
use fast_robots::RobotsTxt;

// Accessor names below are illustrative.
let input = "User-agent: *\nCrawl-delay: 2\nDisallow: /private\nHost: example.com\nSitemap: https://example.com/sitemap.xml\n";
let robots = RobotsTxt::parse(input);
assert_eq!(robots.sitemaps().len(), 1);
assert_eq!(robots.crawl_delays().len(), 1);
assert_eq!(robots.hosts().len(), 1);
assert!(robots.unknown_directives().is_empty());
```
Extensions are metadata only. They do not affect `is_allowed()`.
## CLI
Parse a file:
Check a path:
Exit codes for `check`:

- `0`: allowed
- `1`: disallowed
- `2`: file read error
## How it works

- Line scan: the parser walks the input with `memchr(b'\n', ...)` and strips an optional `\r` (see the sketch after this list).
- Comment scan: `memchr(b'#', ...)` removes inline comments.
- Directive split: `memchr(b':', ...)` separates key/value records.
- Core parse: `user-agent`, `allow`, and `disallow` are matched ASCII-case-insensitively.
- Extension collection: when enabled, non-core records are stored without changing group boundaries.
- Access check: matching groups are evaluated using longest-match semantics, with `Allow` preferred on equal specificity. `RobotsTxt::matcher()` can pre-index groups and rule metadata for repeated checks.
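In code, the first three steps amount to a loop like the sketch below. This is a simplified illustration of the scanning pipeline, not the crate's actual internals; it only extracts key/value records and ignores group handling:

```rust
use memchr::memchr;

/// Walk the input line by line and return trimmed (key, value) records.
fn scan_records(input: &[u8]) -> Vec<(&[u8], &[u8])> {
    let mut records = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // Line scan: find '\n' with memchr, then strip an optional '\r'.
        let (line, tail) = match memchr(b'\n', rest) {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, &rest[rest.len()..]),
        };
        rest = tail;
        let line = line.strip_suffix(b"\r").unwrap_or(line);
        // Comment scan: truncate the line at '#'.
        let line = match memchr(b'#', line) {
            Some(i) => &line[..i],
            None => line,
        };
        // Directive split: key/value around the first ':'.
        if let Some(i) = memchr(b':', line) {
            let key = trim_ascii(&line[..i]);
            let value = trim_ascii(&line[i + 1..]);
            if !key.is_empty() {
                records.push((key, value));
            }
        }
    }
    records
}

/// Trim leading and trailing ASCII whitespace from a byte slice.
fn trim_ascii(mut s: &[u8]) -> &[u8] {
    while let Some((b, rest)) = s.split_first() {
        if b.is_ascii_whitespace() { s = rest } else { break }
    }
    while let Some((b, rest)) = s.split_last() {
        if b.is_ascii_whitespace() { s = rest } else { break }
    }
    s
}
```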
## Why not nom?

`nom` is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:
- which bytes are scanned with SIMD-backed routines
- how malformed lines recover
- when groups start and end
- which records are access-control rules versus metadata
- how much allocation happens
Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.
## Extension Semantics

fast-robots treats extensions conservatively:

- `Sitemap`: global metadata; can appear anywhere.
- `Crawl-delay`: stored with the current group's agents when present.
- `Host`: stored as Yandex-style metadata.
- `Clean-param`: stored as Yandex-style metadata.
- unknown directives: stored as `Directive { key, value }`.
These extension records must not terminate groups or interfere with RFC 9309 parsing.
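For intuition, the stored metadata looks roughly like the shapes below. Field names and exact types are illustrative, not the crate's public API; the point is that extensions are plain borrowed metadata kept alongside, not inside, the RFC 9309 rule groups:

```rust
/// Illustrative shapes only; the crate's real types may differ.
struct Extensions<'a> {
    sitemaps: Vec<&'a str>,                  // global, position-independent
    crawl_delays: Vec<(Vec<&'a str>, f64)>,  // (agents of the enclosing group, delay)
    host: Option<&'a str>,                   // Yandex-style
    clean_params: Vec<&'a str>,              // Yandex-style
    unknown: Vec<Directive<'a>>,             // everything else, verbatim
}

struct Directive<'a> {
    key: &'a str,
    value: &'a str,
}
```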
## Building
## Benchmarks
Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in BENCHMARK.md.
Current benchmark groups:
| Group | Workload | Goal |
|---|---|---|
| `parse` | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| `match` | many rules, wildcard-heavy | `is_allowed()` and precompiled matcher throughput after parsing once |
| `parse_match` | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |
The `parse_match` group compares `fast-robots` against `robotstxt`, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.
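For reference, one of these groups looks roughly like the Criterion sketch below. The group, fixture, and benchmark ID names are illustrative; only the `RobotsTxt::parse` call is the API shown above:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use fast_robots::RobotsTxt;

fn bench_parse(c: &mut Criterion) {
    // Fixture generated at bench time instead of being checked into the repository.
    let fixture: String = (0..1_000)
        .map(|i| format!("User-agent: bot{i}\nDisallow: /private/{i}\n"))
        .collect();

    let mut group = c.benchmark_group("parse");
    group.throughput(Throughput::Bytes(fixture.len() as u64));
    group.bench_with_input(
        BenchmarkId::new("many_groups", fixture.len()),
        &fixture,
        |b, input| b.iter(|| RobotsTxt::parse(input)),
    );
    group.finish();
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);
```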
Run all benchmarks:
Run only this crate's benchmark target:
Quick local sanity check with a smaller sample size:
## Caveats

- Not an authorization system: `robots.txt` is a crawler cooperation protocol, not access control.
- UTF-8 required: the `parse_bytes` methods validate UTF-8 and return a `ParseError` for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
- No URI percent-normalization yet: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
- Extensions vary by crawler: Google ignores `Crawl-delay`, Bing honors it, and other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
- SIMD is delegated: `memchr` selects optimized implementations where supported and falls back safely elsewhere.
## Choosing Strictness

| Mode | Cargo config | Use case |
|---|---|---|
| Core + extensions | `fast-robots = "0.1"` | most applications that want sitemaps and metadata |
| Core only | `fast-robots = { version = "0.1", default-features = false }` | strict RFC access checks with less metadata |
## Security
Please see SECURITY.md for vulnerability reporting.
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
## Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.