# fast-robots
[](https://crates.io/crates/fast-robots)
[](https://crates.io/crates/fast-robots)
[](https://docs.rs/fast-robots)
[](#license)
[](https://blog.rust-lang.org/2025/02/20/Rust-1.85.0/)
[](https://doc.rust-lang.org/edition-guide/rust-2024/)
A zero-copy `robots.txt` parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny `argh` CLI.
<p align="center">
<img src="logo.png" alt="fast-robots logo" width="300">
<br>
<sub><i>Disclaimer: I can't design. This logo was generated using ChatGPT.</i></sub>
</p>
## Motivation
`robots.txt` is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.
The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like `Sitemap` and `Crawl-delay`, and use `memchr` where delimiter scanning actually matters.
## Features
- **Zero-copy parsing**: parsed agents, rules, and extension values borrow from the original input.
- **SIMD-backed scanning**: line splitting, comments, directive separators, and wildcard matching use `memchr`/`memmem` primitives.
- **RFC 9309 core**:
- `User-agent`
- `Allow`
- `Disallow`
- `#` comments
- `*` wildcard matching
- `$` end-anchor matching
- **Correct access semantics**:
- matching groups are merged
- `*` fallback group is used only when no exact user-agent group matches
- longest matching rule wins
- `Allow` wins ties
- empty `Disallow:` does not block anything
- `/robots.txt` is implicitly allowed
- **Feature-gated extensions**: `Sitemap`, `Crawl-delay`, `Host`, `Clean-param`, and unknown directives are collected behind the `extensions` feature.
- **CLI included**: inspect parsed files and check whether a path is allowed from the terminal.
- **Small dependency surface**: runtime dependencies are currently `memchr` and `argh`.
## Installation
Add this to your `Cargo.toml`:
```toml
[dependencies]
fast-robots = "0.1.0"
```
The `extensions` feature is enabled by default. Disable it (along with any other default features) for core-only parsing:
```toml
[dependencies]
fast-robots = { version = "0.1.0", default-features = false }
```
## Usage
```rust
use fast_robots::RobotsTxt;
let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;
let robots = RobotsTxt::parse(input);
assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));
```
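The precedence rules from the feature list above can be checked directly with `is_allowed()`. The assertions below restate the documented semantics (longest match wins, `Allow` wins ties, an empty `Disallow:` blocks nothing, `/robots.txt` is implicitly allowed) rather than being lifted from the crate's test suite:
```rust
use fast_robots::RobotsTxt;

// `Allow` and `Disallow` patterns of equal length tie, and `Allow` wins.
let robots = RobotsTxt::parse("User-agent: *\nDisallow: /a\nAllow: /a\n");
assert!(robots.is_allowed("ExampleBot", "/a"));

// An empty `Disallow:` blocks nothing.
let robots = RobotsTxt::parse("User-agent: *\nDisallow:\n");
assert!(robots.is_allowed("ExampleBot", "/anything"));

// `/robots.txt` itself stays reachable even under `Disallow: /`.
let robots = RobotsTxt::parse("User-agent: *\nDisallow: /\n");
assert!(robots.is_allowed("ExampleBot", "/robots.txt"));
assert!(!robots.is_allowed("ExampleBot", "/page.html"));
```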
For many checks against the same parsed file, build a reusable matcher once:
```rust
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();
assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));
```
`RobotsTxt::is_allowed()` is still the lowest-overhead choice for one-off checks. `RobotsTxt::matcher()` allocates an index and precomputes rule metadata, and is intended for repeated checks against the same `robots.txt`.
### Fallible Parsing
`RobotsTxt::parse(&str)` is intentionally tolerant and infallible. Malformed lines are ignored, since crawlers are expected to apply whatever rules they can successfully parse.
Use the fallible byte APIs when reading untrusted files directly:
```rust
use fast_robots::{ParseOptions, RobotsTxt};
let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;
assert!(!robots.is_allowed("ExampleBot", "/private"));
let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;
assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())
```
Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
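For example, invalid UTF-8 is rejected outright rather than repaired or skipped. This is a minimal sketch; it assumes `ParseError` implements `Display`:
```rust
use fast_robots::RobotsTxt;

// 0xFF can never appear in valid UTF-8, so the whole input is refused
// instead of the offending line being dropped.
let invalid = b"User-agent: *\nDisallow: /priv\xFFate\n";
if let Err(err) = RobotsTxt::parse_bytes(invalid) {
    eprintln!("refused to parse: {err}");
}
```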
### Diagnostics
Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:
```rust
use fast_robots::{ParseWarningKind, RobotsTxt};
let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);
assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));
```
### Extensions
With the default `extensions` feature, non-core records are preserved as metadata:
```rust
use fast_robots::RobotsTxt;
let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);
assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));
```
Extensions are metadata only. They do not affect `is_allowed()`.
### CLI
Parse a file:
```bash
cargo run -- parse robots.txt
```
Check a path:
```bash
cargo run -- check robots.txt --agent Googlebot --path /private/page.html
```
Exit codes for `check`:
- `0`: allowed
- `1`: disallowed
- `2`: file read error
## How it works
1. **Line scan**: the parser walks the input with `memchr(b'\n', ...)` and strips optional `\r`.
2. **Comment scan**: `memchr(b'#', ...)` removes inline comments.
3. **Directive split**: `memchr(b':', ...)` separates key/value records.
4. **Core parse**: `user-agent`, `allow`, and `disallow` are matched ASCII-case-insensitively.
5. **Extension collection**: when enabled, non-core records are stored without changing group boundaries.
6. **Access check**: matching groups are evaluated using longest-match semantics, with `Allow` preferred on equal specificity. `RobotsTxt::matcher()` can pre-index groups and rule metadata for repeated checks.
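As an illustration only, steps 1 through 3 can be sketched with the `memchr` crate. The `scan` helper below is hypothetical and written for this example; it is not the crate's internal code, but it follows the same delimiter-driven shape:
```rust
use memchr::memchr;

// Hypothetical helper: yields raw (key, value) records without
// interpreting them as core rules or extensions.
fn scan(input: &str) -> Vec<(&str, &str)> {
    let bytes = input.as_bytes();
    let mut records = Vec::new();
    let mut start = 0;
    while start < bytes.len() {
        // Step 1: line scan with memchr(b'\n', ...), stripping an optional '\r'.
        let end = memchr(b'\n', &bytes[start..]).map_or(bytes.len(), |i| start + i);
        let mut line = &input[start..end];
        if let Some(stripped) = line.strip_suffix('\r') {
            line = stripped;
        }
        // Step 2: comment scan with memchr(b'#', ...).
        if let Some(hash) = memchr(b'#', line.as_bytes()) {
            line = &line[..hash];
        }
        // Step 3: directive split on the first ':'.
        if let Some(colon) = memchr(b':', line.as_bytes()) {
            let (key, value) = (line[..colon].trim(), line[colon + 1..].trim());
            if !key.is_empty() {
                records.push((key, value));
            }
        }
        start = end + 1;
    }
    records
}

let records = scan("User-agent: * # everyone\nDisallow: /private/\n");
assert_eq!(records, [("User-agent", "*"), ("Disallow", "/private/")]);
```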
## Why not nom?
`nom` is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:
- which bytes are scanned with SIMD-backed routines
- how malformed lines recover
- when groups start and end
- which records are access-control rules versus metadata
- how much allocation happens
Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.
## Extension Semantics
`fast-robots` treats extensions conservatively:
- `Sitemap`: global metadata; can appear anywhere.
- `Crawl-delay`: stored with the current group agents when present.
- `Host`: stored as Yandex-style metadata.
- `Clean-param`: stored as Yandex-style metadata.
- unknown directives: stored as `Directive { key, value }`.
These non-core records never terminate groups or interfere with RFC 9309 parsing.
## Building
```bash
cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features
```
## Benchmarks
Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in [BENCHMARK.md](BENCHMARK.md).
Current benchmark groups:
| Group | Fixtures | Measures |
|---|---|---|
| `parse` | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| `match` | many rules, wildcard-heavy | `is_allowed()` and precompiled matcher throughput after parsing once |
| `parse_match` | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |
The `parse_match` group compares `fast-robots` against `robotstxt`, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.
Run all benchmarks:
```bash
cargo bench
```
Run only this crate's benchmark target:
```bash
cargo bench --bench robots
```
Quick local sanity check with a smaller sample size:
```bash
cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2
```
## Caveats
- **Not an authorization system**: `robots.txt` is a crawler cooperation protocol, not access control.
- **UTF-8 required**: `parse_bytes` methods validate UTF-8 and return a `ParseError` for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
- **No URI percent-normalization yet**: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
- **Extensions vary by crawler**: Google ignores `Crawl-delay`; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
- **SIMD is delegated**: `memchr` selects optimized implementations where supported and falls back safely elsewhere.
## Choosing Strictness
| Configuration | `Cargo.toml` | Best for |
|---|---|---|
| Core + extensions | `fast-robots = "0.1"` | most applications that want sitemaps and metadata |
| Core only | `fast-robots = { version = "0.1", default-features = false }` | strict RFC access checks with less metadata |
## Security
Please see [SECURITY.md](SECURITY.md) for vulnerability reporting.
## License
Licensed under either of:
- **Apache License, Version 2.0** ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- **MIT license** ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.