# fast-robots

[![Crates.io](https://img.shields.io/crates/v/fast-robots)](https://crates.io/crates/fast-robots)
[![Crates.io Downloads](https://img.shields.io/crates/d/fast-robots)](https://crates.io/crates/fast-robots)
[![Docs.rs](https://img.shields.io/docsrs/fast-robots)](https://docs.rs/fast-robots)
[![License](https://img.shields.io/badge/license-Apache--2.0%2FMIT-blue)](#license)
[![MSRV](https://img.shields.io/badge/MSRV-1.85.1-orange)](https://blog.rust-lang.org/2025/02/20/Rust-1.85.0/)
[![Rust Edition](https://img.shields.io/badge/Rust-2024-blue)](https://doc.rust-lang.org/edition-guide/rust-2024/)

A zero-copy `robots.txt` parser for Rust with SIMD-accelerated byte scanning, RFC 9309 access checks, feature-gated extension metadata, and a tiny `argh` CLI.

<p align="center">
  <img src="logo.png" alt="fast-robots logo" width="300">
  <br>
  <sub><i>Disclaimer: I can't design. This logo was generated using ChatGPT.</i></sub>
</p>

## Motivation

`robots.txt` is line-oriented and byte-oriented. That makes a hand-rolled parser a better fit than a big parser-combinator stack: fewer allocations, direct control over error recovery, and the hot path stays obvious.

The goal is simple: parse the standardized rules correctly, preserve useful ecosystem metadata like `Sitemap` and `Crawl-delay`, and use `memchr` where delimiter scanning actually matters.

## Features

- **Zero-copy parsing**: parsed agents, rules, and extension values borrow from the original input.
- **SIMD-backed scanning**: line splitting, comments, directive separators, and wildcard matching use `memchr`/`memmem` primitives.
- **RFC 9309 core**:
  - `User-agent`
  - `Allow`
  - `Disallow`
  - `#` comments
  - `*` wildcard matching
  - `$` end-anchor matching
- **Correct access semantics** (see the sketch after this list):
  - matching groups are merged
  - `*` fallback group is used only when no exact user-agent group matches
  - longest matching rule wins
  - `Allow` wins ties
  - empty `Disallow:` does not block anything
  - `/robots.txt` is implicitly allowed
- **Feature-gated extensions**: `Sitemap`, `Crawl-delay`, `Host`, `Clean-param`, and unknown directives are collected behind the `extensions` feature.
- **CLI included**: inspect parsed files and check whether a path is allowed from the terminal.
- **Small dependency surface**: runtime dependencies are currently `memchr` and `argh`.
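
Both the pattern syntax and the precedence rules above map directly onto the public API. A quick sketch, using the same `parse`/`is_allowed` calls shown under Usage:

```rust
use fast_robots::RobotsTxt;

// Longest matching rule wins: `Allow: /shop/sale` beats `Disallow: /shop`.
let robots = RobotsTxt::parse(
    "User-agent: ExampleBot\nDisallow: /shop\nAllow: /shop/sale\n",
);
assert!(!robots.is_allowed("ExampleBot", "/shop/other"));
assert!(robots.is_allowed("ExampleBot", "/shop/sale/today"));

// `*` and `$`: block only paths that end in ".json".
let wild = RobotsTxt::parse("User-agent: *\nDisallow: /*.json$\n");
assert!(!wild.is_allowed("ExampleBot", "/api/data.json"));
assert!(wild.is_allowed("ExampleBot", "/api/data.jsonl"));

// `/robots.txt` is implicitly allowed even under a blanket `Disallow: /`.
let strict = RobotsTxt::parse("User-agent: *\nDisallow: /\n");
assert!(strict.is_allowed("ExampleBot", "/robots.txt"));
assert!(!strict.is_allowed("ExampleBot", "/index.html"));
```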

## Installation

Add this to your `Cargo.toml`:

```toml
[dependencies]
fast-robots = "0.1.0"
```

The `extensions` feature is enabled by default. To opt out and keep only the RFC 9309 core:

```toml
[dependencies]
fast-robots = { version = "0.1.0", default-features = false }
```

## Usage

```rust
use fast_robots::RobotsTxt;

let input = r#"
User-agent: *
Disallow: /private/
Allow: /private/public/
Sitemap: https://example.com/sitemap.xml
"#;

let robots = RobotsTxt::parse(input);

assert!(!robots.is_allowed("ExampleBot", "/private/file.html"));
assert!(robots.is_allowed("ExampleBot", "/private/public/file.html"));
```

For many checks against the same parsed file, build a reusable matcher once:

```rust
use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

assert!(!matcher.is_allowed("ExampleBot", "/private/file.html"));
assert!(matcher.is_allowed("ExampleBot", "/public/file.html"));
```

`RobotsTxt::is_allowed()` is still the lowest-overhead choice for one-off checks. `RobotsTxt::matcher()` allocates an index and precomputes rule metadata, so it pays off when the same parsed `robots.txt` is checked many times.
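
For example, a crawler can filter a whole batch of candidate paths through one matcher (a sketch reusing only the calls shown above):

```rust
use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse("User-agent: *\nDisallow: /private/\n");
let matcher = robots.matcher();

// Reuse one matcher across a crawl frontier instead of resolving the
// user-agent group again on every call.
let frontier = ["/", "/private/a.html", "/blog/post", "/private/b.html"];
let allowed: Vec<&str> = frontier
    .into_iter()
    .filter(|path| matcher.is_allowed("ExampleBot", path))
    .collect();

assert_eq!(allowed, ["/", "/blog/post"]);
```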

### Fallible Parsing

`RobotsTxt::parse(&str)` is intentionally tolerant and infallible. Malformed lines are ignored, since crawlers are expected to use whatever rules they can still parse.

Use the fallible byte APIs when reading untrusted files directly:

```rust
use fast_robots::{ParseOptions, RobotsTxt};

let bytes = b"User-agent: *\nDisallow: /private\n";
let robots = RobotsTxt::parse_bytes(bytes)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));

let robots = RobotsTxt::parse_bytes_with_options(
    bytes,
    ParseOptions {
        max_bytes: Some(512 * 1024),
    },
)?;

assert!(!robots.is_allowed("ExampleBot", "/private"));
# Ok::<(), fast_robots::ParseError>(())
```

Hard errors are reserved for conditions that prevent safe parsing, such as invalid UTF-8 or inputs over the configured size limit.
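
For instance, invalid UTF-8 is rejected rather than silently repaired (a small sketch of the failure path):

```rust
use fast_robots::RobotsTxt;

// 0xFF is never valid UTF-8, so the fallible API refuses the input
// instead of guessing an encoding.
let bad = b"User-agent: *\nDisallow: /\xFF\n";
assert!(RobotsTxt::parse_bytes(bad).is_err());
```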

### Diagnostics

Use diagnostics when you want validator-style feedback without changing tolerant parser behavior:

```rust
use fast_robots::{ParseWarningKind, RobotsTxt};

let report = RobotsTxt::parse_with_diagnostics(
    "Disallow: /\nMissing separator\nUser-agent: *\nDisallow: /private\n",
);

assert_eq!(report.warnings.len(), 2);
assert!(matches!(
    report.warnings[0].kind,
    ParseWarningKind::RuleBeforeUserAgent { .. }
));
assert!(!report.robots.is_allowed("ExampleBot", "/private"));
```

### Extensions

With the default `extensions` feature, non-core records are preserved as metadata:

```rust
use fast_robots::RobotsTxt;

let robots = RobotsTxt::parse(r#"
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Crawl-delay: 5
Disallow: /slow/
Host: example.com
Clean-param: ref /shop
X-Experimental: yes
"#);

assert_eq!(robots.extensions.sitemaps, ["https://example.com/sitemap.xml"]);
assert_eq!(robots.extensions.crawl_delays[0].agents, ["Bingbot"]);
assert_eq!(robots.extensions.crawl_delays[0].value, "5");
assert!(!robots.is_allowed("Bingbot", "/slow/page.html"));
```

Extensions are metadata only. They do not affect `is_allowed()`.

### CLI

Parse a file:

```bash
cargo run -- parse robots.txt
```

Check a path:

```bash
cargo run -- check robots.txt --agent Googlebot --path /private/page.html
```

Exit codes for `check`:

- `0`: allowed
- `1`: disallowed
- `2`: file read error

## How it works

1. **Line scan**: the parser walks the input with `memchr(b'\n', ...)` and strips optional `\r`.
2. **Comment scan**: `memchr(b'#', ...)` removes inline comments.
3. **Directive split**: `memchr(b':', ...)` separates key/value records.
4. **Core parse**: `user-agent`, `allow`, and `disallow` are matched ASCII-case-insensitively.
5. **Extension collection**: when enabled, non-core records are stored without changing group boundaries.
6. **Access check**: matching groups are evaluated using longest-match semantics, with `Allow` preferred on equal specificity. `RobotsTxt::matcher()` can pre-index groups and rule metadata for repeated checks.
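
A condensed sketch of steps 1 through 3 (not the crate's actual code, just the shape of the scanning loop it describes):

```rust
use memchr::memchr;

// Split lines on b'\n', strip inline comments at b'#', and split each
// record on the first b':'. The real parser layers group tracking,
// wildcard matching, and extension collection on top of a loop like this.
fn scan(input: &str) -> Vec<(&str, &str)> {
    let mut records = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        // 1. Line scan: find the next newline with a SIMD-backed search.
        let (line, tail) = match memchr(b'\n', rest.as_bytes()) {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        rest = tail;
        let line = line.strip_suffix('\r').unwrap_or(line);
        // 2. Comment scan: drop everything after '#'.
        let line = match memchr(b'#', line.as_bytes()) {
            Some(i) => &line[..i],
            None => line,
        };
        // 3. Directive split: key and value around the first ':'.
        if let Some(i) = memchr(b':', line.as_bytes()) {
            records.push((line[..i].trim(), line[i + 1..].trim()));
        }
    }
    records
}
```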

## Why not nom?

`nom` is good, but this format is mostly delimiter scanning and small state transitions. A manual parser keeps the important choices visible:

- which bytes are scanned with SIMD-backed routines
- how malformed lines recover
- when groups start and end
- which records are access-control rules versus metadata
- how much allocation happens

Parser combinators can still be useful for more complex formats. Here they would mostly hide a simple loop.

## Extension Semantics

`fast-robots` treats extensions conservatively:

- `Sitemap`: global metadata; can appear anywhere.
- `Crawl-delay`: stored with the current group agents when present.
- `Host`: stored as Yandex-style metadata.
- `Clean-param`: stored as Yandex-style metadata.
- unknown directives: stored as `Directive { key, value }`.

Extension records must not terminate groups or interfere with RFC 9309 parsing.
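
That invariant is observable through the access API alone:

```rust
use fast_robots::RobotsTxt;

// The Crawl-delay between the agent line and the rule does not end the
// group, so the Disallow still applies to ExampleBot.
let robots = RobotsTxt::parse(
    "User-agent: ExampleBot\nCrawl-delay: 5\nDisallow: /private/\n",
);
assert!(!robots.is_allowed("ExampleBot", "/private/page.html"));
```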

## Building

```bash
cargo build
cargo test
cargo test --no-default-features
cargo clippy --all-targets --all-features
```

## Benchmarks

Benchmarks use Criterion.rs and generated fixtures so large test data does not need to live in the repository. Current results are tracked in [BENCHMARK.md](BENCHMARK.md).

Current benchmark groups:

| Group | Workload | Goal |
|-------|----------|------|
| `parse` | tiny, common, many groups, many rules, wildcard-heavy, extension-heavy, 500 KiB | parser throughput |
| `match` | many rules, wildcard-heavy | `is_allowed()` and precompiled matcher throughput after parsing once |
| `parse_match` | tiny, common, many rules, 500 KiB | end-to-end parse plus access decision |

The `parse_match` group compares `fast-robots` against `robotstxt`, the Rust port of Google's robots.txt parser and matcher. This is an API-level comparison, not a claim that the two crates currently have identical behavior for every edge case.

Run all benchmarks:

```bash
cargo bench
```

Run only this crate's benchmark target:

```bash
cargo bench --bench robots
```

Quick local sanity check with a smaller sample size:

```bash
cargo bench --bench robots -- --sample-size 10 --warm-up-time 0.1 --measurement-time 0.2
```

## Caveats

- **Not an authorization system**: `robots.txt` is a crawler cooperation protocol, not access control.
- **UTF-8 required**: `parse_bytes` methods validate UTF-8 and return a `ParseError` for invalid encoding. Non-UTF-8 encodings (e.g., Latin-1, Windows-1252) are not supported.
- **No URI percent-normalization yet**: RFC 9309 has specific percent-encoding comparison rules. The current matcher focuses on path pattern semantics and should grow a normalization layer before claiming full crawler equivalence.
- **Extensions vary by crawler**: Google ignores `Crawl-delay`; Bing honors it; other crawlers differ. This crate stores extension metadata but does not enforce crawl scheduling.
- **SIMD is delegated**: `memchr` selects optimized implementations where supported and falls back safely elsewhere.

## Choosing Strictness

| Mode | Cargo config | Use case |
|------|--------------|----------|
| Core + extensions | `fast-robots = "0.1"` | most applications that want sitemaps and metadata |
| Core only | `fast-robots = { version = "0.1", default-features = false }` | strict RFC access checks with less metadata |

## Security

Please see [SECURITY.md](SECURITY.md) for vulnerability reporting.

## License

Licensed under either of:

- **Apache License, Version 2.0** ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
- **MIT license** ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.