iscrawl 1.0.0

Tiny, allocation-free crawler/bot detection from User-Agent strings. Sub-140ns per call.

iscrawl

Tiny crawler/bot detection from User-Agent strings.

  • Sub-140ns per call
  • Zero allocations
  • Zero dependencies

Install

[dependencies]
iscrawl = "1.0"

Use

use iscrawl::is_crawler;

assert!(is_crawler("Googlebot/2.1 (+http://www.google.com/bot.html)"));
assert!(!is_crawler(
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
));

One function. One bool. That's it.

Why fast

  • Stack buffer of 512 bytes, no heap.
  • First-byte lookup table prunes 99% of needle scans.
  • Single pass over the lowered input.
  • lto = "fat", codegen-units = 1, panic = "abort".

Benchmarked over a corpus of ~25k real User-Agents: best case ~130 ns/call on x86_64.
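
To make the first two bullets concrete, here is a minimal sketch of a first-byte lookup table combined with a fixed stack buffer. It is not the crate's source: `NEEDLES`, `FIRST_BYTE`, `MAX_LEN`, and `contains_needle` are hypothetical names, and the needle list is a short stand-in for whatever the crate actually ships.

```rust
/// Hypothetical needle list for illustration; the real crate's list is assumed to be larger.
const NEEDLES: &[&[u8]] = &[b"bot", b"crawl", b"spider", b"scanner", b"archive"];

/// Upper bound on input length, matching the 512-byte stack buffer.
const MAX_LEN: usize = 512;

/// Builds a 256-entry table at compile time:
/// `table[b]` is true if at least one needle starts with byte `b`.
const fn build_first_byte_table() -> [bool; 256] {
    let mut table = [false; 256];
    let mut i = 0;
    while i < NEEDLES.len() {
        table[NEEDLES[i][0] as usize] = true;
        i += 1;
    }
    table
}

const FIRST_BYTE: [bool; 256] = build_first_byte_table();

/// Returns true if any needle occurs in `ua` (ASCII case-insensitive).
/// Inputs longer than `MAX_LEN` return false.
pub fn contains_needle(ua: &str) -> bool {
    let bytes = ua.as_bytes();
    if bytes.len() > MAX_LEN {
        return false;
    }

    // Lowercase into a fixed-size stack buffer; nothing touches the heap.
    let mut buf = [0u8; MAX_LEN];
    for (dst, src) in buf.iter_mut().zip(bytes) {
        *dst = src.to_ascii_lowercase();
    }
    let lowered = &buf[..bytes.len()];

    for (i, &b) in lowered.iter().enumerate() {
        // Most positions are rejected here with a single table load,
        // so the needle comparison below runs only rarely.
        if !FIRST_BYTE[b as usize] {
            continue;
        }
        let rest = &lowered[i..];
        if NEEDLES.iter().any(|needle| rest.starts_with(needle)) {
            return true;
        }
    }
    false
}
```

For simplicity this sketch lowercases first and scans second, which is two loops over the input; the crate describes a single pass, so its real structure presumably differs.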

How it decides

  1. Empty input counts as crawler.
  2. Input over 512 bytes is rejected (returns false).
  3. If any crawler keyword (bot, crawl, spider, scanner, +http, @, archive, ...) appears: crawler.
  4. If the UA does not start with Mozilla/ or Opera/ and has no known browser engine token (gecko, webkit, chrome, firefox, msie, edge, opera, ...): crawler.
  5. If the UA starts with Mozilla/ or Opera/ but is missing both an engine token and the "(compatible;" marker: crawler.
  6. Otherwise: browser.

Heuristic, not a database. Trades a sliver of recall for speed and zero maintenance.
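
The six rules above can be sketched as a plain function. This is an illustration of the decision order only, not the crate's implementation: `looks_like_crawler`, `CRAWLER_KEYWORDS`, and `ENGINE_TOKENS` are hypothetical names, the lists contain only the tokens named above (the "..." entries are left out), and matching on an ASCII-lowercased copy, including for the Mozilla/ and Opera/ prefixes, is an assumption.

```rust
/// Only the keywords named in the rules above; the crate's full list is not reproduced.
const CRAWLER_KEYWORDS: &[&str] = &["bot", "crawl", "spider", "scanner", "+http", "@", "archive"];

/// Only the engine tokens named in the rules above.
const ENGINE_TOKENS: &[&str] = &["gecko", "webkit", "chrome", "firefox", "msie", "edge", "opera"];

pub fn looks_like_crawler(ua: &str) -> bool {
    // Rule 1: empty input counts as crawler.
    if ua.is_empty() {
        return true;
    }
    // Rule 2: input over 512 bytes is rejected.
    if ua.len() > 512 {
        return false;
    }

    // Note: this allocates a String, unlike the crate, to keep the sketch short.
    let lowered = ua.to_ascii_lowercase();

    // Rule 3: any crawler keyword means crawler.
    if CRAWLER_KEYWORDS.iter().any(|k| lowered.contains(*k)) {
        return true;
    }

    let has_engine = ENGINE_TOKENS.iter().any(|t| lowered.contains(*t));
    let has_browser_prefix = lowered.starts_with("mozilla/") || lowered.starts_with("opera/");

    if !has_browser_prefix {
        // Rule 4: no browser prefix and no engine token means crawler.
        return !has_engine;
    }

    // Rule 5: browser prefix, but neither an engine token nor "(compatible;".
    if !has_engine && !lowered.contains("(compatible;") {
        return true;
    }

    // Rule 6: otherwise, browser.
    false
}
```

Under these assumptions, a bare `curl/8.4.0` is flagged by rule 4 and a bare `Mozilla/5.0` by rule 5, while a legacy `Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)` passes as a browser.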

Accuracy

Measured against bundled fixture corpora:

corpus                         size     result
crawler_user_agents.txt        2,149    95.4% detected
loadkpi_crawlers.txt           3,696    94.7% detected
crawler_user_agents_pgts.txt   156      98.1% detected
browser_user_agents.txt        19,897   <1% false positive

Run cargo test --release to verify on your machine.
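
For checking a corpus of your own, a rate like the ones in the table can be computed with a small helper. This is a sketch, not the crate's test harness: `detection_rate` is a hypothetical name, and the one-User-Agent-per-line format with blank lines skipped is an assumption about the fixture files.

```rust
/// Fraction of non-empty lines in `corpus` for which `classify` returns true.
/// Returns 0.0 for a corpus with no non-empty lines.
pub fn detection_rate(corpus: &str, classify: impl Fn(&str) -> bool) -> f64 {
    let mut total = 0usize;
    let mut hits = 0usize;

    // Assumes one User-Agent per line; surrounding whitespace is trimmed.
    for line in corpus.lines().map(str::trim).filter(|l| !l.is_empty()) {
        total += 1;
        if classify(line) {
            hits += 1;
        }
    }

    if total == 0 {
        0.0
    } else {
        hits as f64 / total as f64
    }
}
```

With the crate as a dependency, passing `iscrawl::is_crawler` as `classify` over the contents of a crawler list gives the detection rate; over a browser list, the same number is the false-positive rate.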

Develop

cargo build --release
cargo test --release
cargo test --doc
cargo doc --no-deps --open

Format

Standard formatting across Rust, Markdown, and YAML.

cargo fmt --all
cargo clippy --all-targets -- -D warnings
npx --yes prettier --write --single-quote --print-width=100 --trailing-comma=es5 --end-of-line=lf "**/*.{md,yml}"

CI enforces cargo fmt --check and cargo clippy -- -D warnings on every push.

Publish

Pushes to main/master trigger .github/workflows/publish.yml:

  1. Reads name + version from Cargo.toml.
  2. Skips if that version is already on crates.io.
  3. Tests on Ubuntu, macOS, Windows × stable, beta.
  4. Runs cargo publish using CARGO_REGISTRY_TOKEN.
  5. Tags the commit vX.Y.Z.

Bump version in Cargo.toml, push to main, done.

Funding

If this saved you time, buy me a coffee.

License

Apache-2.0. See LICENSE.