iscrawl
Tiny crawler/bot detection from User-Agent strings.
sub-140ns per call
zero allocations
zero dependencies
Install
[]
= "1.0"
Use
use is_crawler;
assert!;
assert!;
One function. One bool. That's it.
Why fast
- Stack buffer of 512 bytes, no heap.
- First-byte lookup table prunes 99% of needle scans.
- Single pass over the lowered input.
lto = "fat",codegen-units = 1,panic = "abort".
Benchmarked over a corpus of ~25k real User-Agents: best case ~130 ns/call on x86_64.
How it decides
- Empty input counts as crawler.
- Input over 512 bytes is rejected (returns
false). - If any crawler keyword (
bot,crawl,spider,scanner,+http,@,archive, ...) appears: crawler. - If the UA does not start with
Mozilla/orOpera/and has no known browser engine token (gecko,webkit,chrome,firefox,msie,edge,opera, ...): crawler. - If the UA starts with
Mozilla/orOpera/but is missing both an engine token and the(compatible;marker: crawler. - Otherwise: browser.
Heuristic, not a database. Trades a sliver of recall for speed and zero maintenance.
Accuracy
Measured against bundled fixture corpora:
| corpus | size | result |
|---|---|---|
| crawler_user_agents.txt | 2,149 | 95.4% detected |
| loadkpi_crawlers.txt | 3,696 | 94.7% detected |
| crawler_user_agents_pgts.txt | 156 | 98.1% detected |
| browser_user_agents.txt | 19,897 | <1% false positive |
Run cargo test --release to verify on your machine.
Develop
Format
Standard formatting across Rust + markdown/yaml.
CI enforces cargo fmt --check and cargo clippy -D warnings on every push.
Publish
Pushes to main/master trigger .github/workflows/publish.yml:
- Reads
name+versionfromCargo.toml. - Skips if that version is already on crates.io.
- Tests on Ubuntu, macOS, Windows × stable, beta.
- Runs
cargo publishusingCARGO_REGISTRY_TOKEN. - Tags the commit
vX.Y.Z.
Bump version in Cargo.toml, push to main, done.
Funding
If this saved you time, buy me a coffee.
License
Apache-2.0. See LICENSE.