void_crawl_core 0.3.5

Rust-native CDP browser automation core — stealth-patched headless Chrome, profile leasing, captcha detection
Documentation
# Anti-bot signature corpus

The signature corpus embedded in [`src/antibot.rs`](src/antibot.rs) (the
`CORPUS_JSON` const) fingerprints which anti-bot / WAF / CDN vendor is gating an
HTTP response, and whether that vendor is *actively challenging* us (a wall) vs.
merely *present* (a CDN fronting a site that served us fine).

## Provenance & license

The vendor list and the `signals[]` / `challenge[]` split are modeled on
**`albinstman/antibot-print`** (MIT) — <https://github.com/albinstman/antibot-print>.

- **Upstream license:** MIT (compatible with this Apache-2.0 / MIT crate).
- **Reference commit:** *record the SHA you authored against here when syncing.*
- **What we took:** the *idea* and the vendor taxonomy. The actual regex
  patterns are **first-party** — hand-authored against the vendors we actually
  meet, not a wholesale vendoring of the upstream JSON. This keeps the ruleset
  small, readable, owned, and unit-testable, and avoids tracking upstream churn
  or importing a pattern that bloats compile.

## Governance

- The corpus is **not** auto-synced. Treat any change as a reviewable diff.
- **Bump `CORPUS_VERSION`** in `src/antibot.rs` on every edit. Verdicts are
  recorded with this version so a replay-grade archive can reproduce a
  classification deterministically — a verdict is captured data, never
  recomputed at replay time against a newer corpus.
- Drift is caught loudly by the **offline accuracy benchmark**
  (`tests/antibot_accuracy.rs`): a held-out labeled corpus scored for
  precision/recall. If a vendor changes its markers and a signature rots, that
  test fails rather than silently lowering field accuracy.
- The benchmark corpus is deliberately **disjoint** from the live canaries
  (`fortress.theplumber.dev`, `bot.sannysoft.com`) so we measure the detector,
  not one target.

## Adding / editing a vendor

1. Edit `CORPUS_JSON` in `src/antibot.rs`. Patterns are RE2-style (no
   backreferences / lookaround) so they run under the linear-time `regex`
   engine and are safe on attacker-controlled input. Matched case-insensitively
   against the normalized `S:`/`H:`/`B:` form.
   - `H:` patterns are anchored at the header line start (`h:server: …`).
   - `B:` patterns match their marker *anywhere* in the (bounded) body.
   - Put markers that only appear during an **active block** in `challenge[]`;
     put mere-presence markers (CDN headers) in `signals[]`.
2. Add a labeled fixture to `tests/antibot_accuracy.rs`.
3. Bump `CORPUS_VERSION`.
4. `cargo test -p void_crawl_core antibot` + `--test antibot_accuracy`.

## Vendor coverage

WAF/CDN: Cloudflare, Akamai, Imperva/Incapsula, AWS WAF, F5 BigIP, Sucuri,
CloudFront. Bot detection: DataDome, PerimeterX/HUMAN, Kasada. Challenge
widgets: reCAPTCHA, hCaptcha (Turnstile is folded into the `cloudflare` vendor).