crawlex 1.0.0

Stealth crawler with Chrome-perfect TLS/H2 fingerprint, render pool, hooks, persistent queue
Documentation

crawlex

CI crates.io npm License: MIT OR Apache-2.0

Stealth crawler with Chrome-perfect TLS / HTTP-2 fingerprint, render pool, hooks, and persistent queue. Written in Rust.

crawlex ships two binaries:

  • crawlex — full build with HTTP impersonation, headless Chromium rendering, stealth shim, and persistent queue.
  • crawlex-mini — HTTP-only worker (no Chromium dependency, same CLI).

It's designed to crawl real production targets that block typical headless crawlers — Cloudflare, DataDome, PerimeterX, Akamai BMP — by matching the TLS extension order, HTTP/2 SETTINGS frame, pseudo-header order, and JS fingerprint surface of stock desktop Chrome.


Install

Cargo

cargo install crawlex

⚠️ The [patch.crates-io] H2 fork is not carried by cargo install — the binary you get from crates.io has TLS-perfect fingerprint but uses upstream h2 = 0.4 pseudo-header order. For full Chrome-149 H2 match, use the GitHub Release binaries below or build from source.

npm (Node wrapper)

pnpm add -g crawlex
# or: npm i -g crawlex

The npm package includes a postinstall step that downloads the prebuilt binary for your platform from the GitHub Release.

Pre-built binary

Linux x86_64, Linux aarch64, macOS x86_64, macOS aarch64, Windows x86_64 binaries (with checksums) are attached to each GitHub Release.

Build from source

git clone --recurse-submodules https://github.com/forattini-dev/crawlex
cd crawlex
cargo build --release --features cli,sqlite,cdp-backend
./target/release/crawlex --help

Requires Rust 1.91 or newer.


Quickstart

# Crawl a single page (HTTP only, no browser).
crawlex pages https://example.com

# Crawl with browser rendering, stealth shim, persistent SQLite queue.
crawlex crawl https://example.com \
  --max-depth 3 \
  --concurrency 4 \
  --queue sqlite:///tmp/crawlex.db \
  --render

# Inspect the active TLS / HTTP-2 fingerprint that crawlex sends.
crawlex fingerprint

# Re-render an existing crawl and show the fingerprint stack used.
crawlex stealth probe https://nowsecure.nl

See docs/ for the full reference (CLI flags, configuration, hooks, events, queue backends, proxy rotation, identity bundles, render pool tuning).


What's in the box

Concern What we cover
TLS Chrome 149 ClientHello — extensions ordered like Chrome, BoringSSL ALPS, permute_extensions for ChaCha vs AES, GREASE values
HTTP/2 Pseudo-header order (method, authority, scheme, path) — Akamai BMP signature match. SETTINGS frame parameters in Chrome's order
JS fingerprint 29-section stealth shim: navigator, chrome.*, permissions, plugins, screen, timezone, battery, connection, matchMedia, WebGL (vendor / params / extensions / isEnabled), canvas (zero-preserving noise), AudioContext (FFT + offline render), Function.prototype.toString, WebGPU, performance.memory, sensors, iframe, window geometry, requestAnimationFrame throttle, performance.now() 100µs grain + jitter, sample-rate pinning, mediaDevices coherent counts, speechSynthesis voices per OS, font list, worker concurrency cap, TextMetrics (HarfBuzz analogue), WebRTC SDP / ICE / getStats scrub
Worker scope Same shim auto-attached to dedicated / shared / service workers via CDP Target.setAutoAttach + Runtime.evaluate (Camoufox port)
Render pool Headless Chromium pool with isolated user-data dirs, persona pinning, optional Lua hooks, request interception
Queue In-memory, SQLite, or Redis persistent backends with DEDUPE + RATE-LIMIT + RETRY policies
Proxy Rotator with health-checks, sticky sessions, public-interface-only WebRTC policy
Discovery Sitemap, robots, certificate transparency, DNS, Wayback, security.txt, well-known, JS endpoint extraction, network probe

Real-world coverage

Last validated: see production-validation/summary.md.

Personas (mobile vs desktop)

Five named personas ship in the catalog, each one a coherent bundle of UA

  • Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS fingerprint:
Codename OS GPU Locale Form factor
tux Linux Intel UHD 630 en-US desktop 1920×1080
office Windows 10 Intel UHD 620 en-US laptop 1920×1080 (DPR 1.25)
gamer Windows 10 NVIDIA GTX 1060 pt-BR desktop 1920×1080
atlas macOS Apple M1 en-US retina 1440×900 (DPR 2.0)
pixel Android 14 Adreno 640 pt-BR mobile 412×823 (DPR 2.625)

Pick via CLI:

# Mobile Pixel-class persona (touch viewport, mobile UA, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://example.com \
  --method render --persona pixel

# macOS Apple M1 (Retina DPR 2.0, en-US)
crawlex pages run --seed https://example.com \
  --method render --persona atlas

# Windows NVIDIA gaming desktop (pt-BR)
crawlex pages run --seed https://example.com --persona gamer

The legacy --identity-preset <N> numeric form still works, but the codename is recommended — it stays stable even if catalog ordering shifts.

Browser fingerprint catalog

crawlex ships a TLS fingerprint catalog covering the last 30 Chrome stable, 30 Chromium, and 20 Firefox majors plus Edge and Safari. Choose the persona via --profile <browser>-<major>-<os>:

crawlex pages run --seed https://example.com \
  --method spoof \
  --profile chrome-149-linux

crawlex pages run --seed https://example.com \
  --profile firefox-130-macos

crawlex pages run --seed https://example.com \
  --profile chromium-122-linux

The catalog is data-driven: curl-impersonate's MPL-2.0 vendored signatures live at references/curl-impersonate/tests/signatures/*.yaml, our own captures land at src/impersonate/catalog/captured/*.yaml, and public mining oracles (tls.peet.ws + ja4db.com) populate src/impersonate/catalog/mined/*.json. All are compiled into a static registry by build.rs at compile time.

Discover what's available:

# List every fingerprint registered in the catalog
crawlex stealth catalog list

# Filter by browser family
crawlex stealth catalog list --filter chrome

# Show full ClientHello breakdown for a profile
crawlex stealth catalog show chrome-149-linux
crawlex stealth catalog show firefox-130-macos --json

Era fallback: when an exact (browser, major, os) tuple isn't yet captured, the catalog falls back to the closest era's representative profile and emits a tracing::warn so operators know an approximation is in play. Chrome eras:

Era Majors Marker change
E1 98-99 Pre-permute_extensions
E2 100-103 permute_extensions enabled
E3a 104-110 post-quantum experimentation start
E3b 111-116 curl-impersonate frontier
E4 117-123 X25519Kyber768Draft00
E5 124-131 ALPS payload reformat
E6 132-141 MLKEM768 (Kyber removed)
E7 142+ ECH wider deployment

Firefox eras (NSS-based, separate connector path):

Era Majors Marker change
ESR 91 NSS 3.68 ESR baseline
F-A 92-95 TLS 1.3 default, Kyber off
F-B 96-100 encrypted_client_hello staged
F-C 101-108 NSS 3.79 base
F-D 109-116 session_ticket reordered
FF-A 117-119 NSS 3.79+ stabilised
FF-B 120-126 KyberSlash mitigation
FF-C 127-130 TLS Encrypted ClientHello experimental

Refreshing the catalog with new captures:

# Pull JA3/JA4 hashes from public databases (validation oracles)
node scripts/mine-fingerprints.mjs

# Bulk-capture every browser version (downloads + headless launch + emit YAML)
bash scripts/sync-tls-catalog.sh

# Subset capture
CHROME_MAJORS="148 149" bash scripts/sync-tls-catalog.sh

Captcha solving

crawlex ships prevention-first captcha handling — the policy engine prefers avoiding challenges over solving them. When a challenge does fire, four solver adapters are available:

Adapter Vendors Configuration
recaptcha-invisible reCAPTCHA v3 (vanilla) none — server-side, in-house
2captcha reCAPTCHA, hCaptcha, image CRAWLEX_SOLVER_2CAPTCHA_KEY
anticaptcha reCAPTCHA, hCaptcha, FunCaptcha CRAWLEX_SOLVER_ANTICAPTCHA_KEY
vlm hCaptcha, image puzzles CRAWLEX_SOLVER_VLM_PROVIDER + key

Pick via CLI:

crawlex pages run --seed https://example.com \
  --method render \
  --captcha-solver recaptcha-invisible

# Externals refuse to run until their API key env var is set —
# this is by design, no silent paid-API calls.
CRAWLEX_SOLVER_2CAPTCHA_KEY=abc123 crawlex pages run \
  --seed https://example.com --captcha-solver 2captcha

In-house reCAPTCHA v3 invisible solver

recaptcha-invisible is a port of h4ckf0r0day/reCaptchaV3-Invisible-Solver adapted to crawlex's identity stack. Key differences from the reference:

  • Identity coherence — UA, UA-CH brands, screen, timezone, canvas hash, WebGL strings flow from the active IdentityBundle instead of hardcoded Chrome 136 / Windows defaults. Eliminates the cross-check failure mode the reference suffers when the persona disagrees with the synthesised oz payload.
  • Server-side only — no headless browser launched per challenge. Pipeline: api.jsanchorreload → regex out the token → return as g-recaptcha-response.
  • Empirical scoring 0.3-0.9 — same range the reference reports. Use as a fallback for sites where a real browser path isn't viable (rate limit, bandwidth budget). Production use should still prefer a real Chrome render via --method render for tougher detectors.

The solver lives at src/antibot/recaptcha/:

recaptcha/
├── proto.rs       # Minimal protobuf encoder (no prost dep)
├── utils.rs       # base36, cb / co / scramble_oz primitives
├── oz.rs          # Build the `oz` JSON payload
├── telemetry.rs   # Synthesise field-74 client blob (mouse, scroll, perf)
├── solver.rs      # 3-hop pipeline + token extraction
└── adapter.rs     # CaptchaSolver trait impl

Limited to ChallengeVendor::Recaptcha. Not supported: reCAPTCHA Enterprise (anchor-with-action verification), hCaptcha, Cloudflare Turnstile, DataDome, PerimeterX — those are different protocols entirely and fall back to external adapters if configured.

Tuning render timeouts

Heavy real-world targets (Cloudflare-fronted SPAs, ad-laden landing pages) can exceed the default 30 s Page.navigate deadline. Two knobs:

crawlex pages run --seed https://www.example.com \
  --method render \
  --render-request-timeout-ms 60000 \
  --navigation-lifecycle domcontentloaded \
  --wait-strategy '{"NetworkIdle":{"idle_ms":1500}}'
  • --render-request-timeout-ms (or CRAWLEX_REQUEST_TIMEOUT_MS env) — CDP command + navigation watcher deadline. Default 30000.
  • --navigation-lifecycle (or CRAWLEX_NAVIGATION_LIFECYCLE env) — load (default) or domcontentloaded. The latter returns as soon as the parser finishes; useful when third-party trackers keep the load event perpetually pending.

Known limitations

  • Cloudflare Turnstile: Some challenge variants gate on solving an in-page CAPTCHA — crawlex does not solve CAPTCHAs.
  • robots.txt: Parsed (RFC-compliant via texting_robots) but not yet auto-enforced inside the crawler loop. Set --respect-robots to opt in, or check manually via crawlex robots https://....
  • cargo install crawlex: Pseudo-header order patch is not carried (see Install note above). Use GitHub Release binaries or cargo build from source for the full H2 fingerprint.
  • HTTP/3 + QUIC: Not yet supported. HTTP/2 over TLS only.

Development

# Clone with submodules (vendored h2 fork lives at vendor/h2).
git clone --recurse-submodules https://github.com/forattini-dev/crawlex
cd crawlex

# Run unit tests + the offline shim compliance suite.
cargo test --lib
cargo test --test fpjs_compliance

# Live tests (require system Chromium):
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored

# Format / lint:
cargo fmt
cargo clippy --all-features -- -D warnings

License

Dual-licensed under either of:

at your option. SPDX: MIT OR Apache-2.0.

Third-party attribution: see NOTICE.