crawlex

Stealth crawler with Chrome-perfect TLS / HTTP-2 fingerprint, render pool, hooks, and persistent queue. Written in Rust.

crawlex ships two binaries:

crawlex — full build with HTTP impersonation, headless Chromium rendering, stealth shim, and persistent queue.
crawlex-mini — HTTP-only worker (no Chromium dependency, same CLI).

It's designed to crawl real production targets that block typical headless crawlers — Cloudflare, DataDome, PerimeterX, Akamai BMP — by matching the TLS extension order, HTTP/2 SETTINGS frame, pseudo-header order, and JS fingerprint surface of stock desktop Chrome.

Install

Cargo

cargo install crawlex

⚠️ The [patch.crates-io] H2 fork is not carried by cargo install — the binary you get from crates.io has TLS-perfect fingerprint but uses upstream h2 = 0.4 pseudo-header order. For full Chrome-149 H2 match, use the GitHub Release binaries below or build from source.

npm (Node wrapper)

pnpm add -g crawlex
# or: npm i -g crawlex

The npm package includes a postinstall step that downloads the prebuilt binary for your platform from the GitHub Release.

Pre-built binary

Linux x86_64, Linux aarch64, macOS x86_64, macOS aarch64, Windows x86_64 binaries (with checksums) are attached to each GitHub Release.

Build from source

git clone --recurse-submodules https://github.com/forattini-dev/crawlex
cd crawlex
cargo build --release --features cli,sqlite,cdp-backend
./target/release/crawlex --help

Requires Rust 1.91 or newer.

Quickstart

# Crawl a single page (HTTP only, no browser).
crawlex pages https://example.com

# Crawl with browser rendering, stealth shim, persistent SQLite queue.
crawlex crawl https://example.com \
  --max-depth 3 \
  --concurrency 4 \
  --queue sqlite:///tmp/crawlex.db \
  --render

# Inspect the active TLS / HTTP-2 fingerprint that crawlex sends.
crawlex fingerprint

# Re-render an existing crawl and show the fingerprint stack used.
crawlex stealth probe https://nowsecure.nl

See docs/ for the full reference (CLI flags, configuration, hooks, events, queue backends, proxy rotation, identity bundles, render pool tuning).

What's in the box

Concern	What we cover
TLS	Chrome 149 ClientHello — extensions ordered like Chrome, BoringSSL ALPS, `permute_extensions` for ChaCha vs AES, GREASE values
HTTP/2	Pseudo-header order `(method, authority, scheme, path)` — Akamai BMP signature match. SETTINGS frame parameters in Chrome's order
JS fingerprint	29-section stealth shim: navigator, chrome.*, permissions, plugins, screen, timezone, battery, connection, matchMedia, WebGL (vendor / params / extensions / `isEnabled`), canvas (zero-preserving noise), AudioContext (FFT + offline render), `Function.prototype.toString`, WebGPU, `performance.memory`, sensors, iframe, window geometry, requestAnimationFrame throttle, `performance.now()` 100µs grain + jitter, sample-rate pinning, mediaDevices coherent counts, speechSynthesis voices per OS, font list, worker concurrency cap, TextMetrics (HarfBuzz analogue), WebRTC SDP / ICE / getStats scrub
Worker scope	Same shim auto-attached to dedicated / shared / service workers via CDP `Target.setAutoAttach` + `Runtime.evaluate` (Camoufox port)
Render pool	Headless Chromium pool with isolated user-data dirs, persona pinning, optional Lua hooks, request interception
Queue	In-memory, SQLite, or Redis persistent backends with DEDUPE + RATE-LIMIT + RETRY policies
Proxy	Rotator with health-checks, sticky sessions, public-interface-only WebRTC policy
Discovery	Sitemap, robots, certificate transparency, DNS, Wayback, security.txt, well-known, JS endpoint extraction, network probe

Real-world coverage

Last validated: see production-validation/summary.md.

Personas (mobile vs desktop)

Five named personas ship in the catalog, each one a coherent bundle of UA

Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS fingerprint:

Codename	OS	GPU	Locale	Form factor
`tux`	Linux	Intel UHD 630	en-US	desktop 1920×1080
`office`	Windows 10	Intel UHD 620	en-US	laptop 1920×1080 (DPR 1.25)
`gamer`	Windows 10	NVIDIA GTX 1060	pt-BR	desktop 1920×1080
`atlas`	macOS	Apple M1	en-US	retina 1440×900 (DPR 2.0)
`pixel`	Android 14	Adreno 640	pt-BR	mobile 412×823 (DPR 2.625)

Pick via CLI:

# Mobile Pixel-class persona (touch viewport, mobile UA, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://example.com \
  --method render --persona pixel

# macOS Apple M1 (Retina DPR 2.0, en-US)
crawlex pages run --seed https://example.com \
  --method render --persona atlas

# Windows NVIDIA gaming desktop (pt-BR)
crawlex pages run --seed https://example.com --persona gamer

The legacy --identity-preset <N> numeric form still works, but the codename is recommended — it stays stable even if catalog ordering shifts.

Browser fingerprint catalog

crawlex ships a TLS fingerprint catalog covering the last 30 Chrome stable, 30 Chromium, and 20 Firefox majors plus Edge and Safari. Choose the persona via --profile <browser>-<major>-<os>:

crawlex pages run --seed https://example.com \
  --method spoof \
  --profile chrome-149-linux

crawlex pages run --seed https://example.com \
  --profile firefox-130-macos

crawlex pages run --seed https://example.com \
  --profile chromium-122-linux

The catalog is data-driven: curl-impersonate's MPL-2.0 vendored signatures live at references/curl-impersonate/tests/signatures/*.yaml, our own captures land at src/impersonate/catalog/captured/*.yaml, and public mining oracles (tls.peet.ws + ja4db.com) populate src/impersonate/catalog/mined/*.json. All are compiled into a static registry by build.rs at compile time.

Discover what's available:

# List every fingerprint registered in the catalog
crawlex stealth catalog list

# Filter by browser family
crawlex stealth catalog list --filter chrome

# Show full ClientHello breakdown for a profile
crawlex stealth catalog show chrome-149-linux
crawlex stealth catalog show firefox-130-macos --json

Era fallback: when an exact (browser, major, os) tuple isn't yet captured, the catalog falls back to the closest era's representative profile and emits a tracing::warn so operators know an approximation is in play. Chrome eras:

Era	Majors	Marker change
E1	98-99	Pre-permute_extensions
E2	100-103	permute_extensions enabled
E3a	104-110	post-quantum experimentation start
E3b	111-116	curl-impersonate frontier
E4	117-123	X25519Kyber768Draft00
E5	124-131	ALPS payload reformat
E6	132-141	MLKEM768 (Kyber removed)
E7	142+	ECH wider deployment

Firefox eras (NSS-based, separate connector path):

Era	Majors	Marker change
ESR	91	NSS 3.68 ESR baseline
F-A	92-95	TLS 1.3 default, Kyber off
F-B	96-100	encrypted_client_hello staged
F-C	101-108	NSS 3.79 base
F-D	109-116	session_ticket reordered
FF-A	117-119	NSS 3.79+ stabilised
FF-B	120-126	KyberSlash mitigation
FF-C	127-130	TLS Encrypted ClientHello experimental

Refreshing the catalog with new captures:

# Pull JA3/JA4 hashes from public databases (validation oracles)
node scripts/mine-fingerprints.mjs

# Bulk-capture every browser version (downloads + headless launch + emit YAML)
bash scripts/sync-tls-catalog.sh

# Subset capture
CHROME_MAJORS="148 149" bash scripts/sync-tls-catalog.sh

Captcha solving

crawlex ships prevention-first captcha handling — the policy engine prefers avoiding challenges over solving them. When a challenge does fire, four solver adapters are available:

Adapter	Vendors	Configuration
`recaptcha-invisible`	reCAPTCHA v3 (vanilla)	none — server-side, in-house
`2captcha`	reCAPTCHA, hCaptcha, image	`CRAWLEX_SOLVER_2CAPTCHA_KEY`
`anticaptcha`	reCAPTCHA, hCaptcha, FunCaptcha	`CRAWLEX_SOLVER_ANTICAPTCHA_KEY`
`vlm`	hCaptcha, image puzzles	`CRAWLEX_SOLVER_VLM_PROVIDER` + key

Pick via CLI:

crawlex pages run --seed https://example.com \
  --method render \
  --captcha-solver recaptcha-invisible

# Externals refuse to run until their API key env var is set —
# this is by design, no silent paid-API calls.
CRAWLEX_SOLVER_2CAPTCHA_KEY=abc123 crawlex pages run \
  --seed https://example.com --captcha-solver 2captcha

In-house reCAPTCHA v3 invisible solver

recaptcha-invisible is a port of h4ckf0r0day/reCaptchaV3-Invisible-Solver adapted to crawlex's identity stack. Key differences from the reference:

Identity coherence — UA, UA-CH brands, screen, timezone, canvas hash, WebGL strings flow from the active IdentityBundle instead of hardcoded Chrome 136 / Windows defaults. Eliminates the cross-check failure mode the reference suffers when the persona disagrees with the synthesised oz payload.
Server-side only — no headless browser launched per challenge. Pipeline: api.js → anchor → reload → regex out the token → return as g-recaptcha-response.
Empirical scoring 0.3-0.9 — same range the reference reports. Use as a fallback for sites where a real browser path isn't viable (rate limit, bandwidth budget). Production use should still prefer a real Chrome render via --method render for tougher detectors.

The solver lives at src/antibot/recaptcha/:

recaptcha/
├── proto.rs       # Minimal protobuf encoder (no prost dep)
├── utils.rs       # base36, cb / co / scramble_oz primitives
├── oz.rs          # Build the `oz` JSON payload
├── telemetry.rs   # Synthesise field-74 client blob (mouse, scroll, perf)
├── solver.rs      # 3-hop pipeline + token extraction
└── adapter.rs     # CaptchaSolver trait impl

Limited to ChallengeVendor::Recaptcha. Not supported: reCAPTCHA Enterprise (anchor-with-action verification), hCaptcha, Cloudflare Turnstile, DataDome, PerimeterX — those are different protocols entirely and fall back to external adapters if configured.

Tuning render timeouts

Heavy real-world targets (Cloudflare-fronted SPAs, ad-laden landing pages) can exceed the default 30 s Page.navigate deadline. Two knobs:

crawlex pages run --seed https://www.example.com \
  --method render \
  --render-request-timeout-ms 60000 \
  --navigation-lifecycle domcontentloaded \
  --wait-strategy '{"NetworkIdle":{"idle_ms":1500}}'

--render-request-timeout-ms (or CRAWLEX_REQUEST_TIMEOUT_MS env) — CDP command + navigation watcher deadline. Default 30000.
--navigation-lifecycle (or CRAWLEX_NAVIGATION_LIFECYCLE env) — load (default) or domcontentloaded. The latter returns as soon as the parser finishes; useful when third-party trackers keep the load event perpetually pending.

Known limitations

Cloudflare Turnstile: Some challenge variants gate on solving an in-page CAPTCHA — crawlex does not solve CAPTCHAs.
robots.txt: Parsed (RFC-compliant via texting_robots) but not yet auto-enforced inside the crawler loop. Set --respect-robots to opt in, or check manually via crawlex robots https://....
cargo install crawlex: Pseudo-header order patch is not carried (see Install note above). Use GitHub Release binaries or cargo build from source for the full H2 fingerprint.
HTTP/3 + QUIC: Not yet supported. HTTP/2 over TLS only.

Development

# Clone with submodules (vendored h2 fork lives at vendor/h2).
git clone --recurse-submodules https://github.com/forattini-dev/crawlex
cd crawlex

# Run unit tests + the offline shim compliance suite.
cargo test --lib
cargo test --test fpjs_compliance

# Live tests (require system Chromium):
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored

# Format / lint:
cargo fmt
cargo clippy --all-features -- -D warnings

License

Dual-licensed under either of:

at your option. SPDX: MIT OR Apache-2.0.

Third-party attribution: see NOTICE.

crawlex 1.0.0