
🕸️ crawlex

The stealth crawler that actually looks like Chrome.

TLS, HTTP/2, JS fingerprint: every byte indistinguishable from real Chrome 149. Rust core • Node SDK • Lua hooks • cross-platform binaries.


pnpm add -g crawlex && crawlex pages run --seed https://example.com --method render

Quickstart · Features · Examples · Docs · Why crawlex


⚡ Why crawlex

Standard crawlers fail on the first Cloudflare wall. crawlex arrives the way real Chrome arrives: every fingerprint surface is identical, not approximated.

→ Validated against BrowserScan, CreepJS, Sannysoft, tls.peet.ws, ja4db.com.


🚀 Install

# npm - bundled binary download via postinstall
pnpm add -g crawlex

# Rust โ€” from source
cargo install crawlex

# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)
# https://github.com/forattini-dev/crawlex/releases/latest

โš ๏ธ Production crawls run locally, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.


๐Ÿƒ Quickstart

# Stealth render with persona, sitemap discovery, NDJSON event stream
crawlex pages run \
  --seed https://target.com \
  --method render \
  --persona atlas \
  --max-depth 3 \
  --screenshot \
  --emit ndjson > events.ndjson

# Live tail what just happened
jq -c 'select(.event == "fetch.completed" or .event == "render.completed")' events.ndjson

Three integration paths, your pick:

crawlex pages run \
  --seed https://... \
  --method render \
  --persona pixel \
  --emit ndjson

One-shot crawls, scripted pipelines.

import { crawl, defineHooks } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://...'],
  args: { method: 'render' },
})) { ... }

Production services with hook logic.

use crawlex::{Crawler, Config};
let crawler = Crawler::new(
    Config::builder().build()?
)?;
crawler.run().await?;

In-process embedding, zero IPC.


🎨 Examples

1. Hunt a SaaS product page with vitals + screenshot

import { crawl } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://stripe.com/pricing'],
  args: {
    method: 'render',
    persona: 'atlas',                 // macOS Apple M1, Retina, en-US
    screenshot: true,
    screenshotMode: 'fullpage',
    storage: 'filesystem',
    storagePath: './out',
    waitStrategy: '{"NetworkIdle":{"idle_ms":1500}}',
  },
})) {
  if (!('event' in ev)) continue;
  switch (ev.event) {
    case 'render.completed':
      console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);
      break;
    case 'artifact.saved':
      if (ev.data.kind === 'screenshot.full_page')
        console.log(`📸 → out/${ev.data.path}  (${(ev.data.size/1024).toFixed(0)}kB)`);
      break;
    case 'challenge.detected':
      console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);
      break;
  }
}

2. Crawl an entire domain with proxy rotation + retry policy

import { crawl, defineHooks } from 'crawlex';

const hooks = defineHooks({
  // Rate-limit retry: 429/503 → re-enqueue (up to retry_max)
  async onAfterFirstByte(ctx) {
    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
    return 'continue';
  },
  // Inject the canonical sitemap.xml for every host we touch
  async onDiscovery(ctx) {
    const host = new URL(ctx.url).host;
    return {
      decision: 'continue',
      patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },
    };
  },
  // Tag the crawl with custom metadata that lands in user_data
  async onJobStart(ctx) {
    return {
      decision: 'continue',
      patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },
    };
  },
});

for await (const ev of crawl({
  seeds: ['https://target.com'],
  args: {
    method: 'auto',                   // policy engine picks http vs render
    maxConcurrentHttp: 8,
    maxConcurrentRender: 2,
    maxDepth: 5,
    crtsh: true,                      // certificate-transparency seeding
    storage: 'sqlite',
    storagePath: './crawl.db',
    queue: 'sqlite',
    queuePath: './crawl.db',
    proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],
    proxyStrategy: 'health-weighted',
    proxyStickyPerHost: true,
  },
  hooks,
  signal: AbortSignal.timeout(30 * 60_000),
})) {
  if (!('event' in ev)) continue;
  if (ev.event === 'job.failed') console.error(`✗ ${ev.url} - ${ev.data.error}`);
  if (ev.event === 'run.completed') console.log('done.');
}

3. Embedded library with custom Rust hooks

use crawlex::{Config, Crawler, queue::FetchMethod};
use crawlex::hooks::{HookDecision, HookRegistry};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> crawlex::Result<()> {
    let hooks = HookRegistry::new();
    let pages_seen = Arc::new(AtomicUsize::new(0));

    // Closure-captured counter - observe without intervening
    let counter = pages_seen.clone();
    hooks.on_response_body(move |_ctx| {
        let c = counter.clone();
        Box::pin(async move {
            c.fetch_add(1, Ordering::Relaxed);
            Ok(HookDecision::Continue)
        })
    });

    // Path-level deny list - short-circuit before fetch
    hooks.on_before_each_request(|ctx| {
        let url = ctx.url.clone();
        Box::pin(async move {
            if url.path().starts_with("/admin/") { return Ok(HookDecision::Skip); }
            Ok(HookDecision::Continue)
        })
    });

    let config = Config::builder()
        .max_concurrent_http(16)
        .build()?;

    let crawler = Crawler::new(config)?.with_hooks(hooks);
    crawler.seed_with(
        vec!["https://target.com".parse().unwrap()],
        FetchMethod::HttpSpoof,
    ).await?;
    crawler.run().await?;

    println!("Crawled {} pages", pages_seen.load(Ordering::Relaxed));
    Ok(())
}

→ Full runnable example: examples/embedded_with_hooks.rs

4. Pin a specific browser fingerprint from the catalog

# Browse 80+ ready-to-use fingerprints
crawlex stealth catalog list
crawlex stealth catalog list --filter chrome
crawlex stealth catalog show chrome-149-linux

# Pin a precise version + OS
crawlex pages run --seed https://target.com \
  --profile chrome-149-linux

# Era fallback: if chromium-122 was never captured, crawlex falls back to the closest era and warns
crawlex pages run --seed https://target.com \
  --profile chromium-122-linux

# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://target.com \
  --method render --persona pixel

5. Inspect what your stealth stack actually emits

# Print active IdentityBundle + TLS profile summary
crawlex stealth inspect --profile chrome-149-linux

# Verify ALPN/cipher/JA4 against built-in expectations
crawlex stealth test

# Compare against tls.peet.ws / ja4db.com via the live oracle
crawlex stealth catalog show chrome-149-linux --json

🎯 Features

🥷 Stealth core

  • ๐Ÿ” Chrome 149 TLS via BoringSSL fork
  • ๐Ÿšฆ H2 pseudo-header order patch
  • ๐ŸŽญ 29-section JS shim โ€” full leak inventory covered
  • ๐Ÿค– Worker scope shim (dedicated / shared / SW)
  • ๐Ÿ“ฆ 80+ browser fingerprints from curl-impersonate + ja4db + tls.peet
  • ๐ŸŒ 5 personas: tux, office, gamer, atlas, pixel
  • ๐ŸŽฌ Coherent motion:: profiles (mouse / scroll / dwell)
  • ๐Ÿ•ธ๏ธ WebRTC scrub (SDP, ICE, getStats โ€” public-interface only)

๐Ÿ” Discovery

  • ๐Ÿ—บ๏ธ Sitemap recursion + robots.txt parsing
  • ๐Ÿ”Ž Certificate transparency (crt.sh)
  • ๐ŸŒ DNS records + RDAP + Wayback CDX
  • ๐Ÿ“œ PWA manifest + service worker probes
  • ๐Ÿ“‚ .well-known/* enumeration
  • ๐Ÿ”ฌ Tech fingerprinting (Wappalyzer-class)
  • ๐Ÿ”Œ JS endpoint extraction from runtime
  • ๐Ÿ›ก๏ธ security.txt parser
  • ๐Ÿงฌ Asset-ref classification (JS / CSS / image / API / nav)
  • ๐Ÿ”“ TCP port scan (opt-in, network-active)

๐Ÿ›ก๏ธ Antibot policy engine

  • ๐Ÿšง Detect: Cloudflare, DataDome, PerimeterX, Akamai BMP, Imperva, hCaptcha, reCAPTCHA, Turnstile
  • ๐Ÿ“Š Vendor telemetry observer (passive โ€” sees outbound calls to known endpoints)
  • ๐Ÿ”„ Policy decisions: keep / drop / retry / scope-demote / proxy-rotate / give-up
  • ๐ŸŽฏ 4 captcha solver adapters: in-house reCAPTCHA v3, 2captcha, anticaptcha, VLM

โš™๏ธ Pipeline

  • ๐ŸŽฏ Render pool โ€” Chromium auto-fetch + isolated user-data dirs
  • ๐Ÿ” Persistent queue: in-memory / SQLite / Redis backends
  • ๐Ÿ’พ Storage: filesystem / SQLite / memory โ€” opt-in per concern (artifact, state, challenge, telemetry, intel)
  • ๐Ÿ”„ Proxy rotator โ€” health checks + sticky sessions + per-host affinity
  • ๐Ÿ“Š Web Vitals + per-fetch network breakdown (DNS / TCP / TLS / TTFB / download)
  • ๐ŸŽฌ ScriptSpec runner โ€” declarative Plan execution with assertions
  • ๐Ÿ”ง Frontier with dedupe + rate-limit + retry policies
  • ๐Ÿ“ Wait strategies: Load, DOMContentLoaded, NetworkIdle, Selector, Fixed

📡 Observability

  • ๐Ÿ“œ NDJSON event stream โ€” versioned envelope (v: 1)
  • ๐ŸŽฌ 19 event kinds covering full lifecycle
  • ๐Ÿ”ฌ Embedded WebVitals summary on render.completed
  • โฑ๏ธ Per-request timings on fetch.completed (ALPN, cipher, TLS version)
  • ๐Ÿ“ธ Artifact descriptors with on-disk path on the wire
  • ๐Ÿช Hooks: 12 lifecycle points ร— 3 languages (Rust / JS / Lua)
  • ๐Ÿ“Š Prometheus metrics endpoint

🔌 Integrations

  • ๐Ÿ“ฆ npm + crates.io + GitHub Releases
  • ๐Ÿฆ€ Rust library โ€” embed Crawler directly
  • ๐Ÿ“˜ TypeScript types โ€” strict, full envelope coverage
  • ๐Ÿ”Œ SDK crawl() async iterator
  • ๐Ÿ“š docsify docs site (GitHub Pages)
  • ๐Ÿงช 386+ lib tests, 27 fpjs compliance, TLS catalog roundtrip suite
  • ๐Ÿ” Optional Lua hooks (mlua)
  • ๐Ÿชถ Two binaries: crawlex (full) + crawlex-mini (HTTP-only, no Chromium)

📡 NDJSON event stream

Every run emits one JSON envelope per line on stdout. Versioned, stable, 19 kinds:

{"v":1,"event":"run.started","ts":"2026-04-26T19:42:00.000Z","run_id":42,"data":{"policy_profile":"strict","max_concurrent_http":8,"max_concurrent_render":2}}
{"v":1,"event":"job.started","run_id":42,"url":"https://target.com/","data":{"job_id":"j_001","method":"render","depth":0,"priority":0,"attempts":0}}
{"v":1,"event":"fetch.completed","run_id":42,"url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"bytes":98234,"body_truncated":false,"dns_ms":12,"tcp_connect_ms":18,"tls_handshake_ms":24,"ttfb_ms":142,"download_ms":83,"total_ms":280,"alpn":"h2","tls_version":"TLSv1.3","cipher":"TLS_AES_128_GCM_SHA256"}}
{"v":1,"event":"render.completed","run_id":42,"session_id":"sess_abc","url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"manifest":true,"service_workers":1,"is_spa":true,"vitals":{"ttfb_ms":142,"first_contentful_paint_ms":380.5,"largest_contentful_paint_ms":920.1,"cumulative_layout_shift":0.03,"total_blocking_time_ms":50.0,"dom_nodes":1842,"js_heap_used_bytes":12345678,"resource_count":45,"total_transfer_bytes":982341}}}
{"v":1,"event":"artifact.saved","run_id":42,"url":"https://target.com/","data":{"kind":"screenshot.full_page","mime":"image/png","size":1234567,"sha256":"a1b2c3...","path":"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png"}}
{"v":1,"event":"challenge.detected","run_id":42,"url":"https://protected.com/","data":{"vendor":"cloudflare_turnstile","level":"widget_present"}}
{"v":1,"event":"decision.made","run_id":42,"url":"https://protected.com/","why":"render:js-challenge","data":{"decision":"retry","reason":{"code":"render:js-challenge"}}}
{"v":1,"event":"run.completed","run_id":42}

Discriminator key: event (snake_case); TypeScript narrows via switch (ev.event) { … }. Fallback for malformed lines: { kind: 'raw', line } so consumers can log/recover.
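
A minimal stdin consumer sketch, assuming nothing beyond the envelope fields above and that raw fallback:

import * as readline from 'node:readline';

// One envelope per line; anything unparsable degrades to { kind: 'raw', line }.
type Envelope = { v: number; event: string; url?: string; data?: any };
type Parsed = Envelope | { kind: 'raw'; line: string };

function parseLine(line: string): Parsed {
  try {
    const ev = JSON.parse(line);
    if (ev && typeof ev.event === 'string') return ev as Envelope;
  } catch {
    // fall through to the raw fallback
  }
  return { kind: 'raw', line };
}

const rl = readline.createInterface({ input: process.stdin });
rl.on('line', (line) => {
  const ev = parseLine(line);
  if ('kind' in ev) {
    console.error('unparsed:', ev.line);
    return;
  }
  switch (ev.event) {
    case 'fetch.completed':
      console.log(`${ev.url} -> ${ev.data.status} in ${ev.data.total_ms}ms`);
      break;
    case 'run.completed':
      console.log('run finished');
      break;
  }
});

Pipe a run into it (the file name is yours to pick): crawlex pages run --seed https://target.com --emit ndjson | npx tsx consume.ts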


๐Ÿช Hooks โ€” 12 lifecycle points ร— 3 languages

before_each_request → after_dns → after_tls → after_first_byte → on_response_body
   → after_load → after_idle → on_discovery → on_job_start → on_job_end
   → on_error → on_robots_decision
Language | API | Best for
Rust | hooks.on_after_first_byte(closure) - full &mut HookContext access | Embedded library, latency-critical paths
JS / TS | defineHooks({...}) via SDK - IPC bridge, async closures | Production crawls, business logic
Lua | --hook-script foo.lua - page-driving helpers (page_click, page_eval) | Ad-hoc scripts, no build step

All three modes return the same decision: continue / skip / retry / abort. Hooks can mutate ctx.captured_urls, inject extra URLs, write to user_data to communicate with downstream hooks, or override robots_allowed.
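
A sketch of those mutations through the JS API, following the camelCase names from Example 2. The onRobotsDecision hook name and the robotsAllowed patch key are assumptions inferred from the on_robots_decision lifecycle point and the robots_allowed field; verify against the SDK types before relying on them:

import { defineHooks } from 'crawlex';

const hooks = defineHooks({
  // Leave a breadcrumb in user_data for downstream hooks.
  async onJobStart(ctx) {
    return {
      decision: 'continue',
      patch: { userData: { ...ctx.user_data, stage: 'frontier' } },
    };
  },
  // Inject an extra URL into the captured set during discovery.
  async onDiscovery(ctx) {
    return {
      decision: 'continue',
      patch: { capturedUrls: [...ctx.captured_urls, 'https://target.com/changelog'] },
    };
  },
  // ASSUMED hook name and patch key - override robots for a host you own.
  async onRobotsDecision(ctx) {
    if (new URL(ctx.url).host === 'staging.target.com') {
      return { decision: 'continue', patch: { robotsAllowed: true } };
    }
    return 'continue';
  },
});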


🎭 Personas: coherent identity bundles

Each persona is a complete bundle (UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings), so every signal matches. No mismatched UA + WebGL combo gives you away.

Codename | OS | GPU | Locale | Form factor
🐧 tux | Linux | Intel UHD 630 | en-US | desktop 1920×1080
🏢 office | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25)
🎮 gamer | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080
🍎 atlas | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0)
📱 pixel | Android 14 | Adreno 640 | pt-BR | mobile 412×823 (DPR 2.625)

crawlex pages run --seed https://target.com --persona atlas    # macOS
crawlex pages run --seed https://target.com --persona pixel    # mobile
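
The same codenames work from the SDK; a sketch reusing the persona argument from Example 1:

import { crawl } from 'crawlex';

// Render the same page once per persona to compare how each identity is served.
for (const persona of ['tux', 'office', 'gamer', 'atlas', 'pixel']) {
  for await (const ev of crawl({
    seeds: ['https://target.com'],
    args: { method: 'render', persona },
  })) {
    if ('event' in ev && ev.event === 'render.completed') {
      console.log(`${persona}: ${ev.url} -> ${ev.data.status}`);
    }
  }
}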

๐Ÿ—๏ธ Architecture

flowchart LR
  S[Seeds] --> Q[Frontier<br/>+ dedupe + rate-limit]
  Q --> P[Policy Engine]
  P -->|http| F[ImpersonateClient<br/>BoringSSL + h2 patched]
  P -->|render| R[RenderPool<br/>Chromium + stealth shim]
  F --> X[Extractor<br/>+ Asset Refs]
  R --> X
  X --> D[Discovery<br/>Pipeline]
  X --> ST[Storage<br/>5 traits]
  D --> Q
  P --> EV[NDJSON Events<br/>19 kinds]
  R --> H1[Rust Hooks]
  R --> H2[JS Bridge]
  R --> H3[Lua Scripts]

Module map:

  • impersonate/ โ€” TLS catalog + BoringSSL connector + ALPS + GREASE
  • render/ โ€” Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner
  • discovery/ โ€” 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, โ€ฆ)
  • policy/ โ€” pure engine: decide_pre_fetch, decide_post_fetch, decide_post_error, decide_post_challenge
  • antibot/ โ€” vendor classifier + 4 captcha solver adapters
  • storage/ โ€” 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)
  • events/ โ€” NDJSON envelope + sink (stdout / null / memory)
  • hooks/ โ€” registry + JS bridge + Lua host

๐Ÿ› ๏ธ Tech stack

Layer | Implementation
TLS | boring-sys - BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768
HTTP/2 | vendored h2 crate with pseudo-header order patch (vendor/h2)
CDP | chromiumoxide-derived, embedded behind the cdp-backend feature
Async | tokio multi-thread
Storage | rusqlite (SQLite WAL), DashMap (memory), filesystem layout
Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt)
Lua | mlua 0.10 (optional, lua-hooks feature)
SDK | Node 20+, CommonJS, zero runtime deps

Two binaries ship from one source tree:

  • crawlex โ€” full build with HTTP impersonation + Chromium rendering + stealth shim + persistent queue
  • crawlex-mini โ€” HTTP-only worker, no Chromium dependency, same CLI surface (browser-only flags return Error::RenderDisabled)

📊 Versus the alternatives

Feature | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate
TLS-perfect ClientHello | ✅ BoringSSL | ⚠️ relies on Chromium | ⚠️ relies on Chromium | ✅
H2 pseudo-header order | ✅ patched h2 | ⚠️ Chromium default | ⚠️ Chromium default | ❌
29-section JS leak coverage | ✅ | ⚠️ partial | ⚠️ via plugins | ❌ no JS
Worker-scope stealth | ✅ auto-attach | ⚠️ manual | ⚠️ manual | ❌
HTTP-only path (no browser) | ✅ crawlex-mini | ❌ | ❌ | ✅
Persistent queue + resume | ✅ SQLite/Redis | ❌ external | ❌ external | ❌
Discovery pipeline | ✅ 17 stages | ❌ | ❌ | ❌
Streaming NDJSON events | ✅ versioned | ❌ | ❌ | ❌
Rust embedding | ✅ | ❌ | ❌ | ⚠️ libcurl
Single binary | ✅ | ❌ | ❌ | ✅


๐Ÿค Contributing

git clone https://github.com/forattini-dev/crawlex
cd crawlex

# Unit tests + offline shim compliance
cargo test --lib                    # 386+ tests
cargo test --test fpjs_compliance   # 27 cases
cargo test --test tls_catalog_coverage --test tls_catalog_roundtrip

# SDK tests
pnpm test                           # 21 node:test cases

# Quality gates
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo publish --dry-run --locked

# Live integration tests (require system Chromium)
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored

CI runs all of the above on every PR. Contributions welcome: issues, feature requests, and PRs are all reviewed.


📄 License

Dual-licensed under MIT OR Apache-2.0 at your option. SPDX: MIT OR Apache-2.0.

Third-party attribution: see NOTICE.


Built for crawlers who refuse to be detected.

Docs · Releases · Issues · Discussions