crawlex 1.0.4

Stealth crawler with Chrome-perfect TLS/H2 fingerprint, render pool, hooks, persistent queue
<div align="center">

# ๐Ÿ•ธ๏ธ crawlex

### **The stealth crawler that actually looks like Chrome.**

TLS, HTTP/2, JS fingerprint — every byte indistinguishable from real Chrome 149.<br>
Rust core • Node SDK • Lua hooks • cross-platform binaries.

[![CI](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml/badge.svg)](https://github.com/forattini-dev/crawlex/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/crawlex.svg?logo=rust)](https://crates.io/crates/crawlex)
[![npm](https://img.shields.io/npm/v/crawlex.svg?logo=npm)](https://www.npmjs.com/package/crawlex)
[![docs](https://img.shields.io/badge/docs-docsify-success.svg)](https://forattini-dev.github.io/crawlex/)
[![downloads](https://img.shields.io/crates/d/crawlex.svg)](https://crates.io/crates/crawlex)
[![license](https://img.shields.io/badge/license-MIT%20%7C%20Apache--2.0-blue.svg)](#license)

```bash
pnpm add -g crawlex && crawlex pages run --seed https://example.com --method render
```

[**Quickstart**](#-quickstart) · [**Features**](#-features) · [**Examples**](#-examples) · [**Docs**](https://forattini-dev.github.io/crawlex/) · [**Why crawlex**](#-why-crawlex)

</div>

---

## ⚡ Why crawlex

Standard crawlers fail on the first Cloudflare wall. `crawlex` arrives the way **real Chrome** arrives — every fingerprint surface is identical, not approximated.

<table>
<tr><th>Layer</th><th>What we match — exactly, not approximately</th></tr>
<tr><td>🔐 <strong>TLS ClientHello</strong></td><td>Extension order, ALPS, GREASE values, <code>permute_extensions</code>, X25519MLKEM768, signature algorithms — verified against <a href="https://tls.peet.ws">tls.peet.ws</a> and <a href="https://ja4db.com">ja4db.com</a> oracles</td></tr>
<tr><td>🚦 <strong>HTTP/2 frame</strong></td><td>Pseudo-header order <code>:method :authority :scheme :path</code>, SETTINGS frame parameters, WINDOW_UPDATE pattern — passes Akamai BMP signature checks</td></tr>
<tr><td>🎭 <strong>JS fingerprint</strong></td><td>29-section stealth shim: <code>navigator</code>, <code>chrome.*</code>, permissions, plugins, screen, timezone, battery, WebGL (vendor / params / extensions), canvas (zero-preserving noise), AudioContext (FFT + offline render), <code>Function.prototype.toString</code> proxy, WebGPU, <code>performance.memory</code>, sensors, iframe, requestAnimationFrame throttle, <code>performance.now()</code> 100µs grain, mediaDevices, fonts, WebRTC SDP/ICE/getStats scrub</td></tr>
<tr><td>🤖 <strong>Behavior</strong></td><td>Mouse jitter, scroll cadence, dwell time, idle drift — coherent <code>motion::</code> profiles per persona</td></tr>
<tr><td>📦 <strong>Catalog</strong></td><td>30 Chrome stable × 30 Chromium × 20 Firefox × Edge × Safari fingerprints. Era-fallback resolution: ask for <code>chrome-149-linux</code>, get the closest captured profile</td></tr>
<tr><td>🛠️ <strong>Worker scope</strong></td><td>Same shim auto-attached to dedicated / shared / service workers via CDP <code>Target.setAutoAttach</code> — Camoufox port</td></tr>
</table>

→ Validated against [BrowserScan](https://browserscan.net), [CreepJS](https://abrahamjuliot.github.io/creepjs/), [Sannysoft](https://bot.sannysoft.com/), [tls.peet.ws](https://tls.peet.ws), [ja4db.com](https://ja4db.com).

---

## 🚀 Install

```bash
# npm — bundled binary download via postinstall
pnpm add -g crawlex

# Rust — from source
cargo install crawlex

# Direct binary (linux x86_64/arm64, macOS x86_64/arm64, windows x86_64)
# https://github.com/forattini-dev/crawlex/releases/latest
```

> ⚠️ **Production crawls run locally**, never in CI. Datacenter IPs (GitHub Actions, AWS, Azure) are flagged instantly by every modern WAF.

---

## ๐Ÿƒ Quickstart

```bash
# Stealth render with persona, sitemap discovery, NDJSON event stream
crawlex pages run \
  --seed https://target.com \
  --method render \
  --persona atlas \
  --max-depth 3 \
  --screenshot \
  --emit ndjson > events.ndjson

# Filter the completed fetch/render events out of the stream
jq -c 'select(.event == "fetch.completed" or .event == "render.completed")' events.ndjson
```

Three integration paths, your pick:

<table>
<tr><th>CLI</th><th>Node SDK</th><th>Embedded Rust</th></tr>
<tr><td>

```bash
crawlex pages run \
  --seed https://... \
  --method render \
  --persona pixel \
  --emit ndjson
```

One-shot crawls, scripted pipelines.

</td><td>

```ts
import { crawl, defineHooks } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://...'],
  args: { method: 'render' },
})) { ... }
```

Production services with hook logic.

</td><td>

```rust
use crawlex::{Crawler, Config};
let crawler = Crawler::new(
    Config::builder().build()?
)?;
crawler.run().await?;
```

In-process embedding, zero IPC.

</td></tr>
</table>

---

## 🎨 Examples

### 1. Hunt a SaaS product page with vitals + screenshot

```ts
import { crawl } from 'crawlex';

for await (const ev of crawl({
  seeds: ['https://stripe.com/pricing'],
  args: {
    method: 'render',
    persona: 'atlas',                 // macOS Apple M1, Retina, en-US
    screenshot: true,
    screenshotMode: 'fullpage',
    storage: 'filesystem',
    storagePath: './out',
    waitStrategy: '{"NetworkIdle":{"idle_ms":1500}}',
  },
})) {
  if (!('event' in ev)) continue;
  switch (ev.event) {
    case 'render.completed':
      console.log(`✅ ${ev.url} | LCP=${ev.data.vitals.largest_contentful_paint_ms}ms | CLS=${ev.data.vitals.cumulative_layout_shift}`);
      break;
    case 'artifact.saved':
      if (ev.data.kind === 'screenshot.full_page')
        console.log(`📸 → out/${ev.data.path}  (${(ev.data.size/1024).toFixed(0)}kB)`);
      break;
    case 'challenge.detected':
      console.log(`🚧 ${ev.data.vendor} (${ev.data.level}) on ${ev.url}`);
      break;
  }
}
```

### 2. Crawl an entire domain with proxy rotation + retry policy

```ts
import { crawl, defineHooks } from 'crawlex';

const hooks = defineHooks({
  // Rate-limit retry: 429/503 โ†’ re-enqueue (up to retry_max)
  async onAfterFirstByte(ctx) {
    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
    return 'continue';
  },
  // Inject the canonical sitemap.xml for every host we touch
  async onDiscovery(ctx) {
    const host = new URL(ctx.url).host;
    return {
      decision: 'continue',
      patch: { capturedUrls: [...ctx.captured_urls, `https://${host}/sitemap.xml`] },
    };
  },
  // Tag the crawl with custom metadata that lands in user_data
  async onJobStart(ctx) {
    return {
      decision: 'continue',
      patch: { userData: { ...ctx.user_data, run_owner: 'qa-bot' } },
    };
  },
});

for await (const ev of crawl({
  seeds: ['https://target.com'],
  args: {
    method: 'auto',                   // policy engine picks http vs render
    maxConcurrentHttp: 8,
    maxConcurrentRender: 2,
    maxDepth: 5,
    crtsh: true,                      // certificate-transparency seeding
    storage: 'sqlite',
    storagePath: './crawl.db',
    queue: 'sqlite',
    queuePath: './crawl.db',
    proxies: ['http://user:pass@proxy1:8080', 'http://user:pass@proxy2:8080'],
    proxyStrategy: 'health-weighted',
    proxyStickyPerHost: true,
  },
  hooks,
  signal: AbortSignal.timeout(30 * 60_000),
})) {
  if (!('event' in ev)) continue;
  if (ev.event === 'job.failed') console.error(`✗ ${ev.url} — ${ev.data.error}`);
  if (ev.event === 'run.completed') console.log('done.');
}
```

### 3. Embedded library with custom Rust hooks

```rust
use crawlex::{Config, Crawler, queue::FetchMethod};
use crawlex::hooks::{HookDecision, HookRegistry};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[tokio::main]
async fn main() -> crawlex::Result<()> {
    let hooks = HookRegistry::new();
    let pages_seen = Arc::new(AtomicUsize::new(0));

    // Closure-captured counter — observe without intervening
    let counter = pages_seen.clone();
    hooks.on_response_body(move |_ctx| {
        let c = counter.clone();
        Box::pin(async move {
            c.fetch_add(1, Ordering::Relaxed);
            Ok(HookDecision::Continue)
        })
    });

    // Domain-level deny list — short-circuit before fetch
    hooks.on_before_each_request(|ctx| {
        let url = ctx.url.clone();
        Box::pin(async move {
            if url.path().starts_with("/admin/") { return Ok(HookDecision::Skip); }
            Ok(HookDecision::Continue)
        })
    });

    let config = Config::builder()
        .max_concurrent_http(16)
        .build()?;

    let crawler = Crawler::new(config)?.with_hooks(hooks);
    crawler.seed_with(
        vec!["https://target.com".parse().unwrap()],
        FetchMethod::HttpSpoof,
    ).await?;
    crawler.run().await?;

    println!("Crawled {} pages", pages_seen.load(Ordering::Relaxed));
    Ok(())
}
```

→ Full runnable example: [`examples/embedded_with_hooks.rs`](examples/embedded_with_hooks.rs)

### 4. Pin a specific browser fingerprint from the catalog

```bash
# Browse 80+ ready-to-use fingerprints
crawlex stealth catalog list
crawlex stealth catalog list --filter chrome
crawlex stealth catalog show chrome-149-linux

# Pin a precise version + OS
crawlex pages run --seed https://target.com \
  --profile chrome-149-linux

# Era fallback: chromium-122 not captured? falls back to closest era + warns
crawlex pages run --seed https://target.com \
  --profile chromium-122-linux

# Mobile persona (touch viewport, sec-ch-ua-mobile: ?1)
crawlex pages run --seed https://target.com \
  --method render --persona pixel
```

### 5. Inspect what your stealth stack actually emits

```bash
# Print active IdentityBundle + TLS profile summary
crawlex stealth inspect --profile chrome-149-linux

# Verify ALPN/cipher/JA4 against built-in expectations
crawlex stealth test

# Compare against tls.peet.ws / ja4db.com via the live oracle
crawlex stealth catalog show chrome-149-linux --json
```

---

## 🎯 Features

<table>
<tr>
<td width="50%" valign="top">

### 🥷 Stealth core
- 🔐 Chrome 149 TLS via BoringSSL fork
- 🚦 H2 pseudo-header order patch
- 🎭 29-section JS shim — full leak inventory covered
- 🤖 Worker scope shim (dedicated / shared / SW)
- 📦 80+ browser fingerprints from curl-impersonate + ja4db + tls.peet
- 🌍 5 personas: `tux`, `office`, `gamer`, `atlas`, `pixel`
- 🎬 Coherent `motion::` profiles (mouse / scroll / dwell)
- 🕸️ WebRTC scrub (SDP, ICE, getStats — public-interface only)

### 🔍 Discovery
- 🗺️ Sitemap recursion + robots.txt parsing
- 🔎 Certificate transparency (crt.sh)
- 🌐 DNS records + RDAP + Wayback CDX
- 📜 PWA manifest + service worker probes
- 📂 `.well-known/*` enumeration
- 🔬 Tech fingerprinting (Wappalyzer-class)
- 🔌 JS endpoint extraction from runtime
- 🛡️ security.txt parser
- 🧬 Asset-ref classification (JS / CSS / image / API / nav)
- 🔓 TCP port scan (opt-in, network-active)

### 🛡️ Antibot policy engine
- 🚧 Detect: Cloudflare, DataDome, PerimeterX, Akamai BMP, Imperva, hCaptcha, reCAPTCHA, Turnstile
- 📊 Vendor telemetry observer (passive — sees outbound calls to known endpoints)
- 🔄 Policy decisions: keep / drop / retry / scope-demote / proxy-rotate / give-up
- 🎯 4 captcha solver adapters: in-house reCAPTCHA v3, 2captcha, anticaptcha, VLM

</td>
<td width="50%" valign="top">

### ⚙️ Pipeline
- 🎯 Render pool — Chromium auto-fetch + isolated user-data dirs
- 🔁 Persistent queue: in-memory / SQLite / Redis backends
- 💾 Storage: filesystem / SQLite / memory — opt-in per concern (artifact, state, challenge, telemetry, intel)
- 🔄 Proxy rotator — health checks + sticky sessions + per-host affinity
- 📊 Web Vitals + per-fetch network breakdown (DNS / TCP / TLS / TTFB / download)
- 🎬 ScriptSpec runner — declarative `Plan` execution with assertions
- 🔧 Frontier with dedupe + rate-limit + retry policies
- 📏 Wait strategies: `Load`, `DOMContentLoaded`, `NetworkIdle`, `Selector`, `Fixed`

### 📡 Observability
- 📜 NDJSON event stream — versioned envelope (`v: 1`)
- 🎬 19 event kinds covering the full lifecycle
- 🔬 Embedded `WebVitals` summary on `render.completed`
- ⏱️ Per-request timings on `fetch.completed` (ALPN, cipher, TLS version)
- 📸 Artifact descriptors with on-disk path on the wire
- 🪝 Hooks: 12 lifecycle points × 3 languages (Rust / JS / Lua)
- 📊 Prometheus metrics endpoint

### 🔌 Integrations
- 📦 npm + crates.io + GitHub Releases
- 🦀 Rust library — embed `Crawler` directly
- 📘 TypeScript types — strict, full envelope coverage
- 🔌 SDK `crawl()` async iterator
- 📚 docsify docs site (GitHub Pages)
- 🧪 386+ lib tests, 27 fpjs compliance cases, TLS catalog roundtrip suite
- 🔁 Optional Lua hooks (`mlua`)
- 🪶 Two binaries: `crawlex` (full) + `crawlex-mini` (HTTP-only, no Chromium)

</td>
</tr>
</table>
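The wait strategies listed under Pipeline are passed to `args.waitStrategy` as externally-tagged JSON strings. Only the `NetworkIdle` payload appears in this README (see the quickstart), so the `Selector` and `Fixed` payloads in this sketch are illustrative assumptions, not the documented schema:

```typescript
// Wait-strategy strings for args.waitStrategy (externally-tagged enum form).
// Only NetworkIdle's field name (idle_ms) comes from this README's quickstart;
// the other payloads below are assumptions following the same pattern.
const networkIdle = JSON.stringify({ NetworkIdle: { idle_ms: 1500 } }); // documented shape
const selector = JSON.stringify({ Selector: { css: '#app' } });         // assumed payload
const fixed = JSON.stringify({ Fixed: { wait_ms: 3000 } });             // assumed payload
```

Check the [config JSON schema](https://forattini-dev.github.io/crawlex/#/reference/config) for the authoritative variant payloads.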

---

## 📡 NDJSON event stream

Every run emits one JSON envelope per line on stdout. Versioned, stable, 19 kinds:

```jsonl
{"v":1,"event":"run.started","ts":"2026-04-26T19:42:00.000Z","run_id":42,"data":{"policy_profile":"strict","max_concurrent_http":8,"max_concurrent_render":2}}
{"v":1,"event":"job.started","run_id":42,"url":"https://target.com/","data":{"job_id":"j_001","method":"render","depth":0,"priority":0,"attempts":0}}
{"v":1,"event":"fetch.completed","run_id":42,"url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"bytes":98234,"body_truncated":false,"dns_ms":12,"tcp_connect_ms":18,"tls_handshake_ms":24,"ttfb_ms":142,"download_ms":83,"total_ms":280,"alpn":"h2","tls_version":"TLSv1.3","cipher":"TLS_AES_128_GCM_SHA256"}}
{"v":1,"event":"render.completed","run_id":42,"session_id":"sess_abc","url":"https://target.com/","data":{"final_url":"https://target.com/","status":200,"manifest":true,"service_workers":1,"is_spa":true,"vitals":{"ttfb_ms":142,"first_contentful_paint_ms":380.5,"largest_contentful_paint_ms":920.1,"cumulative_layout_shift":0.03,"total_blocking_time_ms":50.0,"dom_nodes":1842,"js_heap_used_bytes":12345678,"resource_count":45,"total_transfer_bytes":982341}}}
{"v":1,"event":"artifact.saved","run_id":42,"url":"https://target.com/","data":{"kind":"screenshot.full_page","mime":"image/png","size":1234567,"sha256":"a1b2c3...","path":"artifacts/sess_abc/1714123456_screenshot_full_page_a1b2c3d4.png"}}
{"v":1,"event":"challenge.detected","run_id":42,"url":"https://protected.com/","data":{"vendor":"cloudflare_turnstile","level":"widget_present"}}
{"v":1,"event":"decision.made","run_id":42,"url":"https://protected.com/","why":"render:js-challenge","data":{"decision":"retry","reason":{"code":"render:js-challenge"}}}
{"v":1,"event":"run.completed","run_id":42}
```

**Discriminator key:** `event` (snake_case) — TypeScript narrows via `switch (ev.event) { … }`. Fallback for malformed lines: `{ kind: 'raw', line }` so consumers can log/recover.
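That contract — string `event` discriminator, raw-line fallback — can be sketched as a standalone line decoder (this mirrors the behavior described above; it is not the SDK's own parser):

```typescript
// Minimal decoder for one NDJSON line. A well-formed envelope carries a
// string `event` discriminator; anything else (bad JSON, missing field)
// degrades to { kind: 'raw', line } so consumers can log and move on.
type Envelope = { v: number; event: string; run_id?: number; url?: string; data?: unknown };
type Decoded = Envelope | { kind: 'raw'; line: string };

function decodeLine(line: string): Decoded {
  try {
    const obj = JSON.parse(line);
    if (obj && typeof obj === 'object' && typeof obj.event === 'string') {
      return obj as Envelope;
    }
  } catch {
    // fall through: line was not valid JSON
  }
  return { kind: 'raw', line };
}
```

A consumer can then `switch` on `decoded.event` behind an `'event' in decoded` guard, exactly as the SDK examples above do.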

---

## ๐Ÿช Hooks โ€” 12 lifecycle points ร— 3 languages

```
before_each_request โ†’ after_dns โ†’ after_tls โ†’ after_first_byte โ†’ on_response_body
   โ†’ after_load โ†’ after_idle โ†’ on_discovery โ†’ on_job_start โ†’ on_job_end
   โ†’ on_error โ†’ on_robots_decision
```

| Language | API | Best for |
|---|---|---|
| **Rust** | `hooks.on_after_first_byte(closure)` โ€” full `&mut HookContext` access | Embedded library, latency-critical paths |
| **JS / TS** | `defineHooks({...})` via SDK โ€” IPC bridge, async closures | Production crawls, business logic |
| **Lua** | `--hook-script foo.lua` โ€” page-driving helpers (`page_click`, `page_eval`) | Ad-hoc scripts, no build step |

**All three modes return the same decision:** `continue` / `skip` / `retry` / `abort`. Hooks can mutate `ctx.captured_urls`, inject extra URLs, write to `user_data` to communicate with downstream hooks, or override `robots_allowed`.
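The shared decision vocabulary can be illustrated with a plain hook table shaped like the `defineHooks()` input from the examples above. This is a sketch: `onBeforeEachRequest` is the assumed camelCase spelling of `before_each_request` (only `onAfterFirstByte`, `onDiscovery`, and `onJobStart` appear verbatim in this README), and the context shapes are narrowed to just the fields each hook touches:

```typescript
// Decision strings shared by Rust, JS/TS, and Lua hooks.
type Decision = 'continue' | 'skip' | 'retry' | 'abort';
// A hook may return a bare decision or a decision plus a context patch.
type HookResult = Decision | { decision: Decision; patch?: Record<string, unknown> };

const hooks = {
  // 'skip' drops this one URL without failing the run.
  // NOTE: hook name assumed from the Rust lifecycle list (before_each_request).
  async onBeforeEachRequest(ctx: { url: string }): Promise<HookResult> {
    if (new URL(ctx.url).pathname.startsWith('/logout')) return 'skip';
    return 'continue';
  },
  // 'retry' re-enqueues the job (throttling); 'abort' ends the whole run.
  async onAfterFirstByte(ctx: { response_status: number }): Promise<HookResult> {
    if (ctx.response_status === 429 || ctx.response_status === 503) return 'retry';
    return 'continue';
  },
  // user_data patched here is visible to downstream hooks on the same job.
  async onJobStart(ctx: { user_data: Record<string, unknown> }): Promise<HookResult> {
    return { decision: 'continue', patch: { userData: { ...ctx.user_data, owner: 'qa-bot' } } };
  },
};
```

The table drops straight into `defineHooks({...})`; the typed `HookResult` makes the four-way decision contract explicit at compile time.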

---

## 🎭 Personas — coherent identity bundles

Each persona is a complete bundle — UA + Sec-CH-UA + screen + viewport + DPR + GPU + fonts + media-device counts + TLS profile + motion timings — so every signal **matches**. No mismatched UA + WebGL combo gives you away.

| Codename | OS | GPU | Locale | Form factor |
|---|---|---|---|---|
| 🐧 `tux` | Linux | Intel UHD 630 | en-US | desktop 1920×1080 |
| 🏢 `office` | Windows 10 | Intel UHD 620 | en-US | laptop 1920×1080 (DPR 1.25) |
| 🎮 `gamer` | Windows 10 | NVIDIA GTX 1060 | pt-BR | desktop 1920×1080 |
| 🍎 `atlas` | macOS | Apple M1 | en-US | retina 1440×900 (DPR 2.0) |
| 📱 `pixel` | Android 14 | Adreno 640 | pt-BR | **mobile** 412×823 (DPR 2.625) |

```bash
crawlex pages run --seed https://target.com --persona atlas    # macOS
crawlex pages run --seed https://target.com --persona pixel    # mobile
```

---

## ๐Ÿ—๏ธ Architecture

```mermaid
flowchart LR
  S[Seeds] --> Q[Frontier<br/>+ dedupe + rate-limit]
  Q --> P[Policy Engine]
  P -->|http| F[ImpersonateClient<br/>BoringSSL + h2 patched]
  P -->|render| R[RenderPool<br/>Chromium + stealth shim]
  F --> X[Extractor<br/>+ Asset Refs]
  R --> X
  X --> D[Discovery<br/>Pipeline]
  X --> ST[Storage<br/>5 traits]
  D --> Q
  P --> EV[NDJSON Events<br/>19 kinds]
  R --> H1[Rust Hooks]
  R --> H2[JS Bridge]
  R --> H3[Lua Scripts]
```

**Module map:**
- `impersonate/` — TLS catalog + BoringSSL connector + ALPS + GREASE
- `render/` — Chromium pool + 29-section stealth shim + motion engine + ScriptSpec runner
- `discovery/` — 17-stage pipeline (DNS, RDAP, sitemap, robots, crtsh, wayback, well-known, …)
- `policy/` — pure engine: `decide_pre_fetch`, `decide_post_fetch`, `decide_post_error`, `decide_post_challenge`
- `antibot/` — vendor classifier + 4 captcha solver adapters
- `storage/` — 5 concern-oriented traits (artifact / state / challenge / telemetry / intel)
- `events/` — NDJSON envelope + sink (stdout / null / memory)
- `hooks/` — registry + JS bridge + Lua host

---

## ๐Ÿ› ๏ธ Tech stack

| Layer | Implementation |
|---|---|
| TLS | `boring-sys` โ€” BoringSSL fork with ALPS / permute_extensions / X25519MLKEM768 |
| HTTP/2 | Vendored `h2` crate with pseudo-header order patch (`vendor/h2`) |
| CDP | chromiumoxide-derived, embedded behind `cdp-backend` feature |
| Async | tokio multi-thread |
| Storage | rusqlite (SQLite WAL), DashMap (memory), filesystem layout |
| Discovery | hickory-resolver (DNS), reqwest (RDAP), texting_robots (robots.txt) |
| Lua | mlua 0.10 (optional, `lua-hooks` feature) |
| SDK | Node 20+, CommonJS, zero runtime deps |

**Two binaries** ship from one source tree:
- `crawlex` โ€” **full** build with HTTP impersonation + Chromium rendering + stealth shim + persistent queue
- `crawlex-mini` โ€” **HTTP-only** worker, no Chromium dependency, same CLI surface (browser-only flags return `Error::RenderDisabled`)

---

## 📊 Versus the alternatives

| | crawlex | Playwright stealth | Puppeteer + plugins | curl-impersonate |
|---|:-:|:-:|:-:|:-:|
| TLS-perfect ClientHello | ✅ BoringSSL | ⚠️ relies on Chromium | ⚠️ relies on Chromium | ✅ |
| H2 pseudo-header order | ✅ patched h2 | ⚠️ Chromium default | ⚠️ Chromium default | ❌ |
| 29-section JS leak coverage | ✅ | ⚠️ partial | ⚠️ via plugins | ❌ no JS |
| Worker-scope stealth | ✅ auto-attach | ⚠️ manual | ⚠️ manual | ❌ |
| HTTP-only path (no browser) | ✅ `crawlex-mini` | ❌ | ❌ | ✅ |
| Persistent queue + resume | ✅ SQLite/Redis | ❌ external | ❌ external | ❌ |
| Discovery pipeline | ✅ 17 stages | ❌ | ❌ | ❌ |
| Streaming NDJSON events | ✅ versioned | ❌ | ❌ | ❌ |
| Rust embedding | ✅ | ❌ | ❌ | ⚠️ libcurl |
| Single binary | ✅ | ❌ | ❌ | ✅ |

---

## 📚 Documentation

- 🌐 **[forattini-dev.github.io/crawlex](https://forattini-dev.github.io/crawlex/)** — full docsify hub
- 🏗️ [Architecture overview](https://forattini-dev.github.io/crawlex/#/architecture/00-overview)
- 📖 [CLI reference](https://forattini-dev.github.io/crawlex/#/reference/cli)
- ⚙️ [Config JSON schema](https://forattini-dev.github.io/crawlex/#/reference/config)
- 📡 [NDJSON event envelope](https://forattini-dev.github.io/crawlex/#/reference/events)
- 🎯 [Guides](https://forattini-dev.github.io/crawlex/#/guides/) — HTTP-only, rendered sessions, persistent runs
- 🥷 [Stealth & proxies](https://forattini-dev.github.io/crawlex/#/features/proxy-stealth)

---

## ๐Ÿค Contributing

```bash
git clone https://github.com/forattini-dev/crawlex
cd crawlex

# Unit tests + offline shim compliance
cargo test --lib                    # 386+ tests
cargo test --test fpjs_compliance   # 27 cases
cargo test --test tls_catalog_coverage --test tls_catalog_roundtrip

# SDK tests
pnpm test                           # 21 node:test cases

# Quality gates
cargo fmt --check
cargo clippy --all-features -- -D warnings
cargo publish --dry-run --locked

# Live integration tests (require system Chromium)
cargo test --all-features --test stealth_runtime_live -- --ignored
cargo test --all-features --test worker_shim_live -- --ignored
```

CI runs all of the above on every PR. Contributions welcome — issues, feature requests, and PRs are all reviewed.

---

## 📄 License

Dual-licensed under **MIT OR Apache-2.0** at your option. SPDX: `MIT OR Apache-2.0`.

Third-party attribution: see [`NOTICE`](NOTICE).

---

<div align="center">

<sub>**Built for crawlers who refuse to be detected.**</sub>

[Docs](https://forattini-dev.github.io/crawlex/) · [Releases](https://github.com/forattini-dev/crawlex/releases) · [Issues](https://github.com/forattini-dev/crawlex/issues) · [Discussions](https://github.com/forattini-dev/crawlex/discussions)

</div>