exine 0.1.1

Universal Markdown extraction engine. 37+ formats, zero external dependencies, 10-96× faster than Pandoc and Markitdown.
# PROJECT_STATE.md — Exine Status Report

> **Generated:** 2026-02-18 · **Binary:** `exine` · **Crate:** `exine v0.1.1`
> **License:** MIT

---

## 1. Current Architecture

### Rust Module Map

```
src/
├── main.rs              # Entry point + pipeline orchestration (44 KB)
├── cli.rs               # clap CLI — 11 subcommands (convert, fetch, crawl, serve, render, shield…)
├── config.rs            # ExineConfig — .exine.toml + EXINE_* env vars
├── error.rs             # ExineError (17 variants incl. CaptchaBlocked)
├── lib.rs               # Library root — re-exports ExineConfig, ExineError
├── shield.rs            # Findability Shield — JS snippet + static headers
│
├── extract/             # 29 modules — core extraction engine
│   ├── mod.rs           # Extractor trait + format dispatcher
│   ├── web.rs           # HTML→Markdown (scraper-based, 23 KB)
│   ├── html.rs          # Low-level HTML→GFM converter (24 KB)
│   ├── pdf.rs           # PDF extraction (pdf-extract + lopdf)
│   ├── docx.rs          # DOCX extraction (docx-rs)
│   ├── pptx.rs          # PPTX extraction (quick-xml + zip)
│   ├── xlsx.rs          # XLSX extraction (calamine)
│   ├── epub.rs          # EPUB extraction (zip + quick-xml)
│   ├── rtf.rs           # RTF extraction
│   ├── svg.rs           # SVG extraction
│   ├── text.rs          # Plain text pass-through
│   ├── ocr.rs           # OCR via Tesseract (feature: ocr)
│   ├── audio.rs         # STT via Whisper.cpp (feature: stt)
│   ├── vision.rs        # Vision AI — Gemini, OpenAI, Claude, Mistral
│   ├── stealth.rs       # Stealth HTTP client (UA rotation, retries, fingerprint)
│   ├── browser.rs       # Headless Chrome pool (chromiumoxide, feature: headless)
│   ├── crawl.rs         # Site crawler — hybrid tokio/rayon architecture
│   ├── captcha.rs       # 3-layer CAPTCHA solver (stealth → audio → vision)
│   ├── cache.rs         # SQLite fetch cache (12 KB)
│   ├── cookies.rs       # Cookie jar management for auth sessions
│   ├── interact.rs      # CDP interaction scripting (click, type, scroll)
│   ├── paginate.rs      # Multi-strategy pagination (next-link, infinite scroll)
│   ├── clean.rs         # Data cleaning rules engine
│   ├── download.rs      # File download manager
│   ├── alerts.rs        # Webhook alerting (Slack, Discord, generic HTTP)
│   ├── chunk.rs         # Token-aware markdown chunking
│   ├── metadata.rs      # Frontmatter extraction
│   ├── linkedin.rs      # LinkedIn profile → structured CV/social/graph
│   └── output.rs        # Output formatting (JSON, YAML, CSV)
│
├── privacy/
│   ├── pii.rs           # Regex PII redaction (email, IBAN, phone, CC, IP)
│   ├── byok.rs          # AES-256-GCM encryption (BYOK)
│   └── kms.rs           # KMS integration (AWS, Azure, HashiCorp Vault)
│
├── enrich/
│   ├── mistral.rs       # Mistral AI client (EU-hosted enrichment)
│   └── aeo.rs           # Answer Engine Optimization (FAQ rewriting, schema.org)
│
├── sync/
│   └── mod.rs           # Encrypted upload + X-Delete-Object (Right to Erasure)
│
├── render/
│   └── mod.rs           # Markdown→HTML / Markdown→Document rendering
│
└── dashboard/
    ├── mod.rs           # Dashboard entry point
    └── server.rs        # Axum SPA — browser-side decryption (zero server knowledge)
```
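
The "`Extractor` trait + format dispatcher" role of `extract/mod.rs` can be pictured with a minimal sketch. The trait signature, the `TextExtractor` type, and the `dispatch` function below are hypothetical names chosen for illustration; the real definitions may look quite different.

```rust
use std::path::Path;

/// Hypothetical trait shape; the real `Extractor` in `extract/mod.rs` may differ.
pub trait Extractor {
    fn extract(&self, input: &[u8]) -> Result<String, String>;
}

/// Plain text pass-through, mirroring the role described for `extract/text.rs`.
pub struct TextExtractor;

impl Extractor for TextExtractor {
    fn extract(&self, input: &[u8]) -> Result<String, String> {
        String::from_utf8(input.to_vec()).map_err(|e| e.to_string())
    }
}

/// Pick an extractor from the file extension (case-insensitive).
/// Returns `None` for unknown or missing extensions.
pub fn dispatch(path: &Path) -> Option<Box<dyn Extractor>> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "txt" | "md" => Some(Box::new(TextExtractor)),
        // Other formats (pdf, docx, ...) would map to their own extractors.
        _ => None,
    }
}
```

A boxed trait object keeps the dispatcher open to new formats without changing call sites; an enum of extractors would be an equally plausible design.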

### JS Tag Structure (Findability Shield)

The `shield.rs` module generates a JavaScript snippet (under 2 KB) for publishers:

| Component | Purpose |
|---|---|
| **Bot Detection** | Regex matching 13 AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended…) + custom bots |
| **`<link rel="alternate">`** | Injects `type="text/markdown"` pointing to CDN-hosted Smart Markdown |
| **`X-Markdown` meta** | HTTP-equivalent header for bot consumption |
| **JSON-LD** | `schema.org/WebPage` with `MediaObject` encoding reference |
| **Static Headers** | Server-side alternative for non-JS environments |
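
The bot detection and `<link rel="alternate">` components can be sketched in Rust as follows. This is an illustration only: the shipped snippet is JavaScript and uses a regex, the list here contains just the four crawler names mentioned in the table, and `is_ai_crawler` and `alternate_link` are hypothetical helper names.

```rust
/// A few of the crawler tokens named in the table; the real list has 13 entries.
const AI_CRAWLERS: &[&str] = &["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"];

/// Case-insensitive substring check of `ua` (already lowercased) against `bots`.
fn contains_any(ua: &str, bots: &[&str]) -> bool {
    bots.iter()
        .any(|bot| ua.contains(bot.to_ascii_lowercase().as_str()))
}

/// True when the User-Agent matches a built-in or custom bot token.
pub fn is_ai_crawler(user_agent: &str, custom_bots: &[&str]) -> bool {
    let ua = user_agent.to_ascii_lowercase();
    contains_any(&ua, AI_CRAWLERS) || contains_any(&ua, custom_bots)
}

/// Build the `<link rel="alternate">` tag pointing at the Markdown copy.
/// `markdown_url` is assumed to be a trusted, already-encoded URL.
pub fn alternate_link(markdown_url: &str) -> String {
    format!(
        r#"<link rel="alternate" type="text/markdown" href="{}">"#,
        markdown_url
    )
}
```

Substring matching is deliberately loose here; a production matcher would likely anchor on token boundaries to avoid false positives.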

### Cloud Bridge Logic (Sync Module)

| Aspect | Implementation |
|---|---|
| **Transport** | `reqwest` POST to configurable `EXINE_SYNC_ENDPOINT` |
| **Encryption** | Client-side AES-256-GCM via BYOK — server only sees encrypted blobs |
| **Right to Erasure** | Cryptographic `X-Delete-Object` token per upload |
| **Provider Header** | `X-Encryption-Provider` identifies key source |
| **Dashboard** | Axum SPA with browser-side decryption (zero-knowledge server) |
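
As a rough sketch of how the two custom headers might be assembled for an upload: the header names come from the table above and the provider strings from the BYOK section of this report, while the `KeyProvider` enum, the `sync_headers` function, and the idea that the delete token travels as a request header are assumptions made for illustration. Token generation and the actual `reqwest` call are out of scope here.

```rust
/// Key sources named in this report; `as_str` reuses those names as header values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum KeyProvider {
    Local,
    AwsKms,
    AzureKeyVault,
    HashicorpVault,
}

impl KeyProvider {
    pub fn as_str(self) -> &'static str {
        match self {
            KeyProvider::Local => "local",
            KeyProvider::AwsKms => "aws-kms",
            KeyProvider::AzureKeyVault => "azure-keyvault",
            KeyProvider::HashicorpVault => "hashicorp-vault",
        }
    }
}

/// Assemble the custom headers for one encrypted upload.
/// `delete_token` is assumed to be generated elsewhere (hypothetical input).
pub fn sync_headers(provider: KeyProvider, delete_token: &str) -> Vec<(&'static str, String)> {
    vec![
        ("X-Encryption-Provider", provider.as_str().to_string()),
        ("X-Delete-Object", delete_token.to_string()),
    ]
}
```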

---

## 2. Key Files

### Core Engine

| File | Size | Purpose |
|---|---|---|
| `src/main.rs` | 44 KB | Pipeline orchestration — routes all 11 commands |
| `src/cli.rs` | 12 KB | clap CLI definition (Parser + Subcommand) |
| `src/config.rs` | 4 KB | `.exine.toml` + 7 env var overrides |
| `src/error.rs` | 2 KB | `ExineError` — 17 typed error variants |
| `src/lib.rs` | 257 B | Library root re-exports |
| `Cargo.toml` | 3 KB | 40+ dependencies, 6 feature flags |

### Scraper / Extraction

| File | Size | Purpose |
|---|---|---|
| `extract/web.rs` | 23 KB | URL fetch + HTML→Markdown with selectors |
| `extract/html.rs` | 24 KB | Low-level HTML→GFM converter (tables, lists, code blocks) |
| `extract/stealth.rs` | 17 KB | Stealth HTTP client (UA rotation, proxy, headers, retries) |
| `extract/browser.rs` | 9 KB | Headless Chrome pool via chromiumoxide |
| `extract/crawl.rs` | 13 KB | Site crawler — hybrid tokio (I/O) + rayon (CPU) |
| `extract/captcha.rs` | 25 KB | 3-layer CAPTCHA solver (stealth patches, audio STT, vision AI) |
| `extract/linkedin.rs` | 19 KB | LinkedIn profile → structured CV/social/graph |
| `extract/cache.rs` | 12 KB | SQLite fetch cache with SHA-256 dedup |
| `extract/cookies.rs` | 9 KB | Cookie jar for authenticated sessions |
| `extract/interact.rs` | 7 KB | CDP scripting (click, type, wait, scroll) |
| `extract/paginate.rs` | 7 KB | Multi-strategy pagination handler |
| `extract/clean.rs` | 7 KB | Data cleaning rules engine |
| `extract/download.rs` | 9 KB | File download manager |
| `extract/alerts.rs` | 5 KB | Webhook notifications (Slack, Discord, HTTP) |

### Privacy / Compliance

| File | Size | Purpose |
|---|---|---|
| `privacy/pii.rs` | 5 KB | Regex PII redaction (6 patterns) |
| `privacy/byok.rs` | 6 KB | AES-256-GCM encrypt/decrypt with local key files |
| `privacy/kms.rs` | 11 KB | KMS abstraction (AWS, Azure, HashiCorp Vault) |
| `shield.rs` | 6 KB | Findability Shield JS snippet generator |

### Supporting Files

| File | Purpose |
|---|---|
| `site/index.html` | Landing page (29 KB) |
| `tests/cli_test.rs` | 10 integration tests |
| `benches/*.rs` | 4 Criterion benchmarks (extraction, privacy, pipeline, parser) |

---

## 3. Rule Alignment

> **Note:** No `.antigravity/rules.md` file exists in the workspace. The following documents how the codebase embodies the GDPR-first, Mistral-EU, and BYOK principles.

### GDPR Compliance

| Rule | Implementation | Files |
|---|---|---|
| **Zero-Persistence** | All processing is in-memory; no temp files are retained | `main.rs`, `extract/*.rs` |
| **PII Redaction** | `--redact` flag masks emails, IBANs, phones, CCs, IPs before any cloud call | `privacy/pii.rs` |
| **Right to Erasure** | Every sync upload includes cryptographic `X-Delete-Object` token | `sync/mod.rs` |
| **Data Minimization** | Extraction yields only structured markdown, stripping scripts/styles/tracking | `extract/html.rs`, `extract/web.rs` |
| **Regional Pinning** | Mistral region defaults to `eu-west`, configurable via `EXINE_MISTRAL_REGION` | `config.rs` |
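
To make the PII redaction row concrete, here is a dependency-free sketch that masks email addresses and IPv4 addresses token by token. It is a stand-in, not the approach in `privacy/pii.rs`, which is regex-based; the `[EMAIL]` and `[IP]` placeholders and both function names are assumptions.

```rust
use std::net::Ipv4Addr;

/// Very rough email shape check: `local@domain.tld` with no second '@'.
fn looks_like_email(token: &str) -> bool {
    match token.split_once('@') {
        Some((local, domain)) => {
            !local.is_empty()
                && !domain.contains('@')
                && domain.contains('.')
                && !domain.starts_with('.')
                && !domain.ends_with('.')
        }
        None => false,
    }
}

/// Mask emails and IPv4 addresses while preserving whitespace and
/// trailing punctuation.
pub fn redact(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for piece in text.split_inclusive(char::is_whitespace) {
        let token = piece.trim_end();
        let trailing_ws = &piece[token.len()..];
        // Set sentence punctuation aside before classifying the token.
        let core = token.trim_end_matches(|c: char| matches!(c, ',' | ';' | ':' | '.' | ')'));
        let punct = &token[core.len()..];
        if looks_like_email(core) {
            out.push_str("[EMAIL]");
        } else if core.parse::<Ipv4Addr>().is_ok() {
            out.push_str("[IP]");
        } else {
            out.push_str(core);
        }
        out.push_str(punct);
        out.push_str(trailing_ws);
    }
    out
}
```

Token-based matching misses PII embedded inside longer strings (for example inside URLs), which is one reason a regex pass is the more common choice.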

### Mistral AI (EU-Hosted)

| Rule | Implementation | Files |
|---|---|---|
| **EU Residency** | Default region `eu-west`, customizable to Azure Germany West | `config.rs` |
| **Enrichment** | Structural improvement + metadata extraction via `mistral-large-latest` | `enrich/mistral.rs` |
| **AEO** | FAQ rewriting + schema.org hints (local mode + Mistral mode) | `enrich/aeo.rs` |
| **PII-First** | Redaction pass runs BEFORE enrichment — no raw PII hits the API | `main.rs` pipeline order |
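
The PII-first ordering can be expressed as a small function in which the enrichment step only ever receives the output of the redaction step. This is a sketch of the ordering guarantee, not the code in `main.rs`; `run_pipeline` and its closure parameters are hypothetical.

```rust
/// Run an optional redaction pass, then hand only its output to `enrich`.
/// `redact` and `enrich` are stand-ins for the real pipeline stages.
pub fn run_pipeline<R, E>(markdown: &str, redact: Option<R>, mut enrich: E) -> String
where
    R: Fn(&str) -> String,
    E: FnMut(&str) -> String,
{
    // Redaction runs first, so `enrich` never sees the raw input
    // when a redactor is supplied.
    let safe = match redact {
        Some(r) => r(markdown),
        None => markdown.to_string(),
    };
    enrich(&safe)
}
```

Because `enrich` takes its input from `safe` rather than from `markdown`, the ordering is enforced by data flow instead of by convention.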

### BYOK (Bring Your Own Key)

| Rule | Implementation | Files |
|---|---|---|
| **AES-256-GCM** | Client-side encryption with user-controlled keys | `privacy/byok.rs` |
| **Provider Abstraction** | `local`, `aws-kms`, `azure-keyvault`, `hashicorp-vault` | `privacy/kms.rs` |
| **Zero-Knowledge Sync** | Dashboard server never sees plaintext — browser-side decryption | `dashboard/server.rs` |
| **Key Path Config** | `EXINE_BYOK_KEY_PATH` or `.exine.toml` `byok_key_path` | `config.rs` |
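
Since environment variables are described as overrides for `.exine.toml`, the key path lookup plausibly follows an "env first, then file" rule. The sketch below assumes exactly that precedence and treats empty strings as unset; `resolve` and `byok_key_path` are hypothetical names, and TOML parsing is left out.

```rust
/// Resolve one setting: a non-empty env value wins over a non-empty file value.
pub fn resolve(env_value: Option<&str>, file_value: Option<&str>) -> Option<String> {
    env_value
        .filter(|v| !v.is_empty())
        .or(file_value.filter(|v| !v.is_empty()))
        .map(str::to_string)
}

/// Look up the BYOK key path, given the value (if any) parsed from `.exine.toml`.
pub fn byok_key_path(file_value: Option<&str>) -> Option<String> {
    let env_value = std::env::var("EXINE_BYOK_KEY_PATH").ok();
    resolve(env_value.as_deref(), file_value)
}
```

Keeping `resolve` free of environment access makes the precedence rule easy to unit-test without mutating process state.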

---

## 4. Pending Tasks

### Findability Shield — Next Steps

| Priority | Task | Details |
|---|---|---|
| ✅ Done | **CDN upload integration** | Wire `shield` command to actually push `.md` files to S3/R2/Cloudflare Pages |
| ✅ Done | **robots.txt generator** | Generate AI-bot-friendly `robots.txt` with markdown sitemaps |
| 🟡 Medium | **Shield analytics** | Track bot hit rates via lightweight ping-back (optional, privacy-aware) |
| 🟡 Medium | **Cache-busting** | Content-hash filenames for CDN cache invalidation |
| 🟢 Low | **WordPress plugin** | Package the shield snippet as a WP plugin |
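
For the pending cache-busting task, one possible shape is to embed a content hash in each uploaded filename so that changed content gets a new URL. The sketch below uses a hand-written 64-bit FNV-1a hash to stay dependency-free; that choice, the `stem.hash.md` naming scheme, and both function names are assumptions, and a real implementation might prefer SHA-256, which the fetch cache already uses.

```rust
/// 64-bit FNV-1a. Non-cryptographic; used here only to derive a stable name.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Build a filename of the form `<stem>.<16 hex digits>.md` from the content.
pub fn hashed_filename(stem: &str, content: &[u8]) -> String {
    format!("{}.{:016x}.md", stem, fnv1a64(content))
}
```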

### Kickass Scraper — Next Steps

| Priority | Task | Details |
|---|---|---|
| ✅ Done | **Wire CAPTCHA solver** | Integrate `captcha.rs` orchestrator into `stealth.rs` fetch loop (detect → solve → retry) |
| ✅ Done | **Pagination + infinite scroll** | Connect `paginate.rs` to `crawl.rs` for auto-paginated crawls |
| 🟡 Medium | **Cookie auth flow** | End-to-end test: login via cookies → scrape authenticated pages |
| 🟡 Medium | **Dropdown interaction** | Wire `interact.rs` actions into `crawl.rs` for dynamic content |
| 🟡 Medium | **Download manager** | Integrate `download.rs` into crawl pipeline for file harvesting |
| 🟢 Low | **Rate limiting** | Adaptive delay based on `429` / `Retry-After` headers |
| 🟢 Low | **Crawl resume** | SQLite checkpoint for resumable crawls after interruption |
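
The pending rate-limiting task could take roughly the following form: honor `Retry-After` when it is a plain number of seconds, otherwise fall back to capped exponential backoff. This is only a sketch of one option; the 60-second cap, the function name, and the decision to ignore the HTTP-date form of `Retry-After` are all assumptions.

```rust
use std::time::Duration;

/// Upper bound on any single delay (an arbitrary choice for this sketch).
const MAX_DELAY_SECS: u64 = 60;

/// Decide how long to wait before retrying, or `None` if no wait is needed.
/// `attempt` is zero-based.
pub fn retry_delay(status: u16, retry_after: Option<&str>, attempt: u32) -> Option<Duration> {
    if status != 429 {
        return None;
    }
    // Prefer the server's hint when it is a plain number of seconds.
    if let Some(secs) = retry_after.and_then(|v| v.trim().parse::<u64>().ok()) {
        return Some(Duration::from_secs(secs.min(MAX_DELAY_SECS)));
    }
    // Otherwise use exponential backoff: 1s, 2s, 4s, ... up to the cap.
    let backoff = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    Some(Duration::from_secs(backoff.min(MAX_DELAY_SECS)))
}
```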

### Audit & Security (Round 2) — Completed

| Priority | Task | Details |
|---|---|---|
| ✅ Fixed | **Stored XSS** | Dashboard template injection remediated via DOM methods |
| ✅ Fixed | **Auth Enforcement** | Frontend now sends tokens; backend enforces checks |
| ✅ Fixed | **Right to Erasure** | Cryptographic delete tokens surfaced to sync callers |
| ✅ Verified | **Cookie Security** | `Secure`, `Path`, `Expires` attributes strictly enforced |
| ✅ Verified | **Race Conditions** | Async `tokio::Mutex` + atomic checks for document caps |
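
The "atomic checks for document caps" item suggests a check-and-increment that cannot be split by a concurrent caller. A minimal std-only sketch of that idea follows; it uses `AtomicUsize::fetch_update` and omits the `tokio::Mutex` part entirely, and the `DocumentCap` type is a hypothetical name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counter that never exceeds `limit`, even under concurrent callers.
pub struct DocumentCap {
    count: AtomicUsize,
    limit: usize,
}

impl DocumentCap {
    pub fn new(limit: usize) -> Self {
        Self { count: AtomicUsize::new(0), limit }
    }

    /// Atomically claim one slot; returns false once the cap is reached.
    pub fn try_reserve(&self) -> bool {
        self.count
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.limit { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Number of slots claimed so far.
    pub fn used(&self) -> usize {
        self.count.load(Ordering::Acquire)
    }
}
```

A separate "load, compare, then store" sequence would let two callers both pass the check at `limit - 1`; folding the comparison into the atomic update closes that window.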

---

## 5. Benchmarking

### Competitive Results (`benches/competitive/results.json`)

#### HTML Extraction — Time to Markdown (ms)

| Fixture | Exine | Pandoc | Markitdown | html2text |
|---|---|---|---|---|
| small (7 KB) | **4.9** | 50.2 | 442.8 | 41.7 |
| medium (33 KB) | **5.2** | 86.2 | 466.6 | 52.3 |
| large (133 KB) | **6.3** | 236.6 | 604.9 | 104.4 |
| xlarge (332 KB) | **9.4** | 521.5 | 686.1 | 156.5 |

#### Throughput (MiB/s)

| Fixture | Exine | Pandoc | Markitdown | html2text |
|---|---|---|---|---|
| small (7 KB) | **1.37** | 0.13 | 0.02 | 0.16 |
| medium (33 KB) | **6.29** | 0.38 | 0.07 | 0.62 |
| large (133 KB) | **20.55** | 0.55 | 0.21 | 1.24 |
| xlarge (332 KB) | **34.37** | 0.62 | 0.47 | 2.07 |

#### Text Extraction (ms)

| Fixture | Exine | Markitdown |
|---|---|---|
| small (7 KB) | **4.0** | 457.1 |
| medium (34 KB) | **4.6** | 432.8 |
| large (172 KB) | **4.5** | 456.3 |
| xlarge (690 KB) | **5.8** | 442.5 |

#### DOCX Extraction (ms)

| Fixture | Exine | Pandoc |
|---|---|---|
| small (36 KB) | **20.1** | 72.5 |
| medium (36 KB) | **21.4** | 87.3 |
| large (37 KB) | **24.9** | 118.2 |

### Summary

| Metric | vs Pandoc | vs Markitdown | vs html2text |
|---|---|---|---|
| **HTML speed** | **10–55x faster** | **73–96x faster** | **8–17x faster** |
| **Text speed** | N/A | **76–114x faster** | N/A |
| **DOCX speed** | **3.6–4.7x faster** | N/A | N/A |
| **Peak throughput** | 34.4 MiB/s (Exine, xlarge HTML) | | |
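
The summary ratios are simple quotients of the timings in the tables above, and throughput is size divided by time. The helpers below show the arithmetic; they are illustrative and not part of the benchmark harness, and the throughput helper assumes fixture sizes are given in KiB.

```rust
/// How many times faster Exine is: competitor time divided by Exine time.
pub fn speedup(competitor_ms: f64, exine_ms: f64) -> f64 {
    competitor_ms / exine_ms
}

/// Throughput in MiB/s from a size in KiB and an elapsed time in milliseconds.
pub fn throughput_mib_s(size_kib: f64, elapsed_ms: f64) -> f64 {
    (size_kib / 1024.0) / (elapsed_ms / 1000.0)
}
```

For example, `speedup(50.2, 4.9)` is about 10.2 (small HTML fixture vs Pandoc) and `speedup(604.9, 6.3)` is about 96.0 (large HTML fixture vs Markitdown). `throughput_mib_s(332.0, 9.4)` gives about 34.5 MiB/s, close to the 34.37 reported; the small gap is presumably due to the fixture sizes and timings being shown rounded.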

### Criterion Benchmarks (Rust-native)

Four Criterion benchmark suites are configured in `Cargo.toml`:

| Bench | Scope |
|---|---|
| `extraction_bench` | Per-format extraction latency |
| `privacy_bench` | PII redaction throughput |
| `pipeline_bench` | End-to-end pipeline (extract → redact → enrich) |
| `parser_bench` | HTML parser microbenchmarks |

Run with: `cargo bench`

---

## 6. Test Coverage

```
cargo test → 192 tests (182 lib + 10 CLI integration)
All passing ✓
```

| Module | Tests |
|---|---|
| `extract::captcha` | 22 (detection, stealth, audio, vision, strategy) |
| `extract::html` | Multi (GFM table, heading, code block, list fidelity) |
| `extract::crawl` | 3 (config, link extraction, frontmatter) |
| `extract::cookies` | 8 (parsing, domain, secure, path, expires, roundtrip) |
| `extract::ocr_cli` | 5 (table, heading, paragraph, single-line regression) |
| `privacy::pii` | 7 (email, IBAN, phone, CC, IP, selective, no-PII) |
| `shield` | 8 (size, bots, link-alternate, X-Markdown, CDN, extras, static headers) |
| `enrich::aeo` | 5 (header→question, header extraction, local summary) |
| `config` | 1 (defaults) |
| `cli_test.rs` | 10 (help, version, convert, redact, HTML, output file, errors) |

---

## 7. Feature Flags

| Flag | Default | Description |
|---|---|---|
| `dashboard` || Axum web dashboard + `serve` command |
| `headless` || Chromiumoxide headless browser pool |
| `ocr` || Tesseract OCR (requires `libtesseract`) |
| `stt` || Whisper.cpp speech-to-text (requires model file) |
| `vision` || Vision AI extraction (Gemini, OpenAI, Claude, Mistral) |
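
Optional features like these are typically gated with `#[cfg(feature = "...")]`. The sketch below shows the general pattern for the `ocr` flag; the function names and error messages are hypothetical, and the Tesseract call itself is not shown.

```rust
/// True when the crate was compiled with the `ocr` feature enabled.
pub fn ocr_enabled() -> bool {
    cfg!(feature = "ocr")
}

#[cfg(feature = "ocr")]
pub fn extract_image(_bytes: &[u8]) -> Result<String, String> {
    // A real build would call into Tesseract here; this sketch does not.
    Err("OCR backend not wired up in this sketch".to_string())
}

#[cfg(not(feature = "ocr"))]
pub fn extract_image(_bytes: &[u8]) -> Result<String, String> {
    // Without the feature, fail with a hint instead of a missing symbol.
    Err("built without the `ocr` feature; rebuild with `--features ocr`".to_string())
}
```

Providing a fallback under `#[cfg(not(...))]` keeps call sites compiling in every feature combination, with `cargo build --features ocr` switching in the real implementation.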

---

*End of PROJECT_STATE.md*