# PROJECT_STATE.md — Exine Status Report
> **Generated:** 2026-02-18 · **Binary:** `exine` · **Crate:** `exine v0.1.1`
> **License:** MIT
---
## 1. Current Architecture
### Rust Module Map
```
src/
├── main.rs # Entry point + pipeline orchestration (44 KB)
├── cli.rs # clap CLI — 11 subcommands (convert, fetch, crawl, serve, render, shield…)
├── config.rs # ExineConfig — .exine.toml + EXINE_* env vars
├── error.rs # ExineError (17 variants incl. CaptchaBlocked)
├── lib.rs # Library root — re-exports ExineConfig, ExineError
├── shield.rs # Findability Shield — JS snippet + static headers
│
├── extract/ # 29 modules — core extraction engine
│ ├── mod.rs # Extractor trait + format dispatcher
│ ├── web.rs # HTML→Markdown (scraper-based, 23 KB)
│ ├── html.rs # Low-level HTML→GFM converter (24 KB)
│ ├── pdf.rs # PDF extraction (pdf-extract + lopdf)
│ ├── docx.rs # DOCX extraction (docx-rs)
│ ├── pptx.rs # PPTX extraction (quick-xml + zip)
│ ├── xlsx.rs # XLSX extraction (calamine)
│ ├── epub.rs # EPUB extraction (zip + quick-xml)
│ ├── rtf.rs # RTF extraction
│ ├── svg.rs # SVG extraction
│ ├── text.rs # Plain text pass-through
│ ├── ocr.rs # OCR via Tesseract (feature: ocr)
│ ├── audio.rs # STT via Whisper.cpp (feature: stt)
│ ├── vision.rs # Vision AI — Gemini, OpenAI, Claude, Mistral
│ ├── stealth.rs # Stealth HTTP client (UA rotation, retries, fingerprint)
│ ├── browser.rs # Headless Chrome pool (chromiumoxide, feature: headless)
│ ├── crawl.rs # Site crawler — hybrid tokio/rayon architecture
│ ├── captcha.rs # 3-layer CAPTCHA solver (stealth → audio → vision)
│ ├── cache.rs # SQLite fetch cache (12 KB)
│ ├── cookies.rs # Cookie jar management for auth sessions
│ ├── interact.rs # CDP interaction scripting (click, type, scroll)
│ ├── paginate.rs # Multi-strategy pagination (next-link, infinite scroll)
│ ├── clean.rs # Data cleaning rules engine
│ ├── download.rs # File download manager
│ ├── alerts.rs # Webhook alerting (Slack, Discord, generic HTTP)
│ ├── chunk.rs # Token-aware markdown chunking
│ ├── metadata.rs # Frontmatter extraction
│ ├── linkedin.rs # LinkedIn profile → structured CV/social/graph
│ └── output.rs # Output formatting (JSON, YAML, CSV)
│
├── privacy/
│ ├── pii.rs # Regex PII redaction (email, IBAN, phone, CC, IP)
│ ├── byok.rs # AES-256-GCM encryption (BYOK)
│ └── kms.rs # KMS integration (AWS, Azure, HashiCorp Vault)
│
├── enrich/
│ ├── mistral.rs # Mistral AI client (EU-hosted enrichment)
│ └── aeo.rs # Answer Engine Optimization (FAQ rewriting, schema.org)
│
├── sync/
│ └── mod.rs # Encrypted upload + X-Delete-Object (Right to Erasure)
│
├── render/
│ └── mod.rs # Markdown→HTML / Markdown→Document rendering
│
└── dashboard/
├── mod.rs # Dashboard entry point
└── server.rs # Axum SPA — browser-side decryption (zero server knowledge)
```
### JS Tag Structure (Findability Shield)
The `shield.rs` module generates a `<2 KB` JavaScript snippet for publishers:
| **Bot Detection** | Regex matching 13 AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended…) + custom bots |
| **`<link rel="alternate">`** | Injects `type="text/markdown"` pointing to CDN-hosted Smart Markdown |
| **`X-Markdown` meta** | HTTP-equivalent header for bot consumption |
| **JSON-LD** | `schema.org/WebPage` with `MediaObject` encoding reference |
| **Static Headers** | Server-side alternative for non-JS environments |
### Cloud Bridge Logic (Sync Module)
| **Transport** | `reqwest` POST to configurable `EXINE_SYNC_ENDPOINT` |
| **Encryption** | Client-side AES-256-GCM via BYOK — server only sees encrypted blobs |
| **Right to Erasure** | Cryptographic `X-Delete-Object` token per upload |
| **Provider Header** | `X-Encryption-Provider` identifies key source |
| **Dashboard** | Axum SPA with browser-side decryption (zero-knowledge server) |
---
## 2. Key Files
### Core Engine
| `src/main.rs` | 44 KB | Pipeline orchestration — routes all 11 commands |
| `src/cli.rs` | 12 KB | clap CLI definition (Parser + Subcommand) |
| `src/config.rs` | 4 KB | `.exine.toml` + 7 env var overrides |
| `src/error.rs` | 2 KB | `ExineError` — 17 typed error variants |
| `src/lib.rs` | 257 B | Library root re-exports |
| `Cargo.toml` | 3 KB | 40+ dependencies, 6 feature flags |
### Scraper / Extraction
| `extract/web.rs` | 23 KB | URL fetch + HTML→Markdown with selectors |
| `extract/html.rs` | 24 KB | Low-level HTML→GFM converter (tables, lists, code blocks) |
| `extract/stealth.rs` | 17 KB | Stealth HTTP client (UA rotation, proxy, headers, retries) |
| `extract/browser.rs` | 9 KB | Headless Chrome pool via chromiumoxide |
| `extract/crawl.rs` | 13 KB | Site crawler — hybrid tokio (I/O) + rayon (CPU) |
| `extract/captcha.rs` | 25 KB | 3-layer CAPTCHA solver (stealth patches, audio STT, vision AI) |
| `extract/linkedin.rs` | 19 KB | LinkedIn profile → structured CV/social/graph |
| `extract/cache.rs` | 12 KB | SQLite fetch cache with SHA-256 dedup |
| `extract/cookies.rs` | 9 KB | Cookie jar for authenticated sessions |
| `extract/interact.rs` | 7 KB | CDP scripting (click, type, wait, scroll) |
| `extract/paginate.rs` | 7 KB | Multi-strategy pagination handler |
| `extract/clean.rs` | 7 KB | Data cleaning rules engine |
| `extract/download.rs` | 9 KB | File download manager |
| `extract/alerts.rs` | 5 KB | Webhook notifications (Slack, Discord, HTTP) |
### Privacy / Compliance
| `privacy/pii.rs` | 5 KB | Regex PII redaction (6 patterns) |
| `privacy/byok.rs` | 6 KB | AES-256-GCM encrypt/decrypt with local key files |
| `privacy/kms.rs` | 11 KB | KMS abstraction (AWS, Azure, HashiCorp Vault) |
| `shield.rs` | 6 KB | Findability Shield JS snippet generator |
### Supporting Files
| `site/index.html` | Landing page (29 KB) |
| `tests/cli_test.rs` | 10 integration tests |
| `benches/*.rs` | 4 Criterion benchmarks (extraction, privacy, pipeline, parser) |
---
## 3. Rule Alignment
> **Note:** No `.antigravity/rules.md` file exists in the workspace. The following documents how the codebase embodies the GDPR-first, Mistral-EU, and BYOK principles.
### GDPR Compliance
| **Zero-Persistence** | All processing is in-memory, no temp files retained | `main.rs`, `extract/*.rs` |
| **PII Redaction** | `--redact` flag masks emails, IBANs, phones, CCs, IPs before any cloud call | `privacy/pii.rs` |
| **Right to Erasure** | Every sync upload includes cryptographic `X-Delete-Object` token | `sync/mod.rs` |
| **Data Minimization** | Extraction yields only structured markdown, stripping scripts/styles/tracking | `extract/html.rs`, `extract/web.rs` |
| **Regional Pinning** | Mistral region defaults to `eu-west`, configurable via `EXINE_MISTRAL_REGION` | `config.rs` |
### Mistral AI (EU-Hosted)
| **EU Residency** | Default region `eu-west`, customizable to Azure Germany West | `config.rs` |
| **Enrichment** | Structural improvement + metadata extraction via `mistral-large-latest` | `enrich/mistral.rs` |
| **AEO** | FAQ rewriting + schema.org hints (local mode + Mistral mode) | `enrich/aeo.rs` |
| **PII-First** | Redaction pass runs BEFORE enrichment — no raw PII hits the API | `main.rs` pipeline order |
### BYOK (Bring Your Own Key)
| **AES-256-GCM** | Client-side encryption with user-controlled keys | `privacy/byok.rs` |
| **Provider Abstraction** | `local`, `aws-kms`, `azure-keyvault`, `hashicorp-vault` | `privacy/kms.rs` |
| **Zero-Knowledge Sync** | Dashboard server never sees plaintext — browser-side decryption | `dashboard/server.rs` |
| **Key Path Config** | `EXINE_BYOK_KEY_PATH` or `.exine.toml` `byok_key_path` | `config.rs` |
---
## 4. Pending Tasks
### Findability Shield — Next Steps
| ✅ Done | **CDN upload integration** | Wire `shield` command to actually push `.md` files to S3/R2/Cloudflare Pages |
| ✅ Done | **robots.txt generator** | Generate AI-bot-friendly `robots.txt` with markdown sitemaps |
| 🟡 Medium | **Shield analytics** | Track bot hit rates via lightweight ping-back (optional, privacy-aware) |
| 🟡 Medium | **Cache-busting** | Content-hash filenames for CDN cache invalidation |
| 🟢 Low | **WordPress plugin** | Package the shield snippet as a WP plugin |
### Kickass Scraper — Next Steps
| ✅ Done | **Wire CAPTCHA solver** | Integrate `captcha.rs` orchestrator into `stealth.rs` fetch loop (detect → solve → retry) |
| ✅ Done | **Pagination + infinite scroll** | Connect `paginate.rs` to `crawl.rs` for auto-paginated crawls |
| 🟡 Medium | **Cookie auth flow** | End-to-end test: login via cookies → scrape authenticated pages |
| 🟡 Medium | **Dropdown interaction** | Wire `interact.rs` actions into `crawl.rs` for dynamic content |
| 🟡 Medium | **Download manager** | Integrate `download.rs` into crawl pipeline for file harvesting |
| 🟢 Low | **Rate limiting** | Adaptive delay based on `429` / `Retry-After` headers |
| 🟢 Low | **Crawl resume** | SQLite checkpoint for resumable crawls after interruption |
### Audit & Security (Round 2) — Completed
| ✅ Fixed | **Stored XSS** | Dashboard template injection remediated via DOM methods |
| ✅ Fixed | **Auth Enforcement** | Frontend now sends tokens; backend enforces checks |
| ✅ Fixed | **Right to Erasure** | Cryptographic delete tokens surfaced to sync callers |
| ✅ Verified | **Cookie Security** | `Secure`, `Path`, `Expires` attributes strictly enforced |
| ✅ Verified | **Race Conditions** | Async `tokio::Mutex` + atomic checks for document caps |
---
## 5. Benchmarking
### Competitive Results (`benches/competitive/results.json`)
#### HTML Extraction — Time to Markdown (ms)
| small (7 KB) | **4.9** | 50.2 | 442.8 | 41.7 |
| medium (33 KB) | **5.2** | 86.2 | 466.6 | 52.3 |
| large (133 KB) | **6.3** | 236.6 | 604.9 | 104.4 |
| xlarge (332 KB) | **9.4** | 521.5 | 686.1 | 156.5 |
#### Throughput (MiB/s)
| small (7 KB) | **1.37** | 0.13 | 0.02 | 0.16 |
| medium (33 KB) | **6.29** | 0.38 | 0.07 | 0.62 |
| large (133 KB) | **20.55** | 0.55 | 0.21 | 1.24 |
| xlarge (332 KB) | **34.37** | 0.62 | 0.47 | 2.07 |
#### Text Extraction (ms)
| small (7 KB) | **4.0** | 457.1 |
| medium (34 KB) | **4.6** | 432.8 |
| large (172 KB) | **4.5** | 456.3 |
| xlarge (690 KB) | **5.8** | 442.5 |
#### DOCX Extraction (ms)
| small (36 KB) | **20.1** | 72.5 |
| medium (36 KB) | **21.4** | 87.3 |
| large (37 KB) | **24.9** | 118.2 |
### Summary
| **HTML speed** | **10–55x faster** | **73–96x faster** | **8–17x faster** |
| **Text speed** | N/A | **77–114x faster** | N/A |
| **DOCX speed** | **3.6–4.7x faster** | N/A | N/A |
| **Peak throughput** | 34.4 MiB/s | — | — |
### Criterion Benchmarks (Rust-native)
Four Criterion benchmark suites are configured in `Cargo.toml`:
| `extraction_bench` | Per-format extraction latency |
| `privacy_bench` | PII redaction throughput |
| `pipeline_bench` | End-to-end pipeline (extract → redact → enrich) |
| `parser_bench` | HTML parser microbenchmarks |
Run with: `cargo bench`
---
## 6. Test Coverage
```
cargo test → 192 tests (182 lib + 10 CLI integration)
All passing ✓
```
| `extract::captcha` | 22 (detection, stealth, audio, vision, strategy) |
| `extract::html` | Multi (GFM table, heading, code block, list fidelity) |
| `extract::crawl` | 3 (config, link extraction, frontmatter) |
| `extract::cookies` | 8 (parsing, domain, secure, path, expires, roundtrip) |
| `extract::ocr_cli` | 5 (table, heading, paragraph, single-line regression) |
| `privacy::pii` | 7 (email, IBAN, phone, CC, IP, selective, no-PII) |
| `shield` | 8 (size, bots, link-alternate, X-Markdown, CDN, extras, static headers) |
| `enrich::aeo` | 5 (header→question, header extraction, local summary) |
| `config` | 1 (defaults) |
| `cli_test.rs` | 10 (help, version, convert, redact, HTML, output file, errors) |
---
## 7. Feature Flags
| `dashboard` | ✅ | Axum web dashboard + `serve` command |
| `headless` | ❌ | Chromiumoxide headless browser pool |
| `ocr` | ❌ | Tesseract OCR (requires `libtesseract`) |
| `stt` | ❌ | Whisper.cpp speech-to-text (requires model file) |
| `vision` | ❌ | Vision AI extraction (Gemini, OpenAI, Claude, Mistral) |
---
*End of PROJECT_STATE.md*