marque-extract
Document text and metadata extraction for marque.
Status: stub. The Kreuzberg dependency is not yet wired in Cargo.toml (see the TODO). The crate currently exposes the target API surface and returns ExtractError::UnsupportedFormat for anything beyond raw-text reads. The shape below is what marque-extract will deliver once integration lands; nothing here is production-usable yet.
The planned integration wraps Kreuzberg — a Rust-core, SIMD-optimized, streaming document extractor supporting 75+ formats with optional OCR for scanned documents. Its job is to produce the text stream the marque scanner consumes, plus a structured metadata report.
Role in Marque
The first stage of the marque pipeline:
Source → [marque-extract] → TextStream → Scanner → Parser → Rules → Diagnostics
Marque's rule engine operates on raw text. marque-extract is what turns a .docx, .pdf, image, or other supported format into the byte buffer the scanner reads. Metadata extraction runs in the same pass and is surfaced as MetadataWarning values — always reported; stripping is opt-in via ExtractionOptions::strip_metadata.
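The "always reported, stripping is opt-in" rule can be sketched as follows. This is a minimal illustration, not the crate's implementation: the field layout of MetadataWarning, the single-field ExtractionOptions, and the review_metadata helper are all assumptions made for the example.

```rust
/// Hypothetical shape of a metadata finding; the real `MetadataWarning` may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct MetadataWarning {
    pub field: String,
    pub value: String,
}

/// Hypothetical stand-in for the crate's `ExtractionOptions`,
/// reduced to the one flag this example needs.
#[derive(Debug, Clone, Copy, Default)]
pub struct ExtractionOptions {
    pub strip_metadata: bool,
}

/// Hypothetical helper: returns `(warnings, retained_fields)`.
/// Every field produces a warning regardless of options; fields are
/// dropped from the retained set only when `strip_metadata` is set.
pub fn review_metadata(
    fields: &[(String, String)],
    options: ExtractionOptions,
) -> (Vec<MetadataWarning>, Vec<(String, String)>) {
    let warnings: Vec<MetadataWarning> = fields
        .iter()
        .map(|(name, value)| MetadataWarning {
            field: name.clone(),
            value: value.clone(),
        })
        .collect();

    let retained = if options.strip_metadata {
        Vec::new()
    } else {
        fields.to_vec()
    };

    (warnings, retained)
}
```

With the default options the caller gets warnings and the metadata is left in place; setting strip_metadata to true yields the same warnings with an empty retained set.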
Not included in the WASM build. In the WASM context, the calling application is responsible for providing pre-extracted text directly to the engine. See marque-wasm.
Public API
| Type | Purpose |
|---|---|
| Extractor | Entry point: async extract for streaming, sync extract_bytes for in-memory. |
| ExtractionOptions | Per-call configuration: extract_metadata, strip_metadata, ocr. |
| ExtractedDocument | Output: text buffer + detected format + optional metadata report. |
| MetadataReport, MetadataField, MetadataWarning | Structured metadata findings. |
Usage
use ;
# async
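A minimal sketch of the synchronous in-memory path, mirroring the stub behaviour described in the status note. The types are local stand-ins defined for this example, not imports from marque-extract. The extract_bytes signature, the field names, the DetectedFormat enum, and the use of UTF-8 validity as the test for "raw text" are all assumptions. The async extract entry point is omitted because it would need an async runtime.

```rust
use std::fmt;

/// Hypothetical format tag; the real crate may model this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DetectedFormat {
    PlainText,
}

/// Stand-in error type with the one variant named in this README.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    UnsupportedFormat,
}

impl fmt::Display for ExtractError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExtractError::UnsupportedFormat => write!(f, "unsupported format"),
        }
    }
}

impl std::error::Error for ExtractError {}

/// Per-call configuration; field names follow the Public API table.
#[derive(Debug, Clone, Copy, Default)]
pub struct ExtractionOptions {
    pub extract_metadata: bool,
    pub strip_metadata: bool,
    pub ocr: bool,
}

/// Placeholder for the structured metadata report (contents assumed).
#[derive(Debug, Default)]
pub struct MetadataReport;

/// Output: text buffer + detected format + optional metadata report.
#[derive(Debug)]
pub struct ExtractedDocument {
    pub text: String,
    pub format: DetectedFormat,
    pub metadata: Option<MetadataReport>,
}

#[derive(Debug, Default)]
pub struct Extractor;

impl Extractor {
    /// Stub behaviour: valid UTF-8 is passed through as raw text;
    /// everything else is rejected as unsupported.
    pub fn extract_bytes(
        &self,
        bytes: &[u8],
        _options: &ExtractionOptions,
    ) -> Result<ExtractedDocument, ExtractError> {
        match std::str::from_utf8(bytes) {
            Ok(text) => Ok(ExtractedDocument {
                text: text.to_owned(),
                format: DetectedFormat::PlainText,
                metadata: None,
            }),
            Err(_) => Err(ExtractError::UnsupportedFormat),
        }
    }
}
```

Under these assumptions, Extractor::default().extract_bytes(b"plain text", &ExtractionOptions::default()) succeeds with the text passed through, while a binary payload such as a JPEG header fails with ExtractError::UnsupportedFormat.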
Features
| Feature | Effect |
|---|---|
| ocr | Enables OCR backends via Kreuzberg for scanned-document support. |
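One common way to gate behaviour on a Cargo feature is the cfg! macro. The sketch below assumes a feature named ocr is declared in the crate's Cargo.toml; the OcrError type and both helper functions are hypothetical and only illustrate the pattern, not how marque-extract actually gates its OCR path.

```rust
/// Hypothetical error for requesting OCR in a build without the feature.
#[derive(Debug, PartialEq, Eq)]
pub enum OcrError {
    FeatureDisabled,
}

/// True only when the crate was compiled with `--features ocr`.
pub fn ocr_enabled() -> bool {
    cfg!(feature = "ocr")
}

/// Hypothetical guard a caller could run before honouring an `ocr: true` option.
pub fn require_ocr() -> Result<(), OcrError> {
    if ocr_enabled() {
        Ok(())
    } else {
        Err(OcrError::FeatureDisabled)
    }
}
```

Checking the feature at the call boundary lets a build without OCR fail early with a clear error instead of silently returning empty text for scanned documents.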
WASM Compatibility
Not WASM-compatible. WASM builds must perform extraction in the host environment (browser, Node, worker) and pass text into marque-wasm directly.
License
Marque License 1.0 (LicenseRef-MarqueLicense-1.0). See LICENSE.md.