Module extract

Document text extraction — the dbmd extract engine.

sources/ is where raw evidence lands: invoices, contracts, reports, exports. Most of it arrives as binary documents (PDF, Word, Excel, EPUB) or HTML, not markdown. Before an agent can reason over that evidence — wiki-link it, summarize it into the wiki layer, file a typed record that cites it — the text has to come out. This module is that step: a binary document in, plain UTF-8 text out, format chosen by file extension.

§What this is, and is not

Deterministic decoders only. Every adapter is a format parser (pdf-extract, calamine, html2text, quick-xml+zip). There is no AI, no OCR, no embeddings here — consistent with the crate-wide invariant (lib.rs). The agent driving dbmd is the semantic layer; this is plumbing.
Text layer, not pixels. A scanned PDF with no text layer yields the empty string — empty in, empty out, never hallucinated text. OCR is an explicit non-goal (a future dbmd-ocr).
Single document, single call. extract handles one file. Walking a store and extracting every document is the caller’s loop, not this module’s.
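
The "single document, single call" rule leaves the store walk to the caller. A minimal sketch of such a loop follows. Since this page does not show the signature of extract, the sketch takes a hypothetical closure `extract_one` as a stand-in; `Outcome` and `extract_each` are illustrative names, not part of the crate. It also flags empty output separately, on the assumption that a caller wants to notice documents with no text layer.

```rust
use std::path::{Path, PathBuf};

/// Outcome of one extraction attempt in the caller's loop (illustrative type).
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Non-empty text came out of the document.
    Text(String),
    /// The adapter succeeded but produced the empty string,
    /// e.g. a scanned PDF with no text layer.
    EmptyText,
    /// Extraction was refused; the message stands in for a typed error.
    Failed(String),
}

/// Call `extract_one` once per path and collect the outcomes.
/// `extract_one` is a hypothetical stand-in for the module's `extract`.
pub fn extract_each<F>(paths: &[PathBuf], mut extract_one: F) -> Vec<(PathBuf, Outcome)>
where
    F: FnMut(&Path) -> Result<String, String>,
{
    paths
        .iter()
        .map(|p| {
            let outcome = match extract_one(p) {
                Ok(text) if text.is_empty() => Outcome::EmptyText,
                Ok(text) => Outcome::Text(text),
                Err(e) => Outcome::Failed(e),
            };
            (p.clone(), outcome)
        })
        .collect()
}
```

One failed document does not abort the walk here; whether that is the right policy is the caller's decision, which is why the loop lives outside this module.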

§Format dispatch

Format::from_path maps the file extension to an adapter; extract dispatches:

Extension	Format	Adapter
`.pdf`	`Format::Pdf`	`pdf-extract`
`.docx`	`Format::Docx`	`zip` + `quick-xml` (`w:t` runs)
`.xlsx` / `.xlsm` / `.xlsb` / `.ods`	`Format::Spreadsheet`	`calamine`
`.epub`	`Format::Epub`	`zip` + `quick-xml` + `html2text`
`.html` / `.htm` / `.xhtml`	`Format::Html`	`html2text`

Anything else is ExtractError::UnsupportedFormat — a typed refusal the CLI surfaces with a stable code, never a panic.
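
The dispatch table can be read as a single match on the extension. The sketch below is an illustrative reimplementation, not the crate's source: the return type of `from_path`, the `String` payload on `UnsupportedFormat`, and the case-insensitive comparison are all assumptions.

```rust
use std::path::Path;

/// Formats from the dispatch table, one per adapter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Spreadsheet,
    Epub,
    Html,
}

/// Illustrative error type; the real `ExtractError` may carry other variants.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    /// Holds the rejected extension (assumed payload; empty if there was none).
    UnsupportedFormat(String),
}

impl Format {
    /// Map a path's extension to a format, refusing anything not in the table.
    pub fn from_path(path: &Path) -> Result<Format, ExtractError> {
        // Assumption: extensions are compared case-insensitively.
        let ext = path
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .unwrap_or_default();
        match ext.as_str() {
            "pdf" => Ok(Format::Pdf),
            "docx" => Ok(Format::Docx),
            "xlsx" | "xlsm" | "xlsb" | "ods" => Ok(Format::Spreadsheet),
            "epub" => Ok(Format::Epub),
            "html" | "htm" | "xhtml" => Ok(Format::Html),
            other => Err(ExtractError::UnsupportedFormat(other.to_string())),
        }
    }
}
```

Returning an error value for an unknown extension, rather than panicking, is what lets a CLI translate the refusal into a stable code.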

Structs§

Extracted: The result of extracting one document: the plain text plus a small, format-tagged metadata map.

Enums§

ExtractError: Errors from document extraction. Every variant is a typed refusal the CLI maps to a stable machine code — extraction never panics on a bad or encrypted input.
Format: The document formats dbmd extract understands, one per adapter. Detected from the file extension by Format::from_path.
MetaValue: A metadata value: a string (title, format tag, sheet name list joined) or a non-negative count (pages, sheets). Serializes to a bare JSON string or number — no wrapper object — so {text, metadata} stays flat and readable.
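
To illustrate what "a bare JSON string or number" means for MetaValue, here is a standard-library-only sketch. The variant names `Text` and `Count`, the helper `metadata_to_json`, and the example keys are hypothetical; the crate may well use a serialization library instead of hand-written output.

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for the crate's `MetaValue` (variant names assumed).
pub enum MetaValue {
    /// A string such as a title or format tag.
    Text(String),
    /// A non-negative count such as pages or sheets.
    Count(u64),
}

/// Quote and escape a string as a JSON string literal.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl MetaValue {
    /// Render as a bare JSON value: no wrapper object, no variant tag.
    pub fn to_json(&self) -> String {
        match self {
            MetaValue::Text(s) => json_string(s),
            MetaValue::Count(n) => n.to_string(),
        }
    }
}

/// Render a metadata map as one flat JSON object, keys in sorted order.
pub fn metadata_to_json(meta: &BTreeMap<String, MetaValue>) -> String {
    let fields: Vec<String> = meta
        .iter()
        .map(|(k, v)| format!("{}:{}", json_string(k), v.to_json()))
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

A tagged encoding would nest each value under its variant name; the flat form keeps a count readable as a plain number next to its key.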

Functions§

extract: Extract plain text (and best-effort metadata) from a document, choosing the adapter by the file’s extension.
normalize_text: Canonicalize extracted text so output is stable across adapters.

Type Aliases§

Result: Result alias for extraction operations.
