Module extract

Document text extraction — the dbmd extract engine.

sources/ is where raw evidence lands: invoices, contracts, reports, exports. Most of it arrives as binary documents (PDF, Word, Excel, EPUB) or HTML, not markdown. Before an agent can reason over that evidence — wiki-link it, summarize it into the wiki layer, file a typed record that cites it — the text has to come out. This module is that step: a binary document in, plain UTF-8 text out, format chosen by file extension.

§What this is, and is not

  • Deterministic decoders only. Every adapter is a format parser (pdf-extract, calamine, html2text, quick-xml+zip). There is no AI, no OCR, no embeddings here — consistent with the crate-wide invariant (lib.rs). The agent driving dbmd is the semantic layer; this is plumbing.
  • Text layer, not pixels. A scanned PDF with no text layer yields the empty string — empty in, empty out, never hallucinated text. OCR is an explicit non-goal (a future dbmd-ocr).
  • Single document, single call. extract handles one file. Walking a store and extracting every document is the caller’s loop, not this module’s.
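Since walking a store is left to the caller, that loop might look roughly like the sketch below. It uses only the standard library. `extract_store` and `collect_files` are hypothetical helper names, and the `extract_one` closure stands in for this module's `extract`, whose exact signature is not shown on this page.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every non-directory entry under `dir`.
/// Note: `is_dir` follows symlinks, so a symlink cycle would recurse forever.
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_files(&path, out)?;
        } else {
            out.push(path);
        }
    }
    Ok(())
}

/// Caller-side loop: one `extract_one` call per document.
/// `extract_one` is a hypothetical stand-in for the module's `extract`.
/// A failure on one file is recorded next to its path and does not stop the walk.
pub fn extract_store<T, E>(
    root: &Path,
    extract_one: impl Fn(&Path) -> Result<T, E>,
) -> io::Result<Vec<(PathBuf, Result<T, E>)>> {
    let mut files = Vec::new();
    collect_files(root, &mut files)?;
    // Sort so the result order does not depend on directory iteration order.
    files.sort();
    Ok(files
        .into_iter()
        .map(|path| {
            let result = extract_one(&path);
            (path, result)
        })
        .collect())
}
```

Keeping per-file results instead of returning on the first error means one unsupported or encrypted document does not hide the rest of the store from the caller.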

§Format dispatch

Format::from_path maps the file extension to an adapter; extract dispatches:

Extension                       Format               Adapter
.pdf                            Format::Pdf          pdf-extract
.docx                           Format::Docx         zip + quick-xml (w:t runs)
.xlsx / .xlsm / .xlsb / .ods    Format::Spreadsheet  calamine
.epub                           Format::Epub         zip + quick-xml + html2text
.html / .htm / .xhtml           Format::Html         html2text

Anything else is ExtractError::UnsupportedFormat — a typed refusal the CLI surfaces with a stable code, never a panic.
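The extension-to-format mapping can be sketched as a single `match`. This is an illustration and not the crate's source. The variant names come from the table, but the `from_path` signature, the payload on `UnsupportedFormat`, and the case-insensitive comparison are assumptions of this sketch.

```rust
use std::path::Path;

/// Formats from the dispatch table (variant names as documented).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Spreadsheet,
    Epub,
    Html,
}

/// Hypothetical stand-in for the crate's `ExtractError`; only the
/// unsupported-format case is modelled, carrying the offending extension.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    UnsupportedFormat(String),
}

impl Format {
    /// Assumed signature; the real `Format::from_path` may differ.
    pub fn from_path(path: &Path) -> Result<Format, ExtractError> {
        // Lowercasing the extension is an assumption of this sketch.
        // A missing or non-UTF-8 extension becomes the empty string.
        let ext = path
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .unwrap_or_default();
        match ext.as_str() {
            "pdf" => Ok(Format::Pdf),
            "docx" => Ok(Format::Docx),
            "xlsx" | "xlsm" | "xlsb" | "ods" => Ok(Format::Spreadsheet),
            "epub" => Ok(Format::Epub),
            "html" | "htm" | "xhtml" => Ok(Format::Html),
            // Typed refusal instead of a panic.
            _ => Err(ExtractError::UnsupportedFormat(ext)),
        }
    }
}
```

Returning an error value for unknown extensions lets a caller such as the CLI map it to a stable code without catching panics.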

Structs§

Extracted
The result of extracting one document: the plain text plus a small, format-tagged metadata map.

Enums§

ExtractError
Errors from document extraction. Every variant is a typed refusal the CLI maps to a stable machine code — extraction never panics on a bad or encrypted input.
Format
The document formats dbmd extract understands, one per adapter. Detected from the file extension by Format::from_path.
MetaValue
A metadata value: a string (title, format tag, joined list of sheet names) or a non-negative count (pages, sheets). Serializes to a bare JSON string or number, with no wrapper object, so {text, metadata} stays flat and readable.
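To make "bare JSON string or number" concrete, here is a standard-library-only sketch of that flat encoding. The variant names `Str` and `Count`, the helper functions, and the example keys are hypothetical. The real type may well rely on a serialization library instead of hand-written output.

```rust
use std::collections::BTreeMap;

/// Variant names are hypothetical; the docs only describe
/// "a string or a non-negative count".
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MetaValue {
    Str(String),
    Count(u64),
}

/// Minimal JSON string escaping: quotes, backslashes, control characters.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl MetaValue {
    /// Bare JSON value: no `{"Str": ...}` or `{"Count": ...}` wrapper.
    pub fn to_json(&self) -> String {
        match self {
            MetaValue::Str(s) => json_string(s),
            MetaValue::Count(n) => n.to_string(),
        }
    }
}

/// Render a metadata map as one flat JSON object, keys in sorted order.
pub fn metadata_to_json(meta: &BTreeMap<String, MetaValue>) -> String {
    let fields: Vec<String> = meta
        .iter()
        .map(|(key, value)| format!("{}:{}", json_string(key), value.to_json()))
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

With this encoding a consumer reads a page count as a plain number and a title as a plain string, without first unwrapping a tagged object.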

Functions§

extract
Extract plain text (and best-effort metadata) from a document, choosing the adapter by the file’s extension.
normalize_text
Canonicalize extracted text so output is stable across adapters:

Type Aliases§

Result
Result alias for extraction operations.