Document text extraction — the dbmd extract engine.
sources/ is where raw evidence lands: invoices, contracts, reports,
exports. Most of it arrives as binary documents (PDF, Word, Excel, EPUB) or
HTML, not markdown. Before an agent can reason over that evidence — wiki-link
it, summarize it into the wiki layer, file a typed record that cites it — the
text has to come out. This module is that step: a binary document in, plain
UTF-8 text out, format chosen by file extension.
§What this is, and is not
- Deterministic decoders only. Every adapter is a format parser (pdf-extract, calamine, html2text, quick-xml + zip). There is no AI, no OCR, no embeddings here, consistent with the crate-wide invariant (lib.rs). The agent driving dbmd is the semantic layer; this is plumbing.
- Text layer, not pixels. A scanned PDF with no text layer yields the empty string: empty in, empty out, never hallucinated text. OCR is an explicit non-goal (a future dbmd-ocr).
- Single document, single call. extract handles one file. Walking a store and extracting every document is the caller’s loop, not this module’s.
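The "caller's loop" from the last point can be sketched as follows. This is a minimal illustration, not the crate's code: `extract_all`, `Outcome`, and the `extract_one` closure are hypothetical names, and the closure stands in for the real extract call, whose exact signature is not shown on this page. The empty-string case is reported separately so that a scanned PDF with no text layer is not mistaken for a failure.

```rust
use std::path::{Path, PathBuf};

/// Outcome of one extraction attempt in the caller's loop (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Non-empty text came out.
    Text(String),
    /// The decoder succeeded but returned the empty string (e.g. a scanned PDF).
    NoTextLayer,
    /// The extractor refused the file; the message is kept for reporting.
    Refused(String),
}

/// Run a single-document extractor over many paths.
/// `extract_one` is a placeholder for the real `extract` call; its
/// `Result<String, String>` shape is an assumption made for this sketch.
pub fn extract_all<F>(paths: &[PathBuf], mut extract_one: F) -> Vec<(PathBuf, Outcome)>
where
    F: FnMut(&Path) -> Result<String, String>,
{
    paths
        .iter()
        .map(|p| {
            let outcome = match extract_one(p.as_path()) {
                Ok(text) if text.is_empty() => Outcome::NoTextLayer,
                Ok(text) => Outcome::Text(text),
                Err(msg) => Outcome::Refused(msg),
            };
            (p.clone(), outcome)
        })
        .collect()
}
```

A caller would collect the paths however it likes (for example with std::fs::read_dir) and pass a closure that wraps the real extractor. One refused file does not stop the rest of the batch.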
§Format dispatch
Format::from_path maps the file extension to an adapter; extract
dispatches:
| Extension | Format | Adapter |
|---|---|---|
| .pdf | Format::Pdf | pdf-extract |
| .docx | Format::Docx | zip + quick-xml (w:t runs) |
| .xlsx / .xlsm / .xlsb / .ods | Format::Spreadsheet | calamine |
| .epub | Format::Epub | zip + quick-xml + html2text |
| .html / .htm / .xhtml | Format::Html | html2text |
Anything else is ExtractError::UnsupportedFormat — a typed refusal the
CLI surfaces with a stable code, never a panic.
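The mapping in the table can be sketched as a plain match on the extension. This is an illustrative stand-in, not the crate's implementation: the free function `format_from_path`, the `String` payload on `UnsupportedFormat`, and the case-insensitive comparison are all assumptions. Only the variant names and the extension list come from the table above.

```rust
use std::path::Path;

/// Stand-in for the module's `Format` enum; variant names follow the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Spreadsheet,
    Epub,
    Html,
}

/// Stand-in for `ExtractError`; only the variant named in the text is modeled.
#[derive(Debug, PartialEq, Eq)]
pub enum ExtractError {
    /// Carries the offending extension, or "" when the path has none.
    /// The payload is an assumption of this sketch.
    UnsupportedFormat(String),
}

/// Map a path's extension to a format.
/// Lowercasing the extension before matching is an assumption.
pub fn format_from_path(path: &Path) -> Result<Format, ExtractError> {
    let ext = path
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();

    match ext.as_str() {
        "pdf" => Ok(Format::Pdf),
        "docx" => Ok(Format::Docx),
        "xlsx" | "xlsm" | "xlsb" | "ods" => Ok(Format::Spreadsheet),
        "epub" => Ok(Format::Epub),
        "html" | "htm" | "xhtml" => Ok(Format::Html),
        // Unknown or missing extension: a typed refusal, never a panic.
        _ => Err(ExtractError::UnsupportedFormat(ext)),
    }
}
```

Returning an error value for the fallthrough arm lets a CLI layer translate it into a stable code without catching panics.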
Structs§
- Extracted
- The result of extracting one document: the plain text plus a small, format-tagged metadata map.
Enums§
- ExtractError
- Errors from document extraction. Every variant is a typed refusal the CLI maps to a stable machine code; extraction never panics on a bad or encrypted input.
- Format
- The document formats dbmd extract understands, one per adapter. Detected from the file extension by Format::from_path.
- MetaValue
- A metadata value: a string (title, format tag, sheet name list joined) or a non-negative count (pages, sheets). Serializes to a bare JSON string or number, with no wrapper object, so {text, metadata} stays flat and readable.
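To illustrate what "bare JSON string or number" means for MetaValue, here is a standard-library-only sketch that renders a metadata map by hand. It is not how the crate serializes: the variant names `Str` and `Count`, the helper names, and the keys used in the usage note are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Stand-in for `MetaValue`: a string or a non-negative count.
/// Variant names are assumptions of this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum MetaValue {
    Str(String),
    Count(u64),
}

/// Quote and escape a string for JSON (quotes, backslashes, control chars).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl MetaValue {
    /// Render as a bare JSON value: no `{"Str": ...}` style wrapper object.
    pub fn to_json(&self) -> String {
        match self {
            MetaValue::Str(s) => json_string(s),
            MetaValue::Count(n) => n.to_string(),
        }
    }
}

/// Render a metadata map as a flat JSON object, keys in sorted order.
pub fn metadata_to_json(meta: &BTreeMap<String, MetaValue>) -> String {
    let fields: Vec<String> = meta
        .iter()
        .map(|(k, v)| format!("{}:{}", json_string(k), v.to_json()))
        .collect();
    format!("{{{}}}", fields.join(","))
}
```

With the illustrative keys "format" mapped to `MetaValue::Str("pdf")` and "pages" mapped to `MetaValue::Count(3)`, `metadata_to_json` returns `{"format":"pdf","pages":3}`.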
Functions§
- extract
- Extract plain text (and best-effort metadata) from a document, choosing the adapter by the file’s extension.
- normalize_text
- Canonicalize extracted text so output is stable across adapters:
Type Aliases§
- Result
- Result alias for extraction operations.