Crate pdfmuse_core

pdfmuse-core — deterministic PDF/DOCX parser core.

The naive parse() lands in PER-33, and the self-written content-stream interpreter (the real value) lands in PER-36. The unified IR, the data foundation that every binding serializes byte-identically, lives in ir.

Re-exports§

pub use error::PdfmuseError;
pub use error::Result;

Modules§

backend
Pluggable vision backend — the ML boundary.
error
Structured error type.
ir
Unified intermediate representation (IR).

Structs§

Chunk
A retrieval unit: a block’s text plus the context needed to cite it.

Enums§

Format
Source-format hint for parse.

Functions§

chunk
Split doc into chunks (one per non-empty block), tracking heading context.
parse
Parse data into the unified ir::Document.
parse_with_password
Like parse, but supplies a password for encrypted PDFs.
to_json
Serialize the entire Document to pretty-printed JSON.
to_markdown
Render doc to GitHub-flavored Markdown, pages and blocks in order.
to_text
Render doc to plain reading-order text: no Markdown syntax, just the block text joined by newlines. This is the cheapest useful output for search, ATS, or feeding an LLM, and (via the bindings) it avoids materializing the full IR on the host.
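
To make the descriptions of chunk and to_text concrete, here is a minimal, self-contained sketch of the two behaviours as worded above: one chunk per non-empty block with the most recent heading carried along, and plain text as block text joined by newlines. It does not use the crate. The Block, Chunk, and Document types, their fields, and the sample_doc helper are hypothetical stand-ins invented for illustration, and the real definitions in ir may differ. Treating a heading block as its own chunk whose context is itself is also an assumption.

```rust
/// Hypothetical stand-in for an IR block; not the crate's actual type.
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Heading(String),
    Paragraph(String),
}

/// Hypothetical stand-in for `Chunk`: block text plus citation context.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    text: String,
    heading: Option<String>, // most recent heading seen, if any
    page: usize,             // 1-based page number (assumed convention)
}

/// Hypothetical document shape: pages in order, each a list of blocks.
struct Document {
    pages: Vec<Vec<Block>>,
}

/// One chunk per non-empty block, tracking the current heading.
fn chunk(doc: &Document) -> Vec<Chunk> {
    let mut current_heading: Option<String> = None;
    let mut out = Vec::new();
    for (page_idx, blocks) in doc.pages.iter().enumerate() {
        for block in blocks {
            let text = match block {
                Block::Heading(t) | Block::Paragraph(t) => t.trim(),
            };
            if text.is_empty() {
                continue; // whitespace-only blocks produce no chunk
            }
            if let Block::Heading(_) = block {
                current_heading = Some(text.to_string());
            }
            out.push(Chunk {
                text: text.to_string(),
                heading: current_heading.clone(),
                page: page_idx + 1,
            });
        }
    }
    out
}

/// Reading-order plain text: non-empty block text joined by newlines.
fn to_text(doc: &Document) -> String {
    doc.pages
        .iter()
        .flatten()
        .map(|b| match b {
            Block::Heading(t) | Block::Paragraph(t) => t.trim(),
        })
        .filter(|t| !t.is_empty())
        .collect::<Vec<_>>()
        .join("\n")
}

/// Small two-page document used for the demonstration.
fn sample_doc() -> Document {
    Document {
        pages: vec![
            vec![
                Block::Heading("Intro".into()),
                Block::Paragraph("Hello.".into()),
                Block::Paragraph("   ".into()),
            ],
            vec![
                Block::Paragraph("World.".into()),
                Block::Heading("Usage".into()),
                Block::Paragraph("Call parse.".into()),
            ],
        ],
    }
}

fn main() {
    let doc = sample_doc();
    for c in chunk(&doc) {
        println!("p{} [{:?}] {}", c.page, c.heading, c.text);
    }
    println!("{}", to_text(&doc));
}
```

In this sketch the heading context carries across page boundaries, so the first paragraph on page 2 is still cited under "Intro". Whether the crate behaves the same way is not stated on this page.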