pdf_core

Low-level PDF parser foundation for semantic PDF diff and PDF comparison.

pdf_core parses enough of the PDF object graph to support the semantic-pdf-diff pipeline without depending on third-party PDF parser or renderer libraries. It is aimed at evidence-preserving comparison of digitally generated PDFs, where partial results and stable diagnostics are better than silently ignoring unsupported features.

What This Crate Provides

PDF header and primitive object parsing.
Indirect object and stream object parsing with byte-range provenance.
Classic xref table and trailer parsing.
Controlled /Type /XRef xref stream support.
Controlled /Type /ObjStm object stream extraction through ObjectStore.
Stream decoding for no-filter, FlateDecode, ASCIIHexDecode, and RunLengthDecode, including ordered chains of those supported filters with paired /DecodeParms metadata preserved.
Catalog /Pages traversal with ordered /Kids, inherited /Resources, /MediaBox, /CropBox, and /Rotate.
Page content stream resolution for single /Contents streams and ordered /Contents [...] arrays.
Simple /StructTreeRoot, /RoleMap, and parent-tree summaries with structure type names, mapped role names, and MCID references.
Incremental-update metadata for repeated startxref sections, the selected latest xref offset, prior xref offsets, and trailer /Prev offsets.
Resource-limit enforcement through spdfdiff_types::ResourceLimits.
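
The ordered filter chains mentioned above can be illustrated with a small standalone sketch. It does not use pdf_core's API: the names DecodeError, ascii_hex_decode, run_length_decode, and decode_chain are hypothetical and exist only for this example. The decoding rules follow the PDF specification's descriptions of ASCIIHexDecode and RunLengthDecode, and /DecodeParms handling is left out.

```rust
/// Hypothetical error type for this sketch; not pdf_core's actual error type.
#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidHexDigit(u8),
    TruncatedRun,
    UnsupportedFilter(String),
}

/// ASCIIHexDecode: pairs of hex digits, whitespace ignored, `>` marks end of data.
/// A trailing odd digit is treated as if it were followed by `0`.
fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, DecodeError> {
    let mut out = Vec::new();
    let mut high: Option<u8> = None;
    for &b in input {
        if b == b'>' {
            break;
        }
        // PDF whitespace also includes NUL, which is_ascii_whitespace does not cover.
        if b == 0 || b.is_ascii_whitespace() {
            continue;
        }
        let nibble = match b {
            b'0'..=b'9' => b - b'0',
            b'a'..=b'f' => b - b'a' + 10,
            b'A'..=b'F' => b - b'A' + 10,
            other => return Err(DecodeError::InvalidHexDigit(other)),
        };
        match high.take() {
            Some(h) => out.push((h << 4) | nibble),
            None => high = Some(nibble),
        }
    }
    if let Some(h) = high {
        out.push(h << 4);
    }
    Ok(out)
}

/// RunLengthDecode: a length byte of 0..=127 copies the next len+1 bytes,
/// 129..=255 repeats the next byte 257-len times, and 128 marks end of data.
fn run_length_decode(input: &[u8]) -> Result<Vec<u8>, DecodeError> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let len = input[i];
        i += 1;
        match len {
            128 => break,
            0..=127 => {
                let end = i + len as usize + 1;
                let chunk = input.get(i..end).ok_or(DecodeError::TruncatedRun)?;
                out.extend_from_slice(chunk);
                i = end;
            }
            129..=255 => {
                let byte = *input.get(i).ok_or(DecodeError::TruncatedRun)?;
                out.extend(std::iter::repeat(byte).take(257 - len as usize));
                i += 1;
            }
        }
    }
    Ok(out)
}

/// Applies the filters in the order they are listed, as a /Filter array requires.
fn decode_chain(filters: &[&str], raw: &[u8]) -> Result<Vec<u8>, DecodeError> {
    let mut data = raw.to_vec();
    for name in filters {
        data = match *name {
            "ASCIIHexDecode" => ascii_hex_decode(&data)?,
            "RunLengthDecode" => run_length_decode(&data)?,
            other => return Err(DecodeError::UnsupportedFilter(other.to_string())),
        };
    }
    Ok(data)
}
```

In this sketch, decoding `02 41 42 43 FE 5A 80>` through ASCIIHexDecode and then RunLengthDecode yields the bytes `ABCZZZ`: a literal run of three bytes followed by a three-fold repeat of `Z`.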

Pipeline Context

pdf_core is the first stage of the workspace pipeline:

PDF bytes -> pdf_core object graph/pages/streams -> pdf_content operators

It intentionally does not perform semantic text comparison. Downstream crates consume its page content streams, object provenance, tagged-structure summaries, and diagnostics.

Diagnostics Instead Of Hidden Failure

Unsupported filters, failed stream decodes, malformed object streams, encrypted PDFs, damaged xrefs, and resource-limit violations produce stable diagnostics or typed errors. Raw stream bytes are preserved when possible so later tooling can still report partial evidence.
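
As a rough illustration of the pattern described above, the following standalone sketch keeps the raw stream bytes next to an optional decoded form and a list of diagnostics. Diagnostic, StreamOutcome, decode_or_diagnose, hex_decode, and the diagnostic code strings are all hypothetical names chosen for this example; they are not pdf_core's types or codes.

```rust
/// Hypothetical diagnostic record; the `code` is meant to stay stable across runs.
#[derive(Debug, Clone, PartialEq)]
struct Diagnostic {
    code: &'static str,
    message: String,
}

/// Hypothetical result type: raw bytes are always kept, decoding is optional.
#[derive(Debug)]
struct StreamOutcome {
    raw: Vec<u8>,
    decoded: Option<Vec<u8>>,
    diagnostics: Vec<Diagnostic>,
}

/// Minimal ASCIIHexDecode used only to have one filter that can fail.
fn hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut digits = Vec::new();
    for &b in input {
        if b == b'>' {
            break;
        }
        if b.is_ascii_whitespace() {
            continue;
        }
        let d = (b as char)
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex digit 0x{:02X}", b))?;
        digits.push(d as u8);
    }
    // A trailing odd digit is treated as if followed by 0.
    if digits.len() % 2 == 1 {
        digits.push(0);
    }
    Ok(digits.chunks(2).map(|p| (p[0] << 4) | p[1]).collect())
}

/// Decodes when possible; otherwise records a diagnostic and keeps the raw bytes.
fn decode_or_diagnose(filter: Option<&str>, raw: &[u8]) -> StreamOutcome {
    let mut diagnostics = Vec::new();
    let decoded = match filter {
        None => Some(raw.to_vec()),
        Some("ASCIIHexDecode") => match hex_decode(raw) {
            Ok(bytes) => Some(bytes),
            Err(message) => {
                diagnostics.push(Diagnostic { code: "stream-decode-failed", message });
                None
            }
        },
        Some(other) => {
            diagnostics.push(Diagnostic {
                code: "unsupported-filter",
                message: format!("filter /{} is not supported", other),
            });
            None
        }
    };
    StreamOutcome { raw: raw.to_vec(), decoded, diagnostics }
}
```

The point of the design is that a failed or unsupported decode is never silent: the caller sees a diagnostic and can still inspect or report on the undecoded bytes.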

Current Compatibility Boundary

This is a compatibility-gate parser foundation, not a claim of broad PDF renderer compatibility. It currently focuses on parser constructs needed by the semantic diff CLI and sample corpus. Native rendering, visual diffing, full annotation semantics, JavaScript actions, and arbitrary damaged-PDF recovery are outside this crate's current scope.

Use ParseConfig and ResourceLimits when parsing untrusted PDFs.
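
The fields and constructors of ParseConfig and ResourceLimits are not shown here, so the sketch below does not use them. It only illustrates the general idea of checking declared sizes against explicit limits before doing any work on untrusted input. SketchLimits, LimitError, check_stream_length, and check_object_count are hypothetical names for this example.

```rust
/// Hypothetical limits for this sketch; not the fields of spdfdiff_types::ResourceLimits.
#[derive(Debug, Clone, Copy)]
struct SketchLimits {
    max_stream_bytes: usize,
    max_objects: usize,
}

/// Hypothetical typed error carrying both the offending value and the limit.
#[derive(Debug, PartialEq)]
enum LimitError {
    StreamTooLarge { declared: usize, limit: usize },
    TooManyObjects { count: usize, limit: usize },
}

/// Rejects a stream whose declared /Length exceeds the limit, before any allocation.
fn check_stream_length(declared: usize, limits: &SketchLimits) -> Result<usize, LimitError> {
    if declared > limits.max_stream_bytes {
        return Err(LimitError::StreamTooLarge {
            declared,
            limit: limits.max_stream_bytes,
        });
    }
    Ok(declared)
}

/// Rejects an xref section that declares more objects than the limit allows.
fn check_object_count(count: usize, limits: &SketchLimits) -> Result<usize, LimitError> {
    if count > limits.max_objects {
        return Err(LimitError::TooManyObjects {
            count,
            limit: limits.max_objects,
        });
    }
    Ok(count)
}
```

Returning the declared value and the limit in the error keeps the failure reportable as evidence rather than as an opaque rejection.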

pdf_core 0.1.10
