Crate spectre_parse


spectre_parse — a lazy, read-only PDF parser.

§Design — what “lazy” means here

Document::open does only these steps eagerly:

1. Locate the %PDF-x.y header (≤ 1 KB scan).
2. Locate startxref (≤ 4 KB scan from EOF).
3. Parse the xref table (or xref stream) at the startxref offset, plus every earlier xref section reachable through the Prev entries of the trailer chain.
4. Parse the trailer dictionary.
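
The two bounded scans in steps 1 and 2 can be sketched as follows. This is a minimal illustration that assumes the scan limits quoted above. The helper names find_header and find_startxref are hypothetical and are not part of the spectre_parse API.

```rust
/// Hypothetical helper: look for the `%PDF-` marker in the first 1 KB only.
/// Returns the byte offset of the marker, if any.
fn find_header(bytes: &[u8]) -> Option<usize> {
    let window = &bytes[..bytes.len().min(1024)];
    window.windows(5).position(|w| w == b"%PDF-")
}

/// Hypothetical helper: look for the last `startxref` keyword in the final
/// 4 KB and parse the decimal byte offset that follows it.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    let tail = &bytes[bytes.len().saturating_sub(4096)..];
    // Search backwards: incrementally updated files append newer sections,
    // so the occurrence closest to EOF is the one to use.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let digits: Vec<u8> = tail[pos + KEYWORD.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    // An empty digit run fails to parse and yields None.
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Both helpers only slice the input, so neither copies the document or touches object bodies.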

Steps 1–4 don’t materialize a single body object. The returned Document holds a borrow of the source bytes plus an indexed BTreeMap<ObjectId, XrefEntry>. Object bodies are parsed on the first call to Document::get_object (or anything that reaches get_object transitively — page tree walks, font lookups, etc.) and cached in an interior-mutable cell so subsequent reads are a plain map lookup.
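
The parse-on-first-access pattern described above might look roughly like this. The sketch assumes a RefCell-backed cache; the LazyDoc type, its fields, and the (start, end) byte range standing in for an XrefEntry are all hypothetical simplifications, not the crate's actual layout.

```rust
use std::cell::{Cell, RefCell};
use std::collections::BTreeMap;

/// (object_number, generation), mirroring the ObjectId alias.
type ObjectId = (u32, u16);

/// Hypothetical stand-in for Document. A real xref entry stores an offset,
/// not a byte range; the range keeps this sketch free of an object parser.
struct LazyDoc<'a> {
    source: &'a [u8],
    xref: BTreeMap<ObjectId, (usize, usize)>,
    cache: RefCell<BTreeMap<ObjectId, String>>,
    parse_count: Cell<usize>, // counts cache misses, for illustration only
}

impl<'a> LazyDoc<'a> {
    fn new(source: &'a [u8], xref: BTreeMap<ObjectId, (usize, usize)>) -> Self {
        LazyDoc {
            source,
            xref,
            cache: RefCell::new(BTreeMap::new()),
            parse_count: Cell::new(0),
        }
    }

    /// Takes `&self`: the cache is filled through interior mutability.
    fn get_object(&self, id: ObjectId) -> Option<String> {
        // Fast path: a plain map lookup once the object has been parsed.
        if let Some(hit) = self.cache.borrow().get(&id) {
            return Some(hit.clone());
        }
        // Slow path: "parse" the body from the borrowed source bytes.
        let &(start, end) = self.xref.get(&id)?;
        let body = std::str::from_utf8(self.source.get(start..end)?)
            .ok()?
            .to_owned();
        self.parse_count.set(self.parse_count.get() + 1);
        self.cache.borrow_mut().insert(id, body.clone());
        Some(body)
    }
}
```

Calling get_object twice with the same id runs the slow path once; the second call is served from the cache.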

§Scope (read-only)

spectre_parse reads PDFs and decrypts password-protected documents. It implements the full PDF Standard Security Handler: RC4 40/128-bit, AES-128 CBC, and AES-256 R=5/R=6 with Algorithm 2.B (see crate::decrypt). It does NOT write PDFs, edit objects, or run JavaScript in form widgets. It exists to serve the read paths inside spectre_pdf (text extraction, table detection, structural metadata) with the smallest possible parse cost. The matching write/modify surface lives in lopdf-fork/ for development-time use.

Structs§

Content
Dictionary: PDF dictionary <<...>>. Insertion-ordered (the PDF spec doesn’t require this, but several generators rely on /Type coming first); lookup is by byte-slice key.
Document: A parsed PDF document. Lazy: open does xref + trailer only.
FormField: One AcroForm field returned by Document::get_form_fields. Dot-separated partial names per PDF spec §12.7.4.2 — a nested field “address” → “street” → “city” lands as address.street.city.
Operation: One content-stream instruction.
PdfImageInfo: Image inventory returned by Document::get_page_images.
Stream: PDF stream: a dictionary plus a (potentially compressed) content body. Content stays in the source buffer until first decode; the content field caches the decoded bytes after the first Document call that decodes it.
TocEntry: One outline (TOC) entry returned by Document::get_toc.
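
The Dictionary behaviour listed above (insertion order preserved, lookup by byte-slice key) can be illustrated with a small stand-in. The OrderedDict type below is hypothetical and uses a Vec rather than the crate's actual storage. That set overwrites a repeated key in place is an assumption of this sketch.

```rust
/// Hypothetical stand-in for an insertion-ordered PDF dictionary.
/// Values are plain strings here to keep the sketch self-contained.
#[derive(Default)]
struct OrderedDict {
    entries: Vec<(Vec<u8>, String)>,
}

impl OrderedDict {
    /// Insert a key, or overwrite its value while keeping its original slot.
    fn set(&mut self, key: &[u8], value: &str) {
        match self.entries.iter_mut().find(|(k, _)| k.as_slice() == key) {
            Some((_, v)) => *v = value.to_owned(),
            None => self.entries.push((key.to_vec(), value.to_owned())),
        }
    }

    /// Look up by byte-slice key; no String allocation is needed.
    fn get(&self, key: &[u8]) -> Option<&str> {
        self.entries
            .iter()
            .find(|(k, _)| k.as_slice() == key)
            .map(|(_, v)| v.as_str())
    }

    /// Keys in the order they were first inserted.
    fn keys(&self) -> impl Iterator<Item = &[u8]> + '_ {
        self.entries.iter().map(|(k, _)| k.as_slice())
    }
}
```

A linear scan is adequate for a sketch because PDF dictionaries are usually small; a hash-indexed ordered map gives the same ordering with faster lookup.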

Enums§

Encoding
Error: Variants are fine-grained so callers can pattern-match; e.g. NoOutline can be downgraded to an empty TOC rather than surfaced as an error.
FormFieldType
ImageEncoding: Container hint for Document::get_image_bytes.
Object: Dictionaries use IndexMap to preserve insertion order — several PDF generators rely on first-key-wins for /Type lookups.
ParseError
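
The pattern-matching use case mentioned for Error could be sketched like this. Only the NoOutline variant name comes from the listing above. The Malformed variant, the TocEntry fields, and the toc_or_empty helper are hypothetical placeholders for illustration.

```rust
/// Hypothetical subset of the error enum; only NoOutline is documented.
#[derive(Debug, PartialEq)]
enum Error {
    NoOutline,
    Malformed(String), // placeholder for any other failure
}

/// Hypothetical shape of a TOC entry; the field names are assumptions.
#[derive(Debug, PartialEq)]
struct TocEntry {
    title: String,
    page: u32,
}

/// Treat "document has no outline" as an empty TOC and pass every other
/// result, success or failure, through unchanged.
fn toc_or_empty(result: Result<Vec<TocEntry>, Error>) -> Result<Vec<TocEntry>, Error> {
    match result {
        Err(Error::NoOutline) => Ok(Vec::new()),
        other => other,
    }
}
```

A caller would wrap the result of its TOC lookup in toc_or_empty and continue to use the ? operator for the errors that still matter.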

Functions§

resolve_page_encodings: Resolve every page font’s encoding once so a content-stream walk can look up by font resource name.

Type Aliases§

ObjectId: (object_number, generation). Most PDFs only use generation 0; the spec permits incremental updates to bump it.
Result