spectre_parse 1.0.0

Lazy PDF parser — xref-only at open(), objects materialize on demand. Read-only. Powers the spectre_pdf extraction crate.
Documentation

spectre_parse — a lazy, read-only PDF parser.

Design — what "lazy" means here

[Document::open] does only these steps eagerly:

  1. Locate %PDF-x.y header (≤ 1 KB scan).
  2. Locate startxref (≤ 4 KB scan from EOF).
  3. Parse the xref table (or xref stream) at the offset startxref points to, plus every earlier xref section reachable through the trailer chain's Prev entries.
  4. Parse the trailer dictionary.
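
Steps 1 and 2 are bounded byte scans, which is part of why open() stays cheap. The following is a minimal sketch of the two scans, not the crate's actual code: the helper names (find, rfind, find_header, find_startxref) are hypothetical, and the 1 KB / 4 KB windows are taken from the list above.

```rust
/// Hypothetical helper: index of the first occurrence of `needle`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Hypothetical helper: index of the last occurrence of `needle`.
fn rfind(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).rposition(|w| w == needle)
}

/// Step 1 (sketch): look for `%PDF-x.y` in the first 1 KB only.
/// Returns the (major, minor) version digits.
fn find_header(src: &[u8]) -> Option<(u8, u8)> {
    let head = &src[..src.len().min(1024)];
    let at = find(head, b"%PDF-")?;
    // Expect exactly "x.y" with single ASCII digits after the marker.
    match head.get(at + 5..at + 8)? {
        [maj, b'.', min] if maj.is_ascii_digit() && min.is_ascii_digit() => {
            Some((*maj - b'0', *min - b'0'))
        }
        _ => None,
    }
}

/// Step 2 (sketch): look for the last `startxref` in the final 4 KB
/// and parse the decimal byte offset that follows it.
fn find_startxref(src: &[u8]) -> Option<usize> {
    let tail = &src[src.len().saturating_sub(4096)..];
    let at = rfind(tail, b"startxref")?;
    let rest = &tail[at + b"startxref".len()..];
    // Skip whitespace after the keyword, then collect the digits.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse::<usize>().ok()
}
```

Neither function touches bytes outside its window, so the cost of these two steps does not grow with file size. Taking the last startxref in the tail matters for incrementally updated files, where earlier revisions leave their own startxref keywords behind.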

Steps 1–4 materialize no body objects. The returned [Document] holds a borrow of the source bytes plus a BTreeMap<ObjectId, XrefEntry> index. An object body is parsed on the first call to [Document::get_object] (or to anything that reaches get_object transitively, such as page tree walks and font lookups) and cached in an interior-mutable cell, so subsequent reads are a plain map lookup.
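
The parse-on-first-access pattern described above can be sketched with a RefCell-backed cache. This is an illustration under assumptions, not the crate's implementation: LazyDoc, Object, the ObjectId tuple alias, the XrefEntry layout, and the parses counter are all hypothetical stand-ins, and "parsing" here just slices the text between obj and endobj.

```rust
use std::cell::{Cell, RefCell};
use std::collections::BTreeMap;
use std::rc::Rc;

// Assumed shapes; the real ObjectId and XrefEntry types may differ.
type ObjectId = (u32, u16); // (object number, generation)
struct XrefEntry {
    offset: usize, // byte offset of "N G obj" in the source
}

/// Stand-in for a parsed object: just the raw body text.
struct Object {
    body: String,
}

struct LazyDoc<'a> {
    src: &'a [u8],
    xref: BTreeMap<ObjectId, XrefEntry>,
    cache: RefCell<BTreeMap<ObjectId, Rc<Object>>>,
    parses: Cell<usize>, // counts real parses, for demonstration only
}

fn find_sub(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

impl<'a> LazyDoc<'a> {
    fn new(src: &'a [u8], xref: BTreeMap<ObjectId, XrefEntry>) -> Self {
        LazyDoc { src, xref, cache: RefCell::new(BTreeMap::new()), parses: Cell::new(0) }
    }

    /// Takes `&self`: the cache is filled through interior mutability.
    fn get_object(&self, id: ObjectId) -> Option<Rc<Object>> {
        // Fast path: already materialized, so this is a plain map lookup.
        if let Some(hit) = self.cache.borrow().get(&id) {
            return Some(Rc::clone(hit));
        }
        // Slow path: parse at the xref offset, then remember the result.
        let entry = self.xref.get(&id)?;
        let obj = Rc::new(self.parse_at(entry.offset)?);
        self.cache.borrow_mut().insert(id, Rc::clone(&obj));
        Some(obj)
    }

    /// Toy parser: returns the trimmed text between `obj` and `endobj`.
    fn parse_at(&self, offset: usize) -> Option<Object> {
        self.parses.set(self.parses.get() + 1);
        let rest = self.src.get(offset..)?;
        let start = find_sub(rest, b"obj")? + 3;
        let end = start + find_sub(&rest[start..], b"endobj")?;
        let body = std::str::from_utf8(&rest[start..end]).ok()?.trim().to_owned();
        Some(Object { body })
    }
}
```

Calling get_object twice with the same id parses once and hands back the same Rc both times. Handing out Rc clones, rather than references into the RefCell, is one way to keep the cache free to grow while callers still hold earlier results; the crate may well use a different cell type or ownership scheme.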

Scope (read-only)

spectre_parse reads PDFs and decrypts password-protected documents. It implements the full PDF Standard Security Handler: RC4 40/128-bit, AES-128 CBC, and AES-256 R=5/R=6 with Algorithm 2.B (see [crate::decrypt]). It does NOT write PDFs, edit objects, or run JavaScript in form widgets. It exists to serve the read paths inside spectre_pdf (text extraction, table detection, structural metadata) with the smallest possible parse cost. The matching write/modify surface lives in lopdf-fork/ for development-time use.