spectre_parse — a lazy, read-only PDF parser.
Design — what "lazy" means here
[Document::open] does only these steps eagerly:
- Locate the %PDF-x.y header (≤ 1 KB scan).
- Locate startxref (≤ 4 KB scan from EOF).
- Parse the xref table (or xref stream) at the start offset, plus
  every preceding Prev xref the trailer chain points at.
- Parse the trailer dictionary.
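The two bounded scans in the list above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the helper names find_header and find_startxref are hypothetical, and the window sizes (1024 and 4096 bytes) are taken from the limits stated in the list.

```rust
/// Hypothetical helper: look for the `%PDF-` marker in the first 1 KB
/// and return the byte offset where it starts.
fn find_header(bytes: &[u8]) -> Option<usize> {
    let window = &bytes[..bytes.len().min(1024)];
    window.windows(5).position(|w| w == b"%PDF-")
}

/// Hypothetical helper: look for the last `startxref` keyword in the
/// final 4 KB and parse the decimal xref offset that follows it.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    let tail_start = bytes.len().saturating_sub(4096);
    let tail = &bytes[tail_start..];

    // Search backwards so the keyword closest to EOF wins; files with
    // incremental updates can contain several `startxref` keywords.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let rest = &tail[pos + KEYWORD.len()..];

    // Skip whitespace after the keyword, then collect the digits.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();

    // An empty digit run fails to parse and yields None.
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Both helpers touch only a fixed-size window, so their cost does not grow with file size. The offset returned by find_startxref is where xref parsing would begin.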
Steps 1–4 don't materialize a single body object. The returned
[Document] holds a borrow of the source bytes plus an indexed
BTreeMap<ObjectId, XrefEntry>. Object bodies are parsed on the
first call to [Document::get_object] (or anything that reaches
get_object transitively — page tree walks, font lookups, etc.)
and cached in an interior-mutable cell so subsequent reads are a
plain map lookup.
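The parse-on-first-access behaviour described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: LazyDoc stands in for Document, the XrefEntry fields (offset, len) are invented, the "parsed object" is just the raw bytes as a String, and a RefCell-wrapped map is one possible choice of interior-mutable cell, not necessarily the one the crate uses.

```rust
use std::cell::{Cell, RefCell};
use std::collections::BTreeMap;
use std::rc::Rc;

/// Assumed shape: (object number, generation).
type ObjectId = (u32, u16);

/// Hypothetical xref entry: where the object body sits in the source.
struct XrefEntry {
    offset: usize,
    len: usize,
}

/// Stand-in for `Document`: borrowed bytes, an xref index, and a cache.
struct LazyDoc<'a> {
    bytes: &'a [u8],
    xref: BTreeMap<ObjectId, XrefEntry>,
    cache: RefCell<BTreeMap<ObjectId, Rc<String>>>,
    parses: Cell<usize>, // counts real parses, for demonstration only
}

impl<'a> LazyDoc<'a> {
    fn new(bytes: &'a [u8], xref: BTreeMap<ObjectId, XrefEntry>) -> Self {
        LazyDoc {
            bytes,
            xref,
            cache: RefCell::new(BTreeMap::new()),
            parses: Cell::new(0),
        }
    }

    /// Returns the cached object if present; otherwise "parses" it
    /// (here: copies the raw bytes into a String) and caches the result.
    fn get_object(&self, id: ObjectId) -> Option<Rc<String>> {
        // The shared borrow ends with this statement, so the mutable
        // borrow further down cannot conflict with it.
        let cached = self.cache.borrow().get(&id).cloned();
        if let Some(hit) = cached {
            return Some(hit);
        }

        let entry = self.xref.get(&id)?;
        let end = entry.offset.checked_add(entry.len)?;
        let raw = self.bytes.get(entry.offset..end)?;
        let parsed = Rc::new(String::from_utf8_lossy(raw).into_owned());

        self.parses.set(self.parses.get() + 1);
        self.cache.borrow_mut().insert(id, Rc::clone(&parsed));
        Some(parsed)
    }

    fn parse_count(&self) -> usize {
        self.parses.get()
    }
}
```

Because get_object takes &self, callers such as page tree walks can share one document reference while the cache fills in behind it. A second call with the same id returns the same Rc and leaves parse_count unchanged.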
Scope (read-only)
spectre_parse reads PDFs and decrypts password-protected documents
(full PDF Standard Security Handler: RC4 40/128-bit, AES-128 CBC,
AES-256 R=5/R=6 with Algorithm 2.B — see [crate::decrypt]). It does
NOT write PDFs, edit objects, or run JavaScript in form widgets. It
exists to serve the read paths inside spectre_pdf (text extraction,
table detection, structural metadata) with the smallest possible
parse cost. The matching write/modify surface lives in
lopdf-fork/ for development-time use.