spectre_parse — a lazy, read-only PDF parser.
§Design — what “lazy” means here
Document::open does only these steps eagerly:
1. Locate the %PDF-x.y header (≤ 1 KB scan).
2. Locate startxref (≤ 4 KB scan from EOF).
3. Parse the xref table (or xref stream) at the start offset, plus every preceding Prev xref the trailer chain points at.
4. Parse the trailer dictionary.
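As an illustration of the bounded startxref scan, here is a minimal sketch. The helper name `find_startxref` and the 4096-byte constant are assumptions made for this example and are not taken from the crate's source; the real implementation may handle more whitespace forms and malformed tails.

```rust
/// Hypothetical helper (not part of spectre_parse): return the byte offset
/// written after the last `startxref` keyword, looking only at the final
/// 4 KB of the input.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    const TAIL: usize = 4096; // assumed scan window

    // Restrict the search to the tail of the file.
    let tail_start = bytes.len().saturating_sub(TAIL);
    let tail = &bytes[tail_start..];

    // Search backwards so the last `startxref` wins; incremental updates
    // append newer ones after older ones.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let rest = &tail[pos + KEYWORD.len()..];

    // Skip whitespace, then collect the decimal digits of the offset.
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();

    // An empty digit run fails to parse and yields None.
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

The scan cost is bounded by the window size rather than the file size, which is the point of limiting it to the tail.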
Steps 1–4 don’t materialize a single body object. The returned
Document holds a borrow of the source bytes plus an indexed
BTreeMap<ObjectId, XrefEntry>. Object bodies are parsed on the
first call to Document::get_object (or anything that reaches
get_object transitively — page tree walks, font lookups, etc.)
and cached in an interior-mutable cell so subsequent reads are a
plain map lookup.
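The parse-on-first-access pattern described above can be sketched roughly as follows. Everything here is a simplified stand-in: `LazyDoc`, the `(u32, u16)` object id, the `offset`/`len` fields, and the `String` "parsed object" are assumptions for illustration, not the crate's actual types.

```rust
use std::cell::{Cell, RefCell};
use std::collections::{BTreeMap, HashMap};

// Hypothetical stand-ins; the real ObjectId and XrefEntry are richer.
type ObjectId = (u32, u16);

#[derive(Debug, Clone, Copy)]
struct XrefEntry {
    offset: usize,
    len: usize,
}

struct LazyDoc<'a> {
    bytes: &'a [u8],                           // borrowed source buffer
    xref: BTreeMap<ObjectId, XrefEntry>,       // built eagerly at open time
    cache: RefCell<HashMap<ObjectId, String>>, // filled on first access
    parses: Cell<usize>,                       // demo-only parse counter
}

impl<'a> LazyDoc<'a> {
    fn new(bytes: &'a [u8], xref: BTreeMap<ObjectId, XrefEntry>) -> Self {
        LazyDoc {
            bytes,
            xref,
            cache: RefCell::new(HashMap::new()),
            parses: Cell::new(0),
        }
    }

    /// Takes `&self`: the cache is mutated through the RefCell.
    fn get_object(&self, id: ObjectId) -> Option<String> {
        // Fast path: already parsed, so this is a plain map lookup.
        if let Some(hit) = self.cache.borrow().get(&id) {
            return Some(hit.clone());
        }
        // Slow path: find the body via the xref index and "parse" it.
        let entry = self.xref.get(&id)?;
        let raw = self.bytes.get(entry.offset..entry.offset + entry.len)?;
        let parsed = String::from_utf8_lossy(raw).into_owned();
        self.parses.set(self.parses.get() + 1);
        self.cache.borrow_mut().insert(id, parsed.clone());
        Some(parsed)
    }
}
```

Because `get_object` takes `&self`, callers such as a page-tree walk can share one document reference while the cache fills in behind it. The trade-off of `RefCell` is that this sketch is single-threaded only.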
§Scope (read-only)
spectre_parse reads PDFs and decrypts password-protected documents
(full PDF Standard Security Handler: RC4 40/128-bit, AES-128 CBC,
AES-256 R=5/R=6 with Algorithm 2.B — see [crate::decrypt]). It does
NOT write PDFs, edit objects, or run JavaScript in form widgets. It
exists to serve the read paths inside spectre_pdf (text extraction,
table detection, structural metadata) with the smallest possible
parse cost. The matching write/modify surface lives in
lopdf-fork/ for development-time use.
Structs§
- Content
- Dictionary
  PDF dictionary <<...>>. Insertion-ordered (the PDF spec doesn’t require it, but several generators rely on it for /Type to come first); lookup is by byte-slice key.
- Document
  A parsed PDF document. Lazy: open does xref + trailer only.
- FormField
  One AcroForm field returned by Document::get_form_fields. Dot-separated partial names per PDF spec §12.7.4.2: a nested field “address” → “street” → “city” lands as address.street.city.
- Operation
  One content-stream instruction.
- PdfImageInfo
  Image inventory returned by Document::get_page_images.
- Stream
  PDF stream: a dictionary plus a (potentially compressed) content body. Content stays in the source buffer until first decode; the content field caches the decoded bytes after one [Document] call.
- TocEntry
  One outline (TOC) entry returned by Document::get_toc.
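To make the dot-separated naming of form fields concrete, the sketch below joins partial names down a field tree. `FieldNode` and `qualified_names` are hypothetical names invented for this example; the crate's FormField type and its traversal are not shown in this page and may differ.

```rust
/// Hypothetical minimal field node: a partial name plus child fields.
struct FieldNode {
    partial_name: String,
    kids: Vec<FieldNode>,
}

/// Collect the fully qualified name of every terminal (leaf) field,
/// joining partial names with '.' from the root down.
fn qualified_names(node: &FieldNode, prefix: &str, out: &mut Vec<String>) {
    let full = if prefix.is_empty() {
        node.partial_name.clone()
    } else {
        format!("{}.{}", prefix, node.partial_name)
    };
    if node.kids.is_empty() {
        // Leaf: this is an actual field, so record its full name.
        out.push(full);
    } else {
        // Interior node: descend, carrying the joined prefix.
        for kid in &node.kids {
            qualified_names(kid, &full, out);
        }
    }
}
```

With a tree of “address” containing “street” containing “city”, the only leaf is reported as address.street.city.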
Enums§
- Encoding
- Error
  Variants are fine-grained so callers can pattern-match, e.g. NoOutline downgrades to an empty TOC rather than surfacing.
- FormFieldType
- ImageEncoding
  Container hint for Document::get_image_bytes.
- Object
  Dictionaries use IndexMap to preserve insertion order; several PDF generators rely on first-key-wins for /Type lookups.
- ParseError
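One way a caller might use the fine-grained Error variants is sketched below. The enum and struct here are local stand-ins defined only for this example: the `Parse(String)` variant, the `title`/`page` fields, and the `toc_or_empty` helper are assumptions, and only the NoOutline variant name comes from the description above.

```rust
// Hypothetical stand-ins for the crate's Error and TocEntry types.
#[derive(Debug, PartialEq)]
enum Error {
    NoOutline,
    Parse(String), // assumed variant, for contrast only
}

#[derive(Debug, PartialEq)]
struct TocEntry {
    title: String,
    page: u32,
}

/// Treat "document has no outline" as an empty TOC, and pass every
/// other outcome (success or a different error) through unchanged.
fn toc_or_empty(result: Result<Vec<TocEntry>, Error>) -> Result<Vec<TocEntry>, Error> {
    match result {
        Err(Error::NoOutline) => Ok(Vec::new()),
        other => other,
    }
}
```

Matching on a single variant keeps genuine parse failures visible while letting the common "no outline" case degrade quietly.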
Functions§
- resolve_page_encodings
  Resolve every page font’s encoding once so a content-stream walk can look up by font resource name.