Crate spectre_parse

spectre_parse — a lazy, read-only PDF parser.

§Design — what “lazy” means here

Document::open does only these steps eagerly:

  1. Locate %PDF-x.y header (≤ 1 KB scan).
  2. Locate startxref (≤ 4 KB scan from EOF).
  3. Parse the xref table (or xref stream) at the startxref offset, plus every earlier xref section reached by following the /Prev chain in the trailers.
  4. Parse the trailer dictionary.
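
Steps 1 and 2 can be sketched as two bounded scans. This is a minimal illustration, not the crate's implementation: the function names find_header and find_startxref are hypothetical, and the window sizes are taken from the limits quoted in the list above.

```rust
/// Hypothetical helper: scan the first 1 KB for a `%PDF-x.y` header and
/// return (major, minor). Only single-digit versions are handled.
fn find_header(bytes: &[u8]) -> Option<(u8, u8)> {
    let window = &bytes[..bytes.len().min(1024)];
    let pos = window.windows(5).position(|w| w == b"%PDF-")?;
    // The three bytes after the marker should look like "x.y".
    match window.get(pos + 5..pos + 8)? {
        [major, b'.', minor] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((*major - b'0', *minor - b'0'))
        }
        _ => None,
    }
}

/// Hypothetical helper: scan the last 4 KB for the final `startxref`
/// keyword and parse the decimal byte offset that follows it.
fn find_startxref(bytes: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    let tail = &bytes[bytes.len().saturating_sub(4096)..];
    // Search from the end so the most recent incremental update wins.
    let pos = tail.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let digits: Vec<u8> = tail[pos + KEYWORD.len()..]
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Both scans touch a fixed number of bytes regardless of file size, which is what keeps the eager part of opening a document cheap.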

Steps 1–4 do not materialize a single body object. The returned Document holds a borrow of the source bytes plus a BTreeMap<ObjectId, XrefEntry> index. Object bodies are parsed on the first call to Document::get_object (or anything that reaches get_object transitively, such as page tree walks and font lookups) and cached in an interior-mutable cell, so subsequent reads are a plain map lookup.
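
The parse-on-first-access pattern can be sketched with a RefCell-backed cache behind a shared reference. Everything here is a stand-in chosen for illustration: LazyDoc, parse_body, the String object type, the (u32, u16) shape of ObjectId, and the offset field on XrefEntry are assumptions, not the crate's actual definitions.

```rust
use std::cell::{Cell, RefCell};
use std::collections::BTreeMap;

type ObjectId = (u32, u16); // assumed shape: (object_number, generation)

/// Stand-in xref entry: just the byte offset of the object body.
struct XrefEntry {
    offset: usize,
}

/// Stand-in document: borrows the source and parses bodies on demand.
struct LazyDoc<'a> {
    source: &'a [u8],
    xref: BTreeMap<ObjectId, XrefEntry>,
    cache: RefCell<BTreeMap<ObjectId, String>>,
    parses: Cell<usize>, // counts real parses, for demonstration only
}

/// Toy "parser": the body is the text from `offset` up to the next newline.
fn parse_body(source: &[u8], offset: usize) -> Option<String> {
    let rest = source.get(offset..)?;
    let end = rest.iter().position(|&b| b == b'\n').unwrap_or(rest.len());
    String::from_utf8(rest[..end].to_vec()).ok()
}

impl<'a> LazyDoc<'a> {
    fn new(source: &'a [u8], xref: BTreeMap<ObjectId, XrefEntry>) -> Self {
        LazyDoc { source, xref, cache: RefCell::new(BTreeMap::new()), parses: Cell::new(0) }
    }

    /// Takes `&self`: the cache is filled through interior mutability.
    fn get_object(&self, id: ObjectId) -> Option<String> {
        // Release the shared borrow before any mutable borrow below.
        let cached = self.cache.borrow().get(&id).cloned();
        if let Some(hit) = cached {
            return Some(hit);
        }
        let entry = self.xref.get(&id)?;
        let body = parse_body(self.source, entry.offset)?;
        self.parses.set(self.parses.get() + 1);
        self.cache.borrow_mut().insert(id, body.clone());
        Some(body)
    }
}
```

Calling get_object twice with the same id parses the body once; the second call is served from the cache. A real implementation would likely hand out references or shared pointers rather than cloning, which this sketch avoids for simplicity.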

§Scope (read-only)

spectre_parse reads PDFs and decrypts password-protected documents. It covers the full PDF Standard Security Handler: RC4 40/128-bit, AES-128 CBC, and AES-256 at R=5 and R=6 (R=6 hashes with Algorithm 2.B); see crate::decrypt. It does NOT write PDFs, edit objects, or run JavaScript in form widgets. It exists to serve the read paths inside spectre_pdf (text extraction, table detection, structural metadata) with the smallest possible parse cost. The matching write/modify surface lives in lopdf-fork/ for development-time use.

Structs§

Content
Dictionary
PDF dictionary <<...>>. Insertion-ordered (the PDF spec does not require ordering, but several generators rely on /Type coming first); lookup is by byte-slice key.
Document
A parsed PDF document. Lazy: open does xref + trailer only.
FormField
One AcroForm field returned by Document::get_form_fields. The field name is fully qualified: partial names are joined with dots per PDF spec §12.7.4.2, so a nested field “address” → “street” → “city” lands as address.street.city.
Operation
One content-stream instruction.
PdfImageInfo
Image inventory returned by Document::get_page_images.
Stream
PDF stream: a dictionary plus a (potentially compressed) content body. Content stays in the source buffer until first decode; the content field caches the decoded bytes after the first Document call that decodes it.
TocEntry
One outline (TOC) entry returned by Document::get_toc.
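
The dotted field names described under FormField can be built by walking a field's parent chain and joining the partial names root-first. The sketch below is illustrative only: FieldNode, its arena-index parent link, and qualified_name are hypothetical and do not reflect how the crate stores fields. It assumes a node without a partial name contributes nothing to the result.

```rust
/// Hypothetical field node: an optional partial name (/T) and an optional
/// index of the parent field within the same slice.
struct FieldNode {
    partial: Option<&'static str>,
    parent: Option<usize>,
}

/// Build the fully qualified name for `nodes[idx]` by collecting partial
/// names from the leaf up to the root, then reversing and joining with dots.
fn qualified_name(nodes: &[FieldNode], mut idx: usize) -> String {
    let mut parts = Vec::new();
    // Bound the walk by the node count so a malformed parent cycle terminates.
    for _ in 0..nodes.len() {
        let node = match nodes.get(idx) {
            Some(node) => node,
            None => break, // dangling parent index
        };
        if let Some(name) = node.partial {
            parts.push(name);
        }
        match node.parent {
            Some(parent) => idx = parent,
            None => break,
        }
    }
    parts.reverse();
    parts.join(".")
}
```

With three nodes chained as address, street, city, calling qualified_name on the city node yields address.street.city.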

Enums§

Encoding
Error
Variants are fine-grained so callers can pattern-match: for example, a caller can map NoOutline to an empty TOC instead of surfacing the error.
FormFieldType
ImageEncoding
Container hint for Document::get_image_bytes.
Object
Dictionaries use IndexMap to preserve insertion order — several PDF generators rely on first-key-wins for /Type lookups.
ParseError
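
The pattern-matching use case mentioned for Error can be sketched as follows. The enum and struct here are local stand-ins written for this example: only the NoOutline variant name comes from the documentation above, while the Xref variant, the TocEntry fields, and toc_or_empty are assumptions.

```rust
/// Stand-in for the crate's error type. Only `NoOutline` is named in the
/// docs; `Xref` is a made-up second variant for contrast.
#[derive(Debug, PartialEq)]
enum Error {
    NoOutline,
    Xref(String),
}

/// Stand-in outline entry with assumed fields.
#[derive(Debug, PartialEq)]
struct TocEntry {
    title: String,
    page: u32,
}

/// Treat a missing outline as an empty TOC; pass every other result through.
fn toc_or_empty(result: Result<Vec<TocEntry>, Error>) -> Result<Vec<TocEntry>, Error> {
    match result {
        Err(Error::NoOutline) => Ok(Vec::new()),
        other => other,
    }
}
```

A caller would wrap the result of a TOC lookup in toc_or_empty so that documents without an outline behave like documents with an empty one, while genuine parse failures still propagate.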

Functions§

resolve_page_encodings
Resolve every page font’s encoding once so a content-stream walk can look up by font resource name.

Type Aliases§

ObjectId
(object_number, generation). Most PDFs only use generation 0; the spec permits incremental updates to bump it.
Result
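
Because ObjectId is a tuple, it orders first by object number and then by generation, which makes range queries over a BTreeMap index straightforward. The sketch below assumes ObjectId is (u32, u16) and uses a plain usize offset as the map value; latest_generation is a hypothetical helper, not part of the crate.

```rust
use std::collections::BTreeMap;

type ObjectId = (u32, u16); // assumed shape: (object_number, generation)

/// Return the highest-generation id recorded for `number`, if any.
/// Tuple ordering keeps all generations of one object number adjacent.
fn latest_generation(index: &BTreeMap<ObjectId, usize>, number: u32) -> Option<ObjectId> {
    index
        .range((number, 0)..=(number, u16::MAX))
        .next_back()
        .map(|(id, _)| *id)
}
```

For an index holding (3, 0), (3, 2), and (4, 0), looking up object number 3 returns (3, 2), and an object number with no entries returns None.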