livre/structure/mod.rs
//! Types that describe the structure of a PDF document and their extraction strategies.
//!
//! ## How do you parse a PDF?
//!
//! From the specification:
//!
//! > PDF processors should read a PDF file from its end. The last line of the file shall
//! > contain only the end-of-file marker, `%%EOF`. The two preceding lines shall contain,
//! > one per line and in order, the keyword `startxref` and the byte offset in the decoded
//! > stream from the beginning of the PDF file to the beginning of the `xref` keyword in
//! > the last cross-reference section. The `startxref` line shall be preceded by the trailer
//! > dictionary
//!
//! Indeed, the very first step in parsing a PDF document is to obtain the cross-reference table.
//! The PDF specification takes a "random-access" and append-only strategy to allow the
//! creation of PDFs in resource-constrained environments. Keep in mind that the technology is
//! not young, and that PDF documents may include thousands of pages with intricate graphics.
//! The random-access strategy allows PDF creators to serialise one object at a time and just
//! reference it where needed.
//!
//! That means that most components within a PDF document are hidden behind an indirection in the
//! form of an *indirect object*. However, because knowing the indirect object's location in advance
//! would dramatically increase the memory load (since you would need to know the size of the
//! serialisation before it is written to disk), the PDF specification resorts to generating
//! a cross-reference table at the end of the file to map from a reference ID to a byte location
//! within the document.
//!
//! Moreover, since the PDF specification allows modifications using an append-only strategy
//! (again for resource minimisation purposes), the full cross-reference mapping is scattered
//! across multiple regions of the document. In practice, cross-reference tables form a linked
//! list, each new table pointing to its predecessor's byte location in the document.
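//!
//! As a rough illustration of that linked list, the sketch below merges a chain of
//! cross-reference sections into a single mapping. `XRefSection` and `collect_xref` are
//! hypothetical names made up for this example, the sections are assumed to be already
//! parsed and keyed by their byte offset, and none of this is the extraction code used by
//! this module.
//!
//! ```rust
//! use std::collections::{HashMap, HashSet};
//!
//! // Hypothetical, simplified cross-reference section: (object number, byte offset)
//! // pairs, plus the byte offset of the previous section, if any.
//! struct XRefSection {
//!     entries: Vec<(u32, usize)>,
//!     prev: Option<usize>,
//! }
//!
//! // Walks the chain starting at `start`, keeping the newest location of each object.
//! fn collect_xref(
//!     sections: &HashMap<usize, XRefSection>,
//!     start: usize,
//! ) -> HashMap<u32, usize> {
//!     let mut merged = HashMap::new();
//!     let mut visited = HashSet::new();
//!     let mut next = Some(start);
//!
//!     while let Some(offset) = next {
//!         // Stop if a malformed file makes the chain loop back on itself.
//!         if !visited.insert(offset) {
//!             break;
//!         }
//!         let Some(section) = sections.get(&offset) else { break };
//!         for &(object, location) in &section.entries {
//!             // Newer sections are visited first, so existing entries are kept.
//!             merged.entry(object).or_insert(location);
//!         }
//!         next = section.prev;
//!     }
//!     merged
//! }
//! ```
//!
//! Because the walk starts from the most recent section, an object that was rewritten by
//! a later update resolves to its newest byte location, while untouched objects still
//! resolve through the older sections.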
//!
//! Hence, the general strategy for parsing a PDF document becomes:
//!
//! 1. Collect the full cross-reference table:
//!     1. Rush to the end of the file! You will find a [`startxref`](trailer_block::StartXRef)
//!        tag which holds the byte location of the most recent [cross-reference table/trailer
//!        block](trailer_block::XRefTrailerBlock).
//!     2. That block may contain a link to the previous cross-reference section. If it does,
//!        follow along and continue your way up the document until you have collected the full
//!        cross-reference table.
//! 2. Iterate through the Pages dictionary.
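//!
//! The first sub-step above can be sketched on raw bytes as follows. This is a minimal
//! illustration that assumes the whole file is already in memory; `find_startxref` is a
//! hypothetical helper written for this example and is not how
//! [`StartXRef`](trailer_block::StartXRef) is extracted in this crate.
//!
//! ```rust
//! // Returns the byte offset announced by the last `startxref` keyword, if any.
//! fn find_startxref(file: &[u8]) -> Option<usize> {
//!     const KEYWORD: &[u8] = b"startxref";
//!
//!     // Search backwards: the keyword sits just before the final `%%EOF` marker.
//!     let position = file
//!         .windows(KEYWORD.len())
//!         .rposition(|window| window == KEYWORD)?;
//!     let tail = &file[position + KEYWORD.len()..];
//!
//!     // Skip the line break after the keyword, then read the run of ASCII digits.
//!     let digits: Vec<u8> = tail
//!         .iter()
//!         .copied()
//!         .skip_while(u8::is_ascii_whitespace)
//!         .take_while(u8::is_ascii_digit)
//!         .collect();
//!
//!     std::str::from_utf8(&digits).ok()?.parse().ok()
//! }
//! ```
//!
//! Searching from the end means that, in a file with several appended updates, the
//! offset found is the one belonging to the most recent cross-reference section.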

mod catalog;
mod content;
mod object_stream;
mod pages;
mod trailer_block;

pub use catalog::{Catalog, PageLayout, PageMode};
pub use object_stream::ObjectStream;
pub use pages::{
    IndividualPageProperties, InheritablePageProperties, Page, PageTreeNode, Resources,
};
pub use trailer_block::{RefLocation, StartXRef, Trailer, XRefTrailerBlock};