Crate rpdfium_parser

PDF file structure parser for rpdfium — a faithful Rust port of PDFium.

This crate implements the PDF file format parser, including:

  • Object model: PDF objects (null, boolean, integer, real, string, name, array, dictionary, stream, reference) with lazy resolution.
  • Tokenizer: Low-level byte-to-token conversion.
  • Header parsing: %PDF-X.Y version detection.
  • Cross-reference tables: Traditional xref and PDF 1.5+ xref streams.
  • Trailer parsing: startxref location, /Prev chain following.
  • Object streams: ObjStm (PDF 1.5+) decompression and extraction.
  • ObjectStore: Central thread-safe lazy-parsing object repository.
  • Linearization detection: Checks for linearized PDF markers.
  • Content stream tokenization: Parses PostScript-like operator sequences.
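
As a rough illustration of the header and trailer items above, the following standalone sketch detects a %PDF-X.Y header and locates the startxref offset. It does not use this crate's API: `Version`, `detect_version`, and `find_startxref` are hypothetical names, and the 1024-byte search window is an assumption, not something this page states.

```rust
/// Hypothetical stand-in for a parsed version; not the crate's `PdfVersion`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Version {
    pub major: u8,
    pub minor: u8,
}

/// Look for `%PDF-X.Y` near the start of the file.
/// The 1024-byte window is an assumed limit for this sketch.
pub fn detect_version(data: &[u8]) -> Option<Version> {
    const MARKER: &[u8] = b"%PDF-";
    let window = &data[..data.len().min(1024)];
    let start = window.windows(MARKER.len()).position(|w| w == MARKER)?;
    let rest = &window[start + MARKER.len()..];
    // Expect a single digit, a dot, and a single digit.
    match rest {
        [maj, b'.', min, ..] if maj.is_ascii_digit() && min.is_ascii_digit() => {
            Some(Version { major: *maj - b'0', minor: *min - b'0' })
        }
        _ => None,
    }
}

/// Find the last `startxref` keyword and parse the decimal offset after it.
pub fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";
    // Search from the end: the final keyword belongs to the newest trailer.
    let pos = data.windows(KEYWORD.len()).rposition(|w| w == KEYWORD)?;
    let tail = &data[pos + KEYWORD.len()..];
    let digits: Vec<u8> = tail
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();
    // An empty digit run fails to parse and yields None.
    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Both functions return `None` on malformed input rather than panicking, which suits a parser that has to handle untrusted files.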

Design Principles

  • #![forbid(unsafe_code)]
  • All deep operations are iterative (explicit Vec stacks), never recursive.
  • OnceLock-based lazy parsing: each object is parsed at most once.
  • Stream /Length is read only when it is a direct object, with a scan for the endstream keyword as the fallback.
  • Security limits enforced: MAX_OBJECT_NUMBER, MAX_RECURSION_DEPTH, etc.
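
The "parsed at most once" principle can be sketched with `std::sync::OnceLock`. This is a minimal sketch, not the crate's `ObjectStore`: `LazySlot` is a hypothetical type, the "parser" only reads a decimal integer, and the call counter exists purely to make the single parse observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical lazily parsed slot; not a type from this crate.
pub struct LazySlot {
    raw: Vec<u8>,
    parsed: OnceLock<Option<i64>>,
    parse_calls: AtomicUsize,
}

impl LazySlot {
    pub fn new(raw: &[u8]) -> Self {
        LazySlot {
            raw: raw.to_vec(),
            parsed: OnceLock::new(),
            parse_calls: AtomicUsize::new(0),
        }
    }

    /// Parse on first access and return the cached result afterwards.
    pub fn get(&self) -> Option<i64> {
        *self.parsed.get_or_init(|| {
            // Runs only for the first caller; later calls reuse the value.
            self.parse_calls.fetch_add(1, Ordering::Relaxed);
            std::str::from_utf8(&self.raw)
                .ok()
                .and_then(|s| s.trim().parse().ok())
        })
    }

    /// Number of times the parse closure has actually run.
    pub fn parse_calls(&self) -> usize {
        self.parse_calls.load(Ordering::Relaxed)
    }
}
```

Because `get` takes `&self`, a slot like this can be shared across threads without a mutex around the cached value. A failed parse is cached as `None` as well, so malformed bytes are not re-parsed on every access.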

Re-exports

pub use content_stream::Operand;
pub use content_stream::Operator;
pub use content_stream::TextArrayElement;
pub use content_stream::tokenize_content_stream;
pub use crypto::CryptoError;
pub use filter::resolve_filter_chain;
pub use header::PdfVersion;
pub use hint_tables::HintTables;
pub use hint_tables::PageOffsetHintTable;
pub use linearized_header::LinearizedInfo;
pub use linearized_header::detect_linearized;
pub use object::Object;
pub use object::StreamData;
pub use object_walker::ObjectStats;
pub use object_walker::ObjectVisitor;
pub use object_walker::ObjectWalker;
pub use security::Permissions;
pub use security::SecurityError;
pub use security::SecurityHandler;
pub use store::ObjectStore;
pub use trailer::TrailerInfo;
pub use xref::XrefEntry;
pub use xref::XrefEntryType;
pub use xref::XrefSection;
pub use xref::XrefTable;

Modules

content_stream
Content stream operator tokenization (Stage 1).
crypto
Low-level cryptographic primitives for PDF encryption.
filter
Filter chain resolution — maps stream dictionary /Filter and /DecodeParms entries to codec types.
header
PDF header parsing — detects %PDF-X.Y and returns the version.
hint_tables
Linearization hint tables – page offset and shared object hint table parsing.
linearized_header
Linearization detection.
object
PDF object model — ObjectId, StreamData, and the Object enum.
object_parser
PDF object parsing — builds Object values from token streams.
object_stream
Object stream (ObjStm) parsing.
object_walker
Object graph walker – iterative BFS traversal with cycle detection.
security
PDF Standard Security Handler (R2–R6).
store
ObjectStore — the central data structure for PDF object access.
tokenizer
Low-level PDF tokenizer.
trailer
Trailer parsing — locates startxref, parses trailer dictionary, and follows the /Prev chain to build the full cross-reference table.
xref
Cross-reference table parsing (traditional xref format).
xref_stream
Cross-reference stream parsing (PDF 1.5+).
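
The object_walker entry above describes an iterative BFS traversal with cycle detection. The sketch below shows that general technique on a plain adjacency map. It is not the crate's `ObjectWalker`: the `Id` alias, the `bfs_order` function, and the map-based graph are assumptions made for illustration.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical object number type; the crate's `ObjectId` may differ.
pub type Id = u32;

/// Visit every object reachable from `root` in breadth-first order.
/// The `visited` set ensures each object is queued once, so reference
/// cycles terminate and no recursion is involved.
pub fn bfs_order(graph: &HashMap<Id, Vec<Id>>, root: Id) -> Vec<Id> {
    let mut visited: HashSet<Id> = HashSet::new();
    let mut queue: VecDeque<Id> = VecDeque::new();
    let mut order: Vec<Id> = Vec::new();

    visited.insert(root);
    queue.push_back(root);

    while let Some(id) = queue.pop_front() {
        order.push(id);
        if let Some(children) = graph.get(&id) {
            for &child in children {
                // `insert` returns false for an already seen id.
                if visited.insert(child) {
                    queue.push_back(child);
                }
            }
        }
    }
    order
}
```

An explicit queue keeps memory use proportional to the number of reachable objects instead of the call-stack depth, which matters for deeply nested or maliciously cyclic files.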

Structs

ObjectId
Unique identifier for a PDF indirect object.