Module chunk_parser

Unified PDF content stream parser that matches the reference ChunkParser architecture.

A single-pass content stream walker that produces text, image, and line chunks, all tracked against one shared graphics state. It handles:

  • Text operators (BT/ET/Tf/Td/Tm/Tj/TJ/etc.)
  • Image extraction via Do operator (XObject images with CTM-based bbox)
  • Form XObject recursive processing via Do operator
  • Inline images (BI/ID/EI)
  • Path/line operators (m/l/c/re/S/f/B/etc.)
  • Graphics state (q/Q/cm/gs)
  • Color operators (g/rg/k/cs/sc/etc.)
  • Marked content (BMC/BDC/EMC)
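
To illustrate how shared graphics state can drive a CTM-based image bbox, here is a minimal sketch. It is not this module's implementation: `Matrix`, `Op`, and `image_bboxes` are hypothetical names, and the operator stream is assumed to be pre-tokenized into a small enum. It models only `q`, `Q`, `cm`, and an image `Do`, and maps the unit square through the current transformation matrix (CTM), since PDF image XObjects are painted into the unit square.

```rust
/// Affine matrix [a b c d e f], in the layout used by the PDF `cm` operator.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

impl Matrix {
    pub const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    pub fn new(a: f64, b: f64, c: f64, d: f64, e: f64, f: f64) -> Matrix {
        Matrix { a, b, c, d, e, f }
    }

    /// Returns the matrix that applies `self` first, then `other`
    /// (row-vector convention, i.e. `self x other`).
    pub fn then(self, other: Matrix) -> Matrix {
        Matrix {
            a: self.a * other.a + self.b * other.c,
            b: self.a * other.b + self.b * other.d,
            c: self.c * other.a + self.d * other.c,
            d: self.c * other.b + self.d * other.d,
            e: self.e * other.a + self.f * other.c + other.e,
            f: self.e * other.b + self.f * other.d + other.f,
        }
    }

    /// Transforms the point (x, y).
    pub fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }
}

/// Simplified, pre-tokenized operators (hypothetical; a real stream has many more).
#[derive(Clone, Copy, Debug)]
pub enum Op {
    Save,           // q
    Restore,        // Q
    Concat(Matrix), // cm
    DrawImage,      // Do, assumed here to name an image XObject
}

/// Bounding box [x_min, y_min, x_max, y_max] of the unit square under `ctm`.
fn unit_square_bbox(ctm: Matrix) -> [f64; 4] {
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
    let mut bbox = [f64::INFINITY, f64::INFINITY, f64::NEG_INFINITY, f64::NEG_INFINITY];
    for &(x, y) in corners.iter() {
        let (tx, ty) = ctm.apply(x, y);
        bbox[0] = bbox[0].min(tx);
        bbox[1] = bbox[1].min(ty);
        bbox[2] = bbox[2].max(tx);
        bbox[3] = bbox[3].max(ty);
    }
    bbox
}

/// Walks the operators once, keeping one CTM and one save stack,
/// and records a bbox for every image draw.
pub fn image_bboxes(ops: &[Op]) -> Vec<[f64; 4]> {
    let mut ctm = Matrix::IDENTITY;
    let mut stack: Vec<Matrix> = Vec::new();
    let mut out = Vec::new();
    for op in ops {
        match *op {
            Op::Save => stack.push(ctm),
            // An unbalanced Q is ignored rather than treated as an error.
            Op::Restore => {
                if let Some(saved) = stack.pop() {
                    ctm = saved;
                }
            }
            // `cm` premultiplies: the new matrix is applied before the old CTM.
            Op::Concat(m) => ctm = m.then(ctm),
            Op::DrawImage => out.push(unit_square_bbox(ctm)),
        }
    }
    out
}
```

For example, the sequence `q`, `200 0 0 100 50 700 cm`, `Do`, `Q` would yield the bbox `[50, 700, 250, 800]` under this sketch, and the CTM returns to its prior value after `Q`. A fuller walker would carry text and color state in the same saved structure, so that all chunk kinds see consistent state.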

Structs

PageChunks
All chunks extracted from a single page.

Functions

extract_page_chunks
Extract all chunks from a single page in one content stream pass.