Unified PDF content stream parser, matching the reference ChunkParser architecture.
Single-pass content stream walker that produces text, image, and line chunks with shared graphics state. Handles:
- Text operators (BT/ET/Tf/Td/Tm/Tj/TJ/etc.)
- Image extraction via the Do operator (XObject images with CTM-based bbox)
- Form XObject recursive processing via the Do operator
- Inline images (BI/ID/EI)
- Path/line operators (m/l/c/re/S/f/B/etc.)
- Graphics state (q/Q/cm/gs)
- Color operators (g/rg/k/cs/sc/etc.)
- Marked content (BMC/BDC/EMC)
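As an illustration of the shared graphics state and the CTM-based image bbox mentioned above, here is a minimal sketch. It is not this module's implementation: `Matrix`, `GraphicsState`, and `image_bbox` are hypothetical names introduced only for the example. It relies on two facts from the PDF specification: `cm` pre-multiplies the current transformation matrix (CTM), and an image XObject is painted into the unit square, so its page-space bbox is the unit square mapped through the CTM.

```rust
/// PDF affine matrix [a b c d e f], applied to row vectors:
/// x' = a*x + c*y + e,  y' = b*x + d*y + f.
/// (Hypothetical type for illustration, not part of this module's API.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    const IDENTITY: Matrix =
        Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Returns `self x other`: apply `self` first, then `other`.
    fn then(&self, o: &Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }

    /// Maps a point from user space through this matrix.
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }
}

/// Hypothetical graphics state: the current CTM plus the q/Q save stack.
struct GraphicsState { ctm: Matrix, stack: Vec<Matrix> }

impl GraphicsState {
    fn new() -> Self {
        GraphicsState { ctm: Matrix::IDENTITY, stack: Vec::new() }
    }

    /// `q`: push a copy of the current state.
    fn save(&mut self) { self.stack.push(self.ctm); }

    /// `Q`: pop the saved state; an unbalanced `Q` is ignored in this sketch.
    fn restore(&mut self) {
        if let Some(m) = self.stack.pop() { self.ctm = m; }
    }

    /// `cm`: the new CTM is the operand matrix times the old CTM.
    fn concat(&mut self, m: Matrix) { self.ctm = m.then(&self.ctm); }

    /// Bbox (x0, y0, x1, y1) of an image painted by `Do`:
    /// the unit square's four corners mapped through the CTM.
    fn image_bbox(&self) -> (f64, f64, f64, f64) {
        let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
        let mut bbox = (f64::INFINITY, f64::INFINITY,
                        f64::NEG_INFINITY, f64::NEG_INFINITY);
        for (x, y) in corners {
            let (px, py) = self.ctm.apply(x, y);
            bbox = (bbox.0.min(px), bbox.1.min(py),
                    bbox.2.max(px), bbox.3.max(py));
        }
        bbox
    }
}
```

For the stream fragment `q 200 0 0 100 50 60 cm /Im0 Do Q`, calling `save`, `concat`, and then `image_bbox` in that order would give the image a bbox spanning (50, 60) to (250, 160), and `restore` would return the CTM to its previous value. Taking the min and max over all four corners keeps the bbox correct when the CTM rotates or flips the image.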
Structs
- PageChunks - All chunks extracted from a single page.
Functions
- extract_page_chunks - Extract all chunks from a single page in one content stream pass.