PDF loading layer — document loading, text extraction, line extraction.
Modules
- annotation_enrichment - Annotation enrichment — post-processing for extracted annotations.
- annotation_extractor - PDF annotation extraction.
- bookmark_extractor - PDF bookmark/outline extraction.
- chunk_parser - Unified PDF content stream parser — matches the reference ChunkParser architecture.
- encryption - PDF encryption detection and password-based loading.
- font - PDF font handling — font resolution, glyph widths, and Unicode mapping.
- form_extractor - AcroForm field extraction from PDF documents.
- graphics_state - PDF graphics state tracking.
- hyperlink_extractor - Hyperlink extraction from PDF link annotations.
- image_extractor - PDF image extraction — find and extract inline/XObject images.
- line_extractor - PDF line segment extraction — extract stroked/filled paths as LineChunks.
- loader - PDF document loading via lopdf.
- metadata_writer - PDF metadata writer — update document metadata (title, author, etc.) in a lopdf Document before saving.
- page_info - Page geometry extraction — MediaBox, CropBox, Rotation.
- raster_table_ocr - Recover text signal from raster table images using local OCR.
- text_extractor - Text extraction from PDF content streams.
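
As an illustration of the kind of page geometry the page_info entry refers to, the following is a minimal, std-only sketch of how MediaBox, CropBox, and Rotation can combine into a displayed page size. The names `Rect`, `PageGeometry`, `visible_box`, and `display_size` are hypothetical and are not taken from this crate; the sketch only assumes the usual PDF conventions (a rectangle may be given by any two opposite corners, CropBox defaults to MediaBox and is clipped to it, and the rotation is a multiple of 90 degrees).

```rust
/// Rectangle in PDF user-space units, laid out like a PDF rectangle array
/// [llx, lly, urx, ury]. Hypothetical type, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub llx: f64,
    pub lly: f64,
    pub urx: f64,
    pub ury: f64,
}

impl Rect {
    /// Reorder the corners so (llx, lly) is the lower-left one, since a PDF
    /// rectangle may list any two diagonally opposite corners.
    pub fn normalized(self) -> Rect {
        Rect {
            llx: self.llx.min(self.urx),
            lly: self.lly.min(self.ury),
            urx: self.llx.max(self.urx),
            ury: self.lly.max(self.ury),
        }
    }

    pub fn width(self) -> f64 {
        self.urx - self.llx
    }

    pub fn height(self) -> f64 {
        self.ury - self.lly
    }

    /// Intersection of two normalized rectangles, or None if they do not overlap.
    pub fn intersect(self, other: Rect) -> Option<Rect> {
        let llx = self.llx.max(other.llx);
        let lly = self.lly.max(other.lly);
        let urx = self.urx.min(other.urx);
        let ury = self.ury.min(other.ury);
        if llx < urx && lly < ury {
            Some(Rect { llx, lly, urx, ury })
        } else {
            None
        }
    }
}

/// Hypothetical per-page geometry record (not this crate's actual type).
#[derive(Debug, Clone, Copy)]
pub struct PageGeometry {
    pub media_box: Rect,
    pub crop_box: Option<Rect>,
    /// Raw rotation value in degrees; may be negative or >= 360.
    pub rotate: i32,
}

impl PageGeometry {
    /// CropBox clipped to MediaBox; falls back to MediaBox when CropBox is
    /// absent. Falling back when the two do not overlap is an assumption
    /// made for this sketch.
    pub fn visible_box(&self) -> Rect {
        let media = self.media_box.normalized();
        match self.crop_box {
            Some(crop) => crop.normalized().intersect(media).unwrap_or(media),
            None => media,
        }
    }

    /// Rotation folded into the range 0..360 (assumes a multiple of 90).
    pub fn rotation(&self) -> u32 {
        self.rotate.rem_euclid(360) as u32
    }

    /// Displayed (width, height): swapped for 90 and 270 degree rotations.
    pub fn display_size(&self) -> (f64, f64) {
        let b = self.visible_box();
        match self.rotation() {
            90 | 270 => (b.height(), b.width()),
            _ => (b.width(), b.height()),
        }
    }
}
```

With these definitions, a US Letter page (MediaBox 0 0 612 792) with no CropBox and a rotation of 90 reports a displayed size of 792 by 612. Keeping the raw boxes and deriving the displayed size on demand is one possible design; the actual module may store or compute these values differently.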