Module pdf

PDF loading layer — document loading, text extraction, line extraction.

Modules

annotation_enrichment: Annotation enrichment — post-processing for extracted annotations.
annotation_extractor: PDF annotation extraction.
bookmark_extractor: PDF bookmark/outline extraction.
chunk_parser: Unified PDF content stream parser — matches the reference ChunkParser architecture.
encryption: PDF encryption detection and password-based loading.
font: PDF font handling — font resolution, glyph widths, and Unicode mapping.
form_extractor: AcroForm field extraction from PDF documents.
graphics_state: PDF graphics state tracking.
hyperlink_extractor: Hyperlink extraction from PDF link annotations.
image_extractor: PDF image extraction — find and extract inline/XObject images.
line_extractor: PDF line segment extraction — extract stroked/filled paths as LineChunks.
loader: PDF document loading via lopdf.
metadata_writer: PDF metadata writer — update document metadata (title, author, etc.) in a lopdf Document before saving.
page_info: Page geometry extraction — MediaBox, CropBox, Rotation.
raster_table_ocr: Recover text signal from raster table images using local OCR.
text_extractor: Text extraction from PDF content streams.
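
As an illustration of the kind of geometry that page_info describes (MediaBox, CropBox, Rotation), here is a minimal sketch of how the displayed size of a page could be derived from those three values. The `Rect` and `PageGeometry` types and all of their methods are hypothetical names introduced for this example; they are not taken from this crate's API. The sketch assumes the usual PDF conventions: a missing CropBox falls back to the MediaBox, the CropBox is clipped to the MediaBox, and rotation is a multiple of 90 degrees.

```rust
/// Hypothetical rectangle in PDF user-space units (points).
/// Illustrative only; not the crate's actual type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl Rect {
    /// Build a rectangle from two opposite corners given in any order.
    pub fn new(xa: f64, ya: f64, xb: f64, yb: f64) -> Self {
        Rect {
            x0: xa.min(xb),
            y0: ya.min(yb),
            x1: xa.max(xb),
            y1: ya.max(yb),
        }
    }

    pub fn width(&self) -> f64 {
        self.x1 - self.x0
    }

    pub fn height(&self) -> f64 {
        self.y1 - self.y0
    }

    /// Intersection of two rectangles, or `None` if they do not overlap.
    pub fn intersect(&self, other: &Rect) -> Option<Rect> {
        let x0 = self.x0.max(other.x0);
        let y0 = self.y0.max(other.y0);
        let x1 = self.x1.min(other.x1);
        let y1 = self.y1.min(other.y1);
        if x0 < x1 && y0 < y1 {
            Some(Rect { x0, y0, x1, y1 })
        } else {
            None
        }
    }
}

/// Hypothetical per-page geometry record (assumed shape, for illustration).
#[derive(Debug, Clone, Copy)]
pub struct PageGeometry {
    pub media_box: Rect,
    pub crop_box: Option<Rect>,
    /// Rotation in degrees; assumed to be a multiple of 90, possibly negative.
    pub rotation: i32,
}

impl PageGeometry {
    /// The CropBox clipped to the MediaBox, falling back to the MediaBox
    /// when the CropBox is absent or does not overlap it.
    pub fn visible_box(&self) -> Rect {
        self.crop_box
            .and_then(|c| c.intersect(&self.media_box))
            .unwrap_or(self.media_box)
    }

    /// Rotation folded into the range 0..360.
    pub fn normalized_rotation(&self) -> i32 {
        self.rotation.rem_euclid(360)
    }

    /// (width, height) as displayed: swapped for 90 and 270 degree rotations.
    pub fn displayed_size(&self) -> (f64, f64) {
        let b = self.visible_box();
        match self.normalized_rotation() {
            90 | 270 => (b.height(), b.width()),
            _ => (b.width(), b.height()),
        }
    }
}
```

With these assumed types, a US Letter page (MediaBox of 0 0 612 792) with no CropBox and a rotation of 90 has a displayed size of 792 by 612 points. Keeping the raw boxes separate from the derived size is one possible design choice, since extraction code typically needs the unrotated coordinates while layout code needs the displayed ones.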