Module pdf

PDF loading layer — document loading, text extraction, line extraction.

Modules

annotation_enrichment
Annotation enrichment — post-processing for extracted annotations.
annotation_extractor
PDF annotation extraction.
bookmark_extractor
PDF bookmark/outline extraction.
chunk_parser
Unified PDF content stream parser — matches the reference ChunkParser architecture.
encryption
PDF encryption detection and password-based loading.
font
PDF font handling — font resolution, glyph widths, and Unicode mapping.
form_extractor
AcroForm field extraction from PDF documents.
graphics_state
PDF graphics state tracking.
hyperlink_extractor
Hyperlink extraction from PDF link annotations.
image_extractor
PDF image extraction — find and extract inline/XObject images.
line_extractor
PDF line segment extraction — extract stroked/filled paths as LineChunks.
loader
PDF document loading via lopdf.
metadata_writer
PDF metadata writer — update document metadata (title, author, etc.) in a lopdf Document before saving.
page_info
Page geometry extraction — MediaBox, CropBox, Rotation.
raster_table_ocr
Recover text signal from raster table images using local OCR.
text_extractor
Text extraction from PDF content streams.
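
The page_info entry above names MediaBox, CropBox, and Rotation. As a rough illustration of how those three values usually combine into a displayed page size, here is a minimal, std-only sketch. It assumes the common PDF conventions that CropBox falls back to MediaBox when absent and that Rotation is a multiple of 90 degrees. The `Rect` and `PageGeometry` types and the `display_size` method are hypothetical names chosen for this example; they are not taken from this crate's API.

```rust
/// Hypothetical rectangle in PDF user-space units (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    llx: f64, // lower-left x
    lly: f64, // lower-left y
    urx: f64, // upper-right x
    ury: f64, // upper-right y
}

impl Rect {
    /// Width, tolerant of corners given in reversed order.
    fn width(&self) -> f64 {
        (self.urx - self.llx).abs()
    }

    /// Height, tolerant of corners given in reversed order.
    fn height(&self) -> f64 {
        (self.ury - self.lly).abs()
    }
}

/// Hypothetical per-page geometry record (an assumption for illustration).
struct PageGeometry {
    media_box: Rect,
    crop_box: Option<Rect>,
    rotation: i32, // degrees, expected to be a multiple of 90
}

impl PageGeometry {
    /// The visible region: CropBox if present, otherwise MediaBox.
    fn visible_box(&self) -> Rect {
        self.crop_box.unwrap_or(self.media_box)
    }

    /// Displayed (width, height) after applying the page rotation.
    fn display_size(&self) -> (f64, f64) {
        let b = self.visible_box();
        // Normalize negative or >360 values into 0..360.
        let r = self.rotation.rem_euclid(360);
        if r == 90 || r == 270 {
            (b.height(), b.width()) // quarter turns swap the axes
        } else {
            (b.width(), b.height())
        }
    }
}

fn main() {
    // US Letter page, no CropBox, rotated a quarter turn.
    let page = PageGeometry {
        media_box: Rect { llx: 0.0, lly: 0.0, urx: 612.0, ury: 792.0 },
        crop_box: None,
        rotation: 90,
    };
    let (w, h) = page.display_size();
    println!("{w} x {h}");
}
```

Keeping the CropBox optional mirrors how the key is often omitted in real files, so the fallback lives in one place instead of at every call site.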