Module orchestrator

Pipeline orchestrator — runs all processing stages in sequence.

  PDF bytes
    │
    ▼
 ┌─────────────────────────────────────────────┐
 │ Stage 0:  Page Range Filtering               │
 │ Stage 1b: Watermark Removal                  │
 │ Stage 2:  Content Filtering + FFFD Replace   │
 └──────────────────┬──────────────────────────┘
                    │  raw TextChunks, Lines, Images
                    ▼
 ┌─────────────────────────────────────────────┐
 │ Stage 3-4: Border Table Detection            │
 │ Stage 4b:  Content → Table Cells             │
 │ Stage 4c:  Boxed Heading Promoter            │
 │ Stage 4d:  Pre-Cluster Table Release         │
 └──────────────────┬──────────────────────────┘
                    │  TextChunks + TableBorders
                    ▼
 ┌─────────────────────────────────────────────┐
 │ Stage 5b: Column Detection                   │
 │ Stage 6:  TextChunk → TextLine Grouping      │
 │ Stage 6.5: List Detection Pass 1 (TextLine)  │
 │ Stage 7:  TextLine → TextBlock Grouping      │
 │ Stage 7b: Cluster (Borderless) Tables        │
 └──────────────────┬──────────────────────────┘
                    │  TextBlocks + Tables + Lists
                    ▼
 ┌─────────────────────────────────────────────┐
 │ Stage 8:  Header / Footer Detection          │
 │ Stage 9:  List Detection Pass 1 (Block)      │
 │ Stage 10: Paragraph Detection                │
 │ Stage 10b: Figure Detection                  │
 │ Stage 12: Heading Detection                  │
 └──────────────────┬──────────────────────────┘
                    │  Semantic elements
                    ▼
 ┌─────────────────────────────────────────────┐
 │ Stage 11:  List Detection Pass 2 (Paragraph) │
 │ Stage 13:  ID Assignment                     │
 │ Stage 14:  Caption + Footnote + TOC Linking  │
 │ Stage 15:  Cross-Page Table Linking          │
 │ Stage 17:  Nesting Levels                    │
 │ Stage 18:  Reading Order Sort                │
 │ Stage 19:  Content Sanitization              │
 └──────────────────┬──────────────────────────┘
                    │
                    ▼
             PdfDocument (ready for output)
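
The staged flow in the diagram can be sketched as an ordered list of stage functions applied to one shared state value. This is only a minimal illustration under stated assumptions, not the crate's implementation: `Chunk`, `SketchState`, `Stage`, `run_stages`, and both stage bodies are hypothetical names, and the real `PipelineState` fields and `run_pipeline` signature are not shown on this page.

```rust
/// Hypothetical text chunk: distance from the top of the page plus its text.
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    top: i32,
    text: String,
}

/// Hypothetical stand-in for the state passed between stages.
#[derive(Debug, Default)]
struct SketchState {
    chunks: Vec<Chunk>,
    /// Labels of the stages that have completed, in execution order.
    trace: Vec<&'static str>,
}

/// A stage mutates the shared state and may report an error.
type Stage = fn(&mut SketchState) -> Result<(), String>;

/// Assumed behaviour for illustration: drop chunks that contain only whitespace.
fn filter_content(state: &mut SketchState) -> Result<(), String> {
    state.chunks.retain(|c| !c.text.trim().is_empty());
    Ok(())
}

/// Assumed behaviour for illustration: order chunks from top to bottom.
fn sort_reading_order(state: &mut SketchState) -> Result<(), String> {
    state.chunks.sort_by_key(|c| c.top);
    Ok(())
}

/// Run every stage in the given order, stopping at the first failure.
fn run_stages(
    mut state: SketchState,
    stages: &[(&'static str, Stage)],
) -> Result<SketchState, String> {
    for &(label, stage) in stages {
        stage(&mut state).map_err(|e| format!("{label} failed: {e}"))?;
        state.trace.push(label);
    }
    Ok(state)
}

fn main() {
    let stages: [(&'static str, Stage); 2] = [
        ("content filtering", filter_content),
        ("reading order sort", sort_reading_order),
    ];
    let state = SketchState {
        chunks: vec![
            Chunk { top: 120, text: "Second line".to_string() },
            Chunk { top: 40, text: "   ".to_string() },
            Chunk { top: 20, text: "First line".to_string() },
        ],
        trace: Vec::new(),
    };
    let out = run_stages(state, &stages).expect("sketch stages do not fail");
    for chunk in &out.chunks {
        println!("{:>4}  {}", chunk.top, chunk.text);
    }
    println!("ran: {:?}", out.trace);
}
```

Keeping the stages in a plain slice makes the execution order explicit and easy to audit against a diagram like the one above, and the `trace` field records which stages actually ran before any error.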

Structs

PipelineState: Pipeline state passed between stages.

Functions

run_pipeline: Run the full 20-stage pipeline.

Type Aliases

PageContent: Per-page content during pipeline processing.
