Pipeline orchestrator — runs all processing stages in sequence.
               PDF bytes
                   │
                   ▼
┌─────────────────────────────────────────────┐
│ Stage 0: Page Range Filtering               │
│ Stage 1b: Watermark Removal                 │
│ Stage 2: Content Filtering + FFFD Replace   │
└──────────────────┬──────────────────────────┘
                   │ raw TextChunks, Lines, Images
                   ▼
┌─────────────────────────────────────────────┐
│ Stage 3-4: Border Table Detection           │
│ Stage 4b: Content → Table Cells             │
│ Stage 4c: Boxed Heading Promoter            │
│ Stage 4d: Pre-Cluster Table Release         │
└──────────────────┬──────────────────────────┘
                   │ TextChunks + TableBorders
                   ▼
┌─────────────────────────────────────────────┐
│ Stage 5b: Column Detection                  │
│ Stage 6: TextChunk → TextLine Grouping      │
│ Stage 6.5: List Detection Pass 1 (TextLine) │
│ Stage 7: TextLine → TextBlock Grouping      │
│ Stage 7b: Cluster (Borderless) Tables       │
└──────────────────┬──────────────────────────┘
                   │ TextBlocks + Tables + Lists
                   ▼
┌─────────────────────────────────────────────┐
│ Stage 8: Header / Footer Detection          │
│ Stage 9: List Detection Pass 1 (Block)      │
│ Stage 10: Paragraph Detection               │
│ Stage 10b: Figure Detection                 │
│ Stage 12: Heading Detection                 │
└──────────────────┬──────────────────────────┘
                   │ Semantic elements
                   ▼
┌─────────────────────────────────────────────┐
│ Stage 11: List Detection Pass 2 (Paragraph) │
│ Stage 13: ID Assignment                     │
│ Stage 14: Caption + Footnote + TOC Linking  │
│ Stage 15: Cross-Page Table Linking          │
│ Stage 17: Nesting Levels                    │
│ Stage 18: Reading Order Sort                │
│ Stage 19: Content Sanitization              │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
    PdfDocument (ready for output)

Structs
- PipelineState - Pipeline state passed between stages.
Functions
- run_pipeline - Run the full 20-stage pipeline.
Type Aliases
- PageContent - Per-page content during pipeline processing.
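
The orchestration pattern described on this page (a state value handed from one stage to the next, in a fixed order) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not this crate's implementation: `SketchState`, `Stage`, `replace_fffd`, `drop_empty`, and `run_sketch_pipeline` are hypothetical names, and the real `PipelineState` and `run_pipeline` signatures are not shown on this page. Replacing U+FFFD with a space is likewise an assumption about what "FFFD Replace" might do.

```rust
// Hypothetical sketch only: none of these names come from the documented crate.

/// Stand-in for a state value that every stage reads and mutates.
#[derive(Debug, Default)]
struct SketchState {
    /// Extracted text fragments (assumed representation).
    chunks: Vec<String>,
    /// Names of the stages that have run, in order.
    log: Vec<&'static str>,
}

/// A stage is a plain function that mutates the shared state or fails.
type Stage = fn(&mut SketchState) -> Result<(), String>;

/// Assumed behaviour: swap the Unicode replacement character for a space.
fn replace_fffd(state: &mut SketchState) -> Result<(), String> {
    for chunk in &mut state.chunks {
        *chunk = chunk.replace('\u{FFFD}', " ");
    }
    state.log.push("fffd_replace");
    Ok(())
}

/// Assumed behaviour: discard chunks that contain only whitespace.
fn drop_empty(state: &mut SketchState) -> Result<(), String> {
    state.chunks.retain(|c| !c.trim().is_empty());
    state.log.push("drop_empty");
    Ok(())
}

/// Run every stage in sequence, stopping at the first error.
fn run_sketch_pipeline(
    mut state: SketchState,
    stages: &[Stage],
) -> Result<SketchState, String> {
    for stage in stages {
        stage(&mut state)?;
    }
    Ok(state)
}

fn main() {
    let stages: [Stage; 2] = [replace_fffd, drop_empty];
    let state = SketchState {
        chunks: vec!["a\u{FFFD}b".to_string(), "  ".to_string()],
        log: Vec::new(),
    };
    let out = run_sketch_pipeline(state, &stages).expect("sketch stages do not fail");
    println!("{:?}", out);
}
```

Keeping each stage as a function over one mutable state value makes the fixed ordering in the diagram explicit: the order of the `stages` array is the order of execution, and an error from any stage short-circuits the rest.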