edgeparse-core
High-performance PDF-to-structured-data extraction engine.
edgeparse-core implements a 20-stage processing pipeline that extracts text,
tables, images, and semantic structure from PDF documents and produces
structured output in Markdown, JSON, HTML, or plain text.
Usage
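use edgeparse_core::convert; // import path assumed from the crate name; confirm against the crate docs
use std::path::Path;

let config = Default::default(); // config type is inferred from `convert`
let doc = convert(Path::new("document.pdf"), &config)?; // argument list assumed: input path, then config
println!("{} top-level elements", doc.kids.len()); // assumes `kids` is a Vec-like collection
for element in &doc.kids {
    // Handle each top-level element here.
}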
Output Formats
Generate output in multiple formats using the output modules:
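use edgeparse_core::output; // module path assumed from the crate name

// Each function lives in a module under `output`; the exact submodule paths
// are not shown here. Passing the document by reference is an assumption.
let markdown = to_markdown(&doc)?;
let json = to_legacy_json_string(&doc)?;
let html = to_html(&doc)?;
let text = to_text(&doc)?;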
Features
- Tagged PDF support — uses PDF structure tree for semantic extraction
- Table detection — border-based and cluster detection methods
- Reading order — XY-Cut++ algorithm for correct reading order
- Image extraction — embedded or external image output
- Content safety — filters hidden text, off-page content, tiny text
- PII sanitization — optional personal data redaction
- Multi-column layout — automatic column detection and ordering
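
To make the reading-order and multi-column features more concrete, here is a minimal sketch of the classic recursive XY-cut idea. It is not the XY-Cut++ variant and not edgeparse-core's implementation. The `BBox` type, the `reading_order` function, the `min_gap` parameter, and the top-left coordinate origin (y grows downward) are all assumptions made for illustration.

```rust
/// Axis-aligned bounding box. Hypothetical type for this sketch;
/// y grows downward (top-left origin), which is an assumption.
#[derive(Debug, Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// Project intervals onto one axis and return the widest empty gap
/// as (gap_width, midpoint), or None if the intervals form one run.
fn widest_gap(mut spans: Vec<(f64, f64)>) -> Option<(f64, f64)> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = spans.first()?.1; // furthest interval end seen so far
    let mut best: Option<(f64, f64)> = None;
    for &(start, end) in &spans[1..] {
        if start > reach {
            let gap = start - reach;
            if best.map_or(true, |(g, _)| gap > g) {
                best = Some((gap, (reach + start) / 2.0));
            }
        }
        reach = reach.max(end);
    }
    best
}

/// Recursively split the boxes at the widest gap on either axis and
/// append their indices to `out` in reading order.
fn xy_cut(boxes: &[BBox], ids: Vec<usize>, min_gap: f64, out: &mut Vec<usize>) {
    if ids.len() <= 1 {
        out.extend(ids);
        return;
    }
    let x = widest_gap(ids.iter().map(|&i| (boxes[i].x0, boxes[i].x1)).collect())
        .filter(|g| g.0 >= min_gap);
    let y = widest_gap(ids.iter().map(|&i| (boxes[i].y0, boxes[i].y1)).collect())
        .filter(|g| g.0 >= min_gap);
    // Choose the axis with the larger qualifying gap.
    let (vertical, pos) = match (x, y) {
        (Some(xg), Some(yg)) => {
            if xg.0 > yg.0 { (true, xg.1) } else { (false, yg.1) }
        }
        (Some(xg), None) => (true, xg.1),
        (None, Some(yg)) => (false, yg.1),
        (None, None) => {
            // No usable cut: fall back to top-to-bottom, then left-to-right.
            let mut ids = ids;
            ids.sort_by(|&a, &b| {
                boxes[a].y0
                    .total_cmp(&boxes[b].y0)
                    .then(boxes[a].x0.total_cmp(&boxes[b].x0))
            });
            out.extend(ids);
            return;
        }
    };
    // No box straddles an empty gap, so the centre decides the side.
    let (first, second): (Vec<usize>, Vec<usize>) = ids.into_iter().partition(|&i| {
        let b = boxes[i];
        let centre = if vertical { (b.x0 + b.x1) / 2.0 } else { (b.y0 + b.y1) / 2.0 };
        centre < pos
    });
    xy_cut(boxes, first, min_gap, out); // left or top part first
    xy_cut(boxes, second, min_gap, out); // then right or bottom part
}

/// Return the indices of `boxes` in estimated reading order.
fn reading_order(boxes: &[BBox], min_gap: f64) -> Vec<usize> {
    let mut out = Vec::with_capacity(boxes.len());
    xy_cut(boxes, (0..boxes.len()).collect(), min_gap, &mut out);
    out
}
```

For a page with a full-width header above two columns, this sketch first cuts the header off horizontally, then separates the columns with a vertical cut, and finally orders each column top to bottom. Preferring the wider gap is a simple heuristic; production algorithms add further rules for elements that span columns.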
Feature Flags
| Flag | Description |
|---|---|
| `hybrid` | Enable Docling backend integration (requires tokio + reqwest) |
License
Apache-2.0