§PDFoxide
High-performance PDF parsing and conversion library built in Rust with Python bindings.
§Features (v0.1.0)
- PDF Parsing: Parse PDF 1.0-1.7 documents with full encryption support
- Text Extraction: Extract text with accurate Unicode mapping and ToUnicode CMap support
- Layout Analysis: Multi-column detection with XY-Cut and DBSCAN clustering
- Format Conversion: Convert to Markdown, HTML, and plain text
- Image Extraction: Extract embedded images (JPEG, PNG) with metadata
- Structure Tree: Parse PDF logical structure (tagged PDFs)
- Annotations: Extract PDF annotations, comments, and highlights
- Bookmarks: Extract document outline/bookmarks with hierarchy
- Python Bindings: Easy-to-use Python API via PyO3
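To make the layout-analysis bullet concrete, here is a minimal sketch of recursive XY-Cut over text-block bounding boxes. It illustrates the general technique only and is not pdf_oxide's implementation: the `Rect` type, the `widest_gap` and `xy_cut` functions, and the `min_gap` threshold are hypothetical names chosen for this example, and the sketch assumes that y grows downward so that smaller y means higher on the page.

```rust
/// Axis-aligned bounding box of a text block (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Project boxes onto one axis as (start, end) spans and look for the widest
/// empty gap. Returns the midpoint of that gap if it is at least `min_gap` wide.
fn widest_gap(mut spans: Vec<(f32, f32)>, min_gap: f32) -> Option<f32> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = spans.first()?.1; // furthest coordinate covered so far
    let mut best: Option<(f32, f32)> = None; // (gap width, cut position)
    for &(start, end) in &spans[1..] {
        let gap = start - reach;
        if gap > 0.0 && gap >= min_gap && best.map_or(true, |(w, _)| gap > w) {
            best = Some((gap, reach + gap / 2.0));
        }
        reach = reach.max(end);
    }
    best.map(|(_, cut)| cut)
}

/// Recursively split `boxes` into groups, returned in reading order:
/// columns left to right, then blocks top to bottom within each column.
pub fn xy_cut(boxes: Vec<Rect>, min_gap: f32) -> Vec<Vec<Rect>> {
    if boxes.is_empty() {
        return Vec::new();
    }
    // Try a vertical cut first (a gap along x), so columns separate before rows.
    let xs: Vec<(f32, f32)> = boxes.iter().map(|b| (b.x0, b.x1)).collect();
    if let Some(cut) = widest_gap(xs, min_gap) {
        let (left, right): (Vec<Rect>, Vec<Rect>) =
            boxes.into_iter().partition(|b| b.x1 <= cut);
        let mut out = xy_cut(left, min_gap);
        out.extend(xy_cut(right, min_gap));
        return out;
    }
    // Otherwise try a horizontal cut (a gap along y).
    let ys: Vec<(f32, f32)> = boxes.iter().map(|b| (b.y0, b.y1)).collect();
    if let Some(cut) = widest_gap(ys, min_gap) {
        let (top, bottom): (Vec<Rect>, Vec<Rect>) =
            boxes.into_iter().partition(|b| b.y1 <= cut);
        let mut out = xy_cut(top, min_gap);
        out.extend(xy_cut(bottom, min_gap));
        return out;
    }
    // No gap wide enough on either axis: these boxes form one block.
    vec![boxes]
}
```

Calling `xy_cut` on the four blocks of a two-column page yields the left column's blocks before the right column's, regardless of input order. A larger `min_gap` makes the splitter more conservative and keeps nearby blocks together.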
§Planned for v1.0
- ML Integration: Advanced layout analysis with ONNX models
- Table Detection: Production-ready ML-based table extraction
- OCR: Text extraction from scanned PDFs via Tesseract
- WASM Target: Run in browsers via WebAssembly
- Digital Signatures: Signature verification and creation
§Quick Start
```rust
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;

// Extract text from first page
let text = doc.extract_text(0)?;
println!("{}", text);

// Convert to Markdown
let options = ConversionOptions::default();
let markdown = doc.to_markdown(0, &options)?;

// Extract images
let images = doc.extract_images(0)?;
```
§Python Usage
```python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
markdown = doc.to_markdown(0)
```
§License
Licensed under either of:
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or <http://www.apache.org/licenses/LICENSE-2.0>)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or <http://opensource.org/licenses/MIT>)
at your option.
§Re-exports
- pub use annotations::Annotation;
- pub use annotations::LinkAction;
- pub use annotations::LinkDestination;
- pub use config::PdfConfig;
- pub use document::ExtractedImageRef;
- pub use document::ImageFormat;
- pub use document::PdfDocument;
- pub use error::Error;
- pub use error::Result;
- pub use outline::Destination;
- pub use outline::OutlineItem;
§Modules
- annotations: PDF annotations support.
- config: Configuration for PDF processing.
- content: PDF content stream parsing and execution.
- converters: Format converters for PDF documents.
- decoders: Stream decoder implementations for PDF filters.
- document: PDF document model.
- encryption: PDF encryption support.
- error: Error types for the PDF library.
- extractors: Text and content extraction from PDF documents.
- fonts: Font handling and encoding.
- geometry: Geometric primitives for layout analysis.
- hybrid: Hybrid classical + ML architecture.
- images: Image extraction.
- layout: Layout analysis algorithms for PDF documents.
- lexer: PDF lexer (tokenizer).
- object: PDF object types.
- objstm: Object stream parsing (PDF 1.5+).
- outline: PDF document outline (bookmarks) support.
- parser: PDF object parser.
- parser_config: Parser configuration options.
- structure: PDF logical structure (Tagged PDF) support.
- xref: Cross-reference table parser.
- xref_reconstruction: Cross-reference table reconstruction for damaged PDFs.
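As an illustration of what cross-reference reconstruction for damaged PDFs can involve, one common recovery approach is to scan the raw bytes for `N G obj` headers and rebuild an offset table from them. The sketch below shows that idea only. It is not the `xref_reconstruction` module's code, and `scan_object_offsets`, `parse_header`, `take_number`, and `skip_ws` are hypothetical names. It also ignores the possibility that stream data happens to contain text that looks like an object header.

```rust
use std::collections::BTreeMap;

/// Parse a run of ASCII digits at the start of `s`; returns (value, bytes consumed).
fn take_number(s: &[u8]) -> Option<(u64, usize)> {
    let len = s.iter().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 || len > 10 {
        return None;
    }
    let text = std::str::from_utf8(&s[..len]).ok()?;
    Some((text.parse::<u64>().ok()?, len))
}

/// Count leading ASCII whitespace bytes.
fn skip_ws(s: &[u8]) -> usize {
    s.iter().take_while(|b| b.is_ascii_whitespace()).count()
}

/// Try to read `<number> <generation> obj` at the start of `s`.
/// Returns (object number, generation, bytes consumed).
fn parse_header(s: &[u8]) -> Option<(u32, u16, usize)> {
    let (num, a) = take_number(s)?;
    let ws1 = skip_ws(&s[a..]);
    if ws1 == 0 {
        return None;
    }
    let (gen, b) = take_number(&s[a + ws1..])?;
    let ws2 = skip_ws(&s[a + ws1 + b..]);
    if ws2 == 0 {
        return None;
    }
    let kw = a + ws1 + b + ws2;
    if !s[kw..].starts_with(b"obj") {
        return None;
    }
    let end = kw + 3;
    // Reject longer identifiers that merely begin with "obj".
    if s.get(end).map_or(false, |c| c.is_ascii_alphanumeric()) {
        return None;
    }
    Some((u32::try_from(num).ok()?, u16::try_from(gen).ok()?, end))
}

/// Scan the whole file and map each object number to (generation, byte offset).
/// A later definition replaces an earlier one, which matches how incremental
/// updates append newer versions of an object to the end of the file.
pub fn scan_object_offsets(data: &[u8]) -> BTreeMap<u32, (u16, usize)> {
    let mut found = BTreeMap::new();
    let mut i = 0;
    while i < data.len() {
        // A header can only begin at the start of the data or after whitespace.
        let at_boundary = i == 0 || data[i - 1].is_ascii_whitespace();
        if at_boundary && data[i].is_ascii_digit() {
            if let Some((num, gen, consumed)) = parse_header(&data[i..]) {
                found.insert(num, (gen, i));
                i += consumed;
                continue;
            }
        }
        i += 1;
    }
    found
}
```

The resulting map plays the role of a rebuilt cross-reference table: given an object number, it gives the byte offset at which a parser can start reading that object.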