Crate pdf_oxide


§PDFoxide

High-performance PDF parsing and conversion library built in Rust with Python bindings.

§Features (v0.1.0)

  • PDF Parsing: Parse PDF 1.0-1.7 documents with full encryption support
  • Text Extraction: Extract text with accurate Unicode mapping and ToUnicode CMap support
  • Layout Analysis: Multi-column detection with XY-Cut and DBSCAN clustering
  • Format Conversion: Convert to Markdown, HTML, and plain text
  • Image Extraction: Extract embedded images (JPEG, PNG) with metadata
  • Structure Tree: Parse PDF logical structure (tagged PDFs)
  • Annotations: Extract PDF annotations, comments, and highlights
  • Bookmarks: Extract document outline/bookmarks with hierarchy
  • Python Bindings: Easy-to-use Python API via PyO3
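
To illustrate the idea behind XY-Cut, named in the layout analysis feature above, here is a minimal self-contained sketch. It is not pdf_oxide's implementation: the `Rect` type, the `widest_gap` and `xy_cut` functions, and the `min_gap` threshold are hypothetical names chosen for this example, and it assumes y grows downward. The sketch recursively splits text blocks at the widest whitespace gap, trying a vertical cut (columns) before a horizontal one.

```rust
/// Axis-aligned bounding box of a text block (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// Project intervals onto one axis and return the midpoint of the widest
/// empty gap that is at least `min_gap` wide, if any.
fn widest_gap(mut spans: Vec<(f64, f64)>, min_gap: f64) -> Option<f64> {
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = spans.first()?.1; // furthest end seen so far
    let mut best: Option<(f64, f64)> = None; // (gap width, cut position)
    for &(start, end) in &spans[1..] {
        let gap = start - reach;
        if gap >= min_gap && best.map_or(true, |(width, _)| gap > width) {
            best = Some((gap, reach + gap / 2.0));
        }
        reach = reach.max(end);
    }
    best.map(|(_, cut)| cut)
}

/// Recursively split blocks into regions. A vertical cut (between columns)
/// is tried first, then a horizontal cut; blocks with no usable gap form a leaf.
fn xy_cut(blocks: Vec<Rect>, min_gap: f64) -> Vec<Vec<Rect>> {
    if blocks.is_empty() {
        return Vec::new();
    }
    let xs: Vec<(f64, f64)> = blocks.iter().map(|b| (b.x0, b.x1)).collect();
    if let Some(cut) = widest_gap(xs, min_gap) {
        let (left, right): (Vec<Rect>, Vec<Rect>) =
            blocks.into_iter().partition(|b| b.x1 <= cut);
        let mut regions = xy_cut(left, min_gap);
        regions.extend(xy_cut(right, min_gap));
        return regions;
    }
    let ys: Vec<(f64, f64)> = blocks.iter().map(|b| (b.y0, b.y1)).collect();
    if let Some(cut) = widest_gap(ys, min_gap) {
        // Assumes y grows downward, so the smaller-y side is read first.
        let (top, bottom): (Vec<Rect>, Vec<Rect>) =
            blocks.into_iter().partition(|b| b.y1 <= cut);
        let mut regions = xy_cut(top, min_gap);
        regions.extend(xy_cut(bottom, min_gap));
        return regions;
    }
    vec![blocks]
}
```

Trying the vertical cut first means a full-width title above two columns is handled naturally: the title blocks any vertical cut at the top level, so the page is first split horizontally, and the columns are separated in the next recursion step.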

§Planned for v1.0

  • ML Integration: Advanced layout analysis with ONNX models
  • Table Detection: Production-ready ML-based table extraction
  • OCR: Text extraction from scanned PDFs via Tesseract
  • WASM Target: Run in browsers via WebAssembly
  • Digital Signatures: Signature verification and creation

§Quick Start

```ignore
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;

// Extract text from first page
let text = doc.extract_text(0)?;
println!("{}", text);

// Convert to Markdown
let options = ConversionOptions::default();
let markdown = doc.to_markdown(0, &options)?;

// Extract images
let images = doc.extract_images(0)?;
```

§Python Usage

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
markdown = doc.to_markdown(0)
```

§License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
  • MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Re-exports§

pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::PdfConfig;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;

Modules§

annotations
PDF annotations support.
config
Configuration for PDF processing.
content
PDF content stream parsing and execution.
converters
Format converters for PDF documents.
decoders
Stream decoder implementations for PDF filters.
document
PDF document model.
encryption
PDF encryption support.
error
Error types for the PDF library.
extractors
Text and content extraction from PDF documents.
fonts
Font handling and encoding.
geometry
Geometric primitives for layout analysis.
hybrid
Hybrid classical + ML architecture.
images
Image extraction.
layout
Layout analysis algorithms for PDF documents.
lexer
PDF lexer (tokenizer).
object
PDF object types.
objstm
Object stream parsing (PDF 1.5+).
outline
PDF document outline (bookmarks) support.
parser
PDF object parser.
parser_config
Parser configuration options.
structure
PDF logical structure (Tagged PDF) support.
xref
Cross-reference table parser.
xref_reconstruction
Cross-reference table reconstruction for damaged PDFs.
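
As a rough illustration of what cross-reference reconstruction for damaged PDFs involves, the sketch below scans raw bytes for `N G obj` headers and records their offsets. This is an assumption about the general technique, not the code in `xref_reconstruction`; the function names `rebuild_xref`, `parse_header`, `read_uint`, and `skip_spaces` are hypothetical. A real implementation would also have to guard against false matches inside stream data.

```rust
use std::collections::BTreeMap;

/// Read a run of ASCII digits starting at `start`; returns (value, next index).
fn read_uint(data: &[u8], start: usize) -> Option<(u64, usize)> {
    let mut end = start;
    let mut value: u64 = 0;
    while end < data.len() && data[end].is_ascii_digit() {
        value = value
            .checked_mul(10)?
            .checked_add(u64::from(data[end] - b'0'))?;
        end += 1;
    }
    if end > start { Some((value, end)) } else { None }
}

/// Skip at least one whitespace byte; returns the index of the next non-space byte.
fn skip_spaces(data: &[u8], start: usize) -> Option<usize> {
    let mut end = start;
    while end < data.len() && data[end].is_ascii_whitespace() {
        end += 1;
    }
    if end > start { Some(end) } else { None }
}

/// Try to parse `<number> <generation> obj` at `start`.
/// Returns (object number, generation, index just past `obj`).
fn parse_header(data: &[u8], start: usize) -> Option<(u32, u16, usize)> {
    let (num, p) = read_uint(data, start)?;
    let p = skip_spaces(data, p)?;
    let (gen, p) = read_uint(data, p)?;
    let p = skip_spaces(data, p)?;
    if data[p..].starts_with(b"obj") {
        Some((u32::try_from(num).ok()?, u16::try_from(gen).ok()?, p + 3))
    } else {
        None
    }
}

/// Map (object number, generation) to the byte offset of its header.
/// A later definition of the same object overwrites an earlier one,
/// mirroring how incremental updates supersede older objects.
fn rebuild_xref(data: &[u8]) -> BTreeMap<(u32, u16), usize> {
    let mut table = BTreeMap::new();
    let mut pos = 0;
    while pos < data.len() {
        // A header can only start at the beginning of the data or after whitespace.
        let at_boundary = pos == 0 || data[pos - 1].is_ascii_whitespace();
        if at_boundary && data[pos].is_ascii_digit() {
            if let Some((num, gen, end)) = parse_header(data, pos) {
                table.insert((num, gen), pos);
                pos = end;
                continue;
            }
        }
        pos += 1;
    }
    table
}
```

The resulting map plays the role of the missing cross-reference table: given an object number and generation, it yields the byte offset at which a parser can start reading that object.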

Constants§

NAME
Library name
VERSION
Library version