pdf_oxide 0.3.24

# Getting Started with PDFOxide (Python)

PDFOxide is the complete PDF toolkit. One library for extracting, creating, and editing PDFs with a unified API. Built on a Rust core for maximum performance.

## Installation

```bash
pip install pdf_oxide
```

## Quick Start - The Unified `Pdf` API

The `Pdf` class is your main entry point for all PDF operations:

```python
from pdf_oxide import Pdf

# Create from Markdown
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
```

## Creating PDFs

### From Markdown

```python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("""
# Report Title

## Introduction

This is **bold** and *italic* text.

- Item 1
- Item 2
- Item 3

## Code Example

```python
print("Hello, World!")
```
""")
pdf.save("report.pdf")
```

### From HTML

```python
from pdf_oxide import Pdf

pdf = Pdf.from_html("""
<h1>Invoice</h1>
<p>Thank you for your purchase.</p>
<table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$10.00</td></tr>
</table>
""")
pdf.save("invoice.pdf")
```

### From Plain Text

```python
from pdf_oxide import Pdf

pdf = Pdf.from_text("Simple plain text document.\n\nWith paragraphs.")
pdf.save("notes.pdf")
```

### From Images

```python
from pdf_oxide import Pdf

# Single image
pdf = Pdf.from_image("photo.jpg")
pdf.save("photo.pdf")

# Multiple images (one per page)
album = Pdf.from_images(["page1.jpg", "page2.png", "page3.jpg"])
album.save("album.pdf")
```

## Opening and Reading PDFs

```python
from pdf_oxide import PdfDocument

# Open existing PDF (path can be str or pathlib.Path)
doc = PdfDocument("document.pdf")

# Or use as a context manager
with PdfDocument("document.pdf") as doc:
    text = doc.extract_text(0)
    print(f"Text: {text}")
    markdown = doc.to_markdown(0)
    print(f"Pages: {doc.page_count()}")

# Extract text from page 0
text = doc.extract_text(0)
print(f"Text: {text}")

# Convert to Markdown
markdown = doc.to_markdown(0)
print(f"Markdown:\n{markdown}")

# Get page count
print(f"Pages: {doc.page_count()}")
```

## Builder Pattern for Advanced Creation

For full control over PDF creation, use `PdfBuilder`:

```python
from pdf_oxide import PdfBuilder, PageSize

pdf = (PdfBuilder()
    .title("Annual Report 2025")
    .author("Company Inc.")
    .subject("Financial Summary")
    .page_size(PageSize.A4)
    .margins(72.0, 72.0, 72.0, 72.0)  # 1 inch margins
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n..."))

pdf.save("annual-report.pdf")
```

## Encryption and Security

### Password Protection

```python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Confidential Document")

# Simple password protection (AES-256)
pdf.save_encrypted("secure.pdf", "user-password", "owner-password")
```

## Text Extraction Options

### Basic Extraction

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
```

### With Options

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(
    detect_headings=True,
    detect_lists=True,
    embed_images=True
)
markdown = doc.to_markdown(0, options)
```

### Extract All Pages

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# Extract text from all pages
all_text = doc.extract_text_all()

# Convert entire document to Markdown
all_markdown = doc.to_markdown_all()
```

## Office Document Conversion

Convert DOCX, XLSX, and PPTX files to PDF:

```python
from pdf_oxide import OfficeConverter

# Auto-detect format
converter = OfficeConverter()

# Convert Word document
converter.convert("report.docx", "report.pdf")

# Convert Excel spreadsheet
converter.convert("data.xlsx", "data.pdf")

# Convert PowerPoint presentation
converter.convert("slides.pptx", "slides.pdf")
```

## Working with Images

### Extract Images from PDF

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
images = doc.extract_images(0)

for i, img in enumerate(images):
    img.save(f"image_{i}.png")
```

### Embed Images in Output

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(embed_images=True)

# Images embedded as base64 data URIs
html = doc.to_html(0, options)
```

## OCR - Extracting Text from Scanned PDFs

> For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the [OCR Guide](OCR_GUIDE.md).

### Setup

The OCR feature requires heavy machine learning dependencies (ONNX Runtime) and is optional.

```bash
# Recommended: Install with OCR support
pip install pdf_oxide[ocr]

# Or build from source with OCR
maturin develop --features python,ocr
```

> **Troubleshooting:** If you see `RuntimeError: OCR feature not enabled`, it means the library was installed without OCR support. Re-install using the `[ocr]` extra above.

**Quick start** — download the recommended models:

```bash
./scripts/setup_ocr_models.sh
```

#### Model Selection Guide

PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions.

| Combination | Detection | Recognition | English Accuracy | Total Size |
|---|---|---|---|---|
| **V4 det + V5 rec (recommended)** | ch_PP-OCRv4_det | en_PP-OCRv5_mobile_rec | Best | ~12.5 MB |
| V4 det + V4 rec | ch_PP-OCRv4_det | en_PP-OCRv4_rec | Good | ~12.4 MB |
| V5 det + V5 rec | PP-OCRv5_server_det | en_PP-OCRv5_mobile_rec | Good (different errors) | ~96 MB |
| V3 det + V3 rec | en_PP-OCRv3_det | en_PP-OCRv3_rec | Fair | ~11 MB |

The **V4 detection + V5 recognition** combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy.

**Manual download:**

```bash
# Recommended: V4 detection + V5 recognition
# Detection (4.7 MB):
curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx

# Recognition (7.8 MB):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx

# Dictionary (must include space as last entry):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt
echo " " >> .models/en_dict.txt
```

### Basic OCR Usage

```python
from pdf_oxide import PdfDocument, OcrEngine, OcrConfig

# Create OCR engine (default config works with recommended V4 det + V5 rec models)
engine = OcrEngine(
    det_model_path=".models/det.onnx",
    rec_model_path=".models/rec.onnx",
    dict_path=".models/en_dict.txt",
)

# Extract text using OCR
doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(page=0, engine=engine)
print(text)
```

### Processing Multiple Pages

```python
doc = PdfDocument("scanned.pdf")
for page in range(doc.page_count()):
    text = doc.extract_text_ocr(page=page, engine=engine)
    if text.strip():
        print(f"--- Page {page + 1} ---")
        print(text)
```

### Using PP-OCRv5 Detection

If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), pass `use_v5=True` to `OcrConfig`. This preserves the original image resolution instead of downscaling to 960px, which the larger v5 detection model needs:

```python
config = OcrConfig(use_v5=True)
engine = OcrEngine(
    det_model_path="v5_det.onnx",
    rec_model_path="v5_rec.onnx",
    dict_path="v5_dict.txt",
    config=config,
)
```

> **Note:** The `OcrEngine` is reusable — create it once and pass it to multiple `extract_text_ocr` calls. ONNX Runtime requires `libonnxruntime.so` (v1.23+) to be available at runtime (via `LD_LIBRARY_PATH` or system install).

## Structured Extraction

Beyond plain text, PDFOxide can extract structured content from pages:

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")

# 1. Scoped extraction from a specific area (v0.3.14)
# Area: (x, y, width, height) in points
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Text spans with font info, position, and style
spans = doc.extract_spans(0)
for span in spans:
    print(f"{span.text} — {span.font_name} {span.font_size}pt")

# 3. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"Word: {w.text} at {w.bbox}")
    # Access character metadata for the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points).
# Smaller values split more aggressively; useful for dense forms.
words = doc.extract_words(0, word_gap_threshold=2.5)

# 4. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points).
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# 5. Inspect computed layout params before overriding
params = doc.page_layout_params(0)
print(f"Adaptive word gap: {params.word_gap_threshold:.1f}pt")
print(f"Adaptive line gap: {params.line_gap_threshold:.1f}pt")

# 6. Pre-tuned extraction profiles for different document types
from pdf_oxide import ExtractionProfile
profile = ExtractionProfile.form()
print(f"Profile: {profile.name}, word_margin_ratio={profile.word_margin_ratio}")

# Pass a profile to extraction methods to control how raw text is parsed
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Combine profile with threshold overrides (profile controls span parsing,
# thresholds control word/line clustering)
words = doc.extract_words(0, word_gap_threshold=1.5, profile=ExtractionProfile.aggressive())

# 7. Image metadata
images = doc.extract_images(0)
for img in images:
    print(f"{img['width']}x{img['height']} {img['color_space']}")

# 8. Bookmarks / table of contents
outline = doc.get_outline()  # None if no outline
if outline:
    for item in outline:
        print(f"{item['title']} -> page {item.get('page')}")

# 9. Vector paths (lines, curves, shapes)
paths = doc.extract_paths(0)
for path in paths:
    print(f"bbox={path['bbox']}, stroke={path.get('stroke_color')}")
```

## Working with Form Fields

PDFOxide can extract, read, fill, and export PDF form field data (AcroForm fields).

### List All Form Fields

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
fields = doc.get_form_fields()

for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
```

Each `FormField` has:
- `name` — fully qualified field name (e.g. `"topmostSubform[0].Page1[0].f1_01[0]"`)
- `field_type` — `"text"`, `"button"`, `"choice"`, `"signature"`
- `value` — current value (`str`, `bool`, or `None`)
- `flags` — field flags (read-only, required, etc.)

### Read and Set Field Values

```python
doc = PdfDocument("w2.pdf")

# Read a field value
ssn = doc.get_form_field_value("topmostSubform[0].CopyA[0].f1_01[0]")
print(f"SSN: {ssn}")

# Fill fields
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.set_form_field_value("retirement_plan", True)  # checkbox

# Save (values are persisted via incremental save)
doc.save("filled_w2.pdf")
```

### Extract Text with Form Field Values

Filled form field values appear inline in `extract_text` and `to_markdown`:

```python
doc = PdfDocument("filled_w2.pdf")

# Form values appear inline in extracted text
text = doc.extract_text(0)
print(text)  # "Jane Doe" appears where the name field is

# to_markdown includes form fields by default
md = doc.to_markdown(0, include_form_fields=True)

# Exclude form field values
md_clean = doc.to_markdown(0, include_form_fields=False)
```

### Export Form Data

```python
doc = PdfDocument("filled-form.pdf")

# Export as FDF
doc.export_form_data("form_data.fdf")

# Export as XFDF
doc.export_form_data("form_data.xfdf", format="xfdf")
```

## Performance Tips

1. **Reuse document objects** - Opening a PDF has overhead, reuse the object for multiple operations
2. **Use specific page extraction** - `extract_text(page_num)` is faster than `extract_text_all()` if you only need some pages
3. **Disable features you don't need** - Use `ConversionOptions` to skip heading detection, image extraction, etc.

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("large.pdf")

# Fast extraction - minimal processing
options = ConversionOptions(
    detect_headings=False,
    detect_lists=False,
    embed_images=False
)
text = doc.to_markdown(0, options)
```

## Error Handling

```python
from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
```

## Examples

See the [examples/](../examples/) directory for complete working examples.

### Quick Script Examples

**Extract text from all PDFs in a folder:**

```python
from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents").glob("*.pdf"):
    # PdfDocument accepts pathlib.Path directly
    with PdfDocument(pdf_path) as doc:
        text = doc.extract_text_all()

    output_path = pdf_path.with_suffix(".txt")
    output_path.write_text(text)
    print(f"Extracted: {pdf_path.name}")
```

**Batch convert Markdown to PDF:**

```python
from pdf_oxide import Pdf
from pathlib import Path

for md_path in Path("notes").glob("*.md"):
    content = md_path.read_text()
    pdf = Pdf.from_markdown(content)

    output_path = md_path.with_suffix(".pdf")
    pdf.save(str(output_path))
    print(f"Created: {output_path.name}")
```

## Next Steps

- [API Reference](https://docs.rs/pdf_oxide) - Full API documentation
- [PDF Creation Guide](PDF_CREATION_GUIDE.md) - Advanced creation options
- [GitHub Issues](https://github.com/yfedoseev/pdf_oxide/issues) - Report bugs or request features