pdf_oxide_cli 0.3.15

CLI for pdf-oxide — the fastest PDF toolkit. 22 commands: text extraction, PDF to markdown, search, merge, split, images, compress, encrypt, watermark, forms, and more.

pdf-oxide — The Fastest PDF CLI Toolkit

A command-line tool for PDF text extraction, Markdown conversion, search, merge, split, image extraction, and more. Built on pdf_oxide, the fastest Rust PDF library (0.8ms mean, 100% pass rate on 3,830 PDFs). Licensed under MIT OR Apache-2.0.


Install

brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall
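
After installing, it can help to confirm that the binary is reachable. The following is a minimal sketch, assuming the installed executable is named pdf-oxide (as in the Quick Start commands); check_installed is a hypothetical helper name, not part of the tool.

```shell
#!/bin/sh
# Sketch: report whether a binary is on PATH.
# check_installed is a hypothetical helper for this example.

check_installed() {
    bin=${1:-pdf-oxide}
    if command -v "$bin" >/dev/null 2>&1; then
        echo "$bin found at $(command -v "$bin")"
    else
        echo "$bin not found on PATH" >&2
        return 1
    fi
}
```

Call it as `check_installed` with no arguments. If a Cargo install is not found, the usual cause is that Cargo's bin directory (by default ~/.cargo/bin) is missing from PATH.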

Quick Start

pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide search report.pdf "neural.?network"  # Regex search
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages

All Commands

Command     Description
text        Extract text from PDF pages
markdown    Convert PDF to Markdown with headings, lists, and layout
html        Convert PDF to HTML
search      Search PDF content with regex patterns
images      Extract images to files (PNG, JPEG, etc.)
info        Show PDF metadata, page count, and version
metadata    Read and write PDF metadata fields
merge       Combine multiple PDFs into one
split       Split PDF into individual pages
compress    Reduce PDF file size
encrypt     Password-protect a PDF
decrypt     Remove password from a PDF
rotate      Rotate pages by 90, 180, or 270 degrees
crop        Set page crop box dimensions
delete      Remove specific pages
reorder     Rearrange page order
watermark   Add text watermark to pages
flatten     Flatten form fields and annotations
forms       Read and fill PDF form fields
bookmarks   Extract document bookmarks/outline
create      Create new PDF documents programmatically

Features

  • 22 commands for complete PDF processing from the terminal
  • Fast — powered by pdf_oxide, 5x faster than PyMuPDF
  • PDF to Markdown — headings, bullet lists, column-aware reading order
  • Regex search — full regex pattern matching across pages
  • Image extraction — extracts images from content streams, form XObjects, and inline images
  • Form filling — read and write PDF form fields from the command line
  • Page range support — use --pages 1-5,10 on any command
  • JSON output — add --json for machine-readable results
  • Interactive REPL — run pdf-oxide with no arguments for interactive mode
  • Encrypted PDFs — supply --password to open protected files
  • Cross-platform — Linux, macOS, and Windows
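
The --pages and --password options listed above can be combined. The sketch below assumes both are accepted by the text command in this form. extract_protected_text, PDF_PASSWORD, and PDF_OXIDE_BIN are hypothetical names introduced for illustration only.

```shell
#!/bin/sh
# Sketch: extract a page range from a password-protected PDF.
# PDF_PASSWORD and PDF_OXIDE_BIN are hypothetical variables for this example.

extract_protected_text() {
    pdf=$1
    pages=$2
    # Read the password from the environment so it is not typed inline.
    if [ -z "${PDF_PASSWORD:-}" ]; then
        echo "set PDF_PASSWORD before calling" >&2
        return 1
    fi
    "${PDF_OXIDE_BIN:-pdf-oxide}" text "$pdf" --pages "$pages" --password "$PDF_PASSWORD"
}
```

Usage: `PDF_PASSWORD=... extract_protected_text report.pdf 1-5,10`. The password is still passed to the process as an argument, so it may be visible in process listings on shared machines.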

Usage Examples

Extract text from specific pages

pdf-oxide text paper.pdf --pages 1-5
pdf-oxide text paper.pdf --pages 1,3,7-10

Convert to Markdown for LLM/RAG pipelines

pdf-oxide markdown paper.pdf -o paper.md
pdf-oxide markdown paper.pdf --pages 1 --detect-headings
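
For a RAG pipeline you usually need a whole directory converted rather than one file. A minimal sketch of a batch loop follows, using only the markdown command and -o option shown above. convert_dir_to_markdown and PDF_OXIDE_BIN are hypothetical names for this example.

```shell
#!/bin/sh
# Sketch: convert every PDF in a directory to Markdown.
# PDF_OXIDE_BIN is a hypothetical override (defaults to pdf-oxide).

convert_dir_to_markdown() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for pdf in "$src_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        name=$(basename "$pdf" .pdf)
        "${PDF_OXIDE_BIN:-pdf-oxide}" markdown "$pdf" -o "$out_dir/$name.md" \
            || echo "non-zero exit for $pdf" >&2
    done
}
```

Usage: `convert_dir_to_markdown ./papers ./corpus`. A failure on one file is reported on stderr and the loop continues, which tends to suit large mixed-quality collections.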

Search across a PDF

pdf-oxide search contract.pdf "termination|cancellation"
pdf-oxide search paper.pdf "equation \d+" --json
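
Regex patterns often contain characters the shell treats specially (|, \, $, *), so single quotes are the safest way to pass them. The sketch below runs one pattern over several files and saves each --json result to a file. It makes no assumption about the JSON schema; search_to_json and PDF_OXIDE_BIN are hypothetical names for this example.

```shell
#!/bin/sh
# Sketch: run one regex over several PDFs and save each JSON result
# next to its source file.
# PDF_OXIDE_BIN is a hypothetical override (defaults to pdf-oxide).

search_to_json() {
    pattern=$1
    shift
    for pdf in "$@"; do
        out="${pdf%.pdf}.matches.json"
        "${PDF_OXIDE_BIN:-pdf-oxide}" search "$pdf" "$pattern" --json > "$out" \
            || echo "non-zero exit for $pdf" >&2
    done
}
```

Usage: `search_to_json 'termination|cancellation' contract.pdf lease.pdf` writes contract.matches.json and lease.matches.json.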

Merge and split

pdf-oxide merge chapter1.pdf chapter2.pdf chapter3.pdf -o book.pdf
pdf-oxide split book.pdf -o ./chapters/

Work with forms

pdf-oxide forms w2.pdf                              # List fields
pdf-oxide forms w2.pdf --fill "employee_name=Jane"   # Fill fields

Extract images

pdf-oxide images paper.pdf -o ./figures/ --pages 1-10

Performance

pdf_oxide processes PDFs at 0.8ms mean per document — 5x faster than PyMuPDF, 15x faster than pypdf. Text extraction, markdown conversion, and all operations share the same high-performance Rust core.

License

MIT OR Apache-2.0