unpdf-cli 0.4.3

CLI tool for extracting PDF content to Markdown, text, and JSON

unpdf


A high-performance Rust library for extracting content from PDF documents to structured Markdown, plain text, and JSON.

Features

  • Comprehensive PDF support: PDF 1.0-2.0, including compressed object streams
  • Encrypted PDF support: RC4 and AES-128 decryption (auto-tries empty password)
  • Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
  • Structure preservation: Headings, paragraphs, lists, tables, inline formatting
  • CJK text support: Smart spacing for Korean, Chinese, Japanese with Adobe CMap resources
  • RTL text support: Arabic and Hebrew with Unicode BiDi reordering
  • Form field extraction: AcroForm fields (text, checkbox, radio, dropdown) with values
  • Multi-column layout: Recursive XY-Cut algorithm for N-column detection
  • Asset extraction: Images, fonts, and embedded resources
  • Extraction quality diagnostics: Automatic detection and reporting of extraction issues
  • Text cleanup: Multiple presets for LLM training data preparation
  • Self-update: Built-in update mechanism via GitHub releases
  • C-ABI FFI: Native library for C#, Python, and other languages
  • Parallel processing: Uses Rayon for multi-page documents
  • Streaming pipeline (0.4.0+): PdfParser::for_each_page yields pages as they parse; peak memory bounded by window size regardless of document size
  • Deterministic page ordering: Parallel page parsing emits results in page_num ASC order via internal reorder buffer
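
The deterministic ordering in the last bullet can be pictured as a small reorder buffer: pages finished early by parallel workers are held until every lower-numbered page has been emitted. The sketch below is an illustration only, not unpdf's internal code; `ReorderBuffer`, `push`, and `buffered` are hypothetical names, and pages are assumed to be numbered consecutively.

```rust
use std::collections::BTreeMap;

/// Hypothetical reorder buffer: accepts pages in any order and releases
/// them strictly in ascending page-number order.
pub struct ReorderBuffer<T> {
    next: u32,                 // next page number allowed to be emitted
    pending: BTreeMap<u32, T>, // pages that arrived ahead of their turn
}

impl<T> ReorderBuffer<T> {
    /// `first_page` is the page number expected first (1 for 1-based pages).
    pub fn new(first_page: u32) -> Self {
        Self { next: first_page, pending: BTreeMap::new() }
    }

    /// Store one finished page and return every page that is now ready,
    /// in ascending order. Returns an empty Vec while a gap remains.
    pub fn push(&mut self, page_num: u32, page: T) -> Vec<(u32, T)> {
        self.pending.insert(page_num, page);
        let mut ready = Vec::new();
        while let Some(page) = self.pending.remove(&self.next) {
            ready.push((self.next, page));
            self.next += 1;
        }
        ready
    }

    /// Number of pages currently held back.
    pub fn buffered(&self) -> usize {
        self.pending.len()
    }
}
```

This sketch places no cap on `buffered()`; a pipeline that bounds peak memory by a window size would additionally stop handing out new pages to workers once the buffer reaches that window.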

Installation

Pre-built Binaries (Recommended)

Download the latest release from GitHub Releases.

Windows (x64)

# Download and extract
Invoke-WebRequest -Uri "https://github.com/iyulab/unpdf/releases/latest/download/unpdf-cli-x86_64-pc-windows-msvc.zip" -OutFile "unpdf.zip"
Expand-Archive -Path "unpdf.zip" -DestinationPath "."

# Move to a directory in PATH (optional)
Move-Item -Path "unpdf.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"

# Verify installation
unpdf version

Linux (x64)

# Download and extract
curl -LO https://github.com/iyulab/unpdf/releases/latest/download/unpdf-cli-x86_64-unknown-linux-gnu.tar.gz
tar -xzf unpdf-cli-x86_64-unknown-linux-gnu.tar.gz

# Install to /usr/local/bin (requires sudo)
sudo mv unpdf /usr/local/bin/

# Or install to user directory
mkdir -p ~/.local/bin
mv unpdf ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Verify installation
unpdf version

macOS

# Intel Mac
curl -LO https://github.com/iyulab/unpdf/releases/latest/download/unpdf-cli-x86_64-apple-darwin.tar.gz
tar -xzf unpdf-cli-x86_64-apple-darwin.tar.gz

# Apple Silicon (M1/M2/M3/M4)
curl -LO https://github.com/iyulab/unpdf/releases/latest/download/unpdf-cli-aarch64-apple-darwin.tar.gz
tar -xzf unpdf-cli-aarch64-apple-darwin.tar.gz

# Install
sudo mv unpdf /usr/local/bin/

# Verify
unpdf version

Available Binaries

Platform  Architecture   File
Windows   x64            unpdf-cli-x86_64-pc-windows-msvc.zip
Linux     x64            unpdf-cli-x86_64-unknown-linux-gnu.tar.gz
macOS     Intel          unpdf-cli-x86_64-apple-darwin.tar.gz
macOS     Apple Silicon  unpdf-cli-aarch64-apple-darwin.tar.gz

Updating

unpdf includes a built-in self-update mechanism:

# Check for updates
unpdf update --check

# Update to latest version
unpdf update

# Force reinstall (even if on latest)
unpdf update --force

Install via Cargo

If you have Rust installed:

# Install CLI
cargo install unpdf-cli

# Add library to your project
cargo add unpdf

CLI Usage

Quick Start

# Extract all formats (Markdown, text, JSON) + images to output directory
unpdf document.pdf

# Specify output directory
unpdf document.pdf ./output

# With text cleanup for LLM training
unpdf document.pdf --cleanup aggressive

Output Structure

document_output/
├── extract.md      # Markdown output with frontmatter
├── extract.txt     # Plain text output
├── content.json    # Full structured JSON
└── images/         # Extracted images
    ├── page1_img1.png
    └── page2_img1.jpg

Commands

unpdf <file> [output]              # Extract all formats (default)
unpdf convert <file> [OPTIONS]     # Same as above, explicit command
unpdf markdown <file> [OPTIONS]    # Convert to Markdown only (alias: md)
unpdf text <file> [OPTIONS]        # Convert to plain text only
unpdf json <file> [OPTIONS]        # Convert to JSON only
unpdf info <file>                  # Show document information
unpdf extract <file> [OPTIONS]     # Extract images only
unpdf update [OPTIONS]             # Self-update to latest version
unpdf version                      # Show version information

Convert to Markdown

# Basic conversion (output to stdout)
unpdf markdown document.pdf

# Save to file
unpdf markdown document.pdf -o output.md

# With YAML frontmatter
unpdf markdown document.pdf --frontmatter -o output.md

# With text cleanup for LLM training
unpdf markdown document.pdf --cleanup standard -o cleaned.md

# Table rendering options
unpdf markdown document.pdf --table-mode html -o output.md

# Specify page range
unpdf markdown document.pdf --pages 1-10 -o output.md

Markdown Options

Option             Description                                  Default
-o, --output       Output file path                             stdout
-f, --frontmatter  Include YAML frontmatter                     false
--table-mode       Table rendering: markdown, html, ascii       markdown
--cleanup          Text cleanup: minimal, standard, aggressive  none
--max-heading      Maximum heading level (1-6)                  6
--pages            Page range (e.g., 1-10, 1,3,5)               all
-q, --quiet        Suppress quality warnings                    false
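
The `--pages` values shown above (`1-10`, `1,3,5`) follow a common "ranges and lists" syntax. As a rough sketch of how such a spec can be expanded, and not a copy of unpdf's parser, the hypothetical `parse_pages` function below accepts comma-separated entries, each either a single page or an inclusive `start-end` range. It assumes pages are 1-based and that descending ranges are errors; unpdf's actual rules may differ.

```rust
/// Parse a page spec such as "1-10" or "1,3,5" into sorted, de-duplicated
/// 1-based page numbers. Returns an error message for malformed input.
pub fn parse_pages(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err(format!("empty entry in page spec {spec:?}"));
        }
        // "a-b" is an inclusive range; anything else is a single page.
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range {part:?}"));
        }
        pages.extend(start..=end);
    }
    pages.sort_unstable();
    pages.dedup();
    Ok(pages)
}

/// Parse one page number, rejecting 0 and non-numeric text.
fn parse_page(text: &str) -> Result<u32, String> {
    match text.trim().parse::<u32>() {
        Ok(0) | Err(_) => Err(format!("invalid page number {text:?}")),
        Ok(n) => Ok(n),
    }
}
```

Expanding into a `Vec` keeps the sketch short; for untrusted input, storing the ranges themselves would avoid allocating millions of entries for a spec like `1-4000000000`.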

Convert to Plain Text

# Basic extraction
unpdf text document.pdf

# With cleanup
unpdf text document.pdf --cleanup standard -o output.txt

# Specific pages
unpdf text document.pdf --pages 1-5 -o output.txt

Convert to JSON

# Pretty-printed JSON
unpdf json document.pdf -o output.json

# Compact JSON
unpdf json document.pdf --compact -o output.json

Show Document Information

unpdf info document.pdf

Output:

Document Information
────────────────────────────────────────
File: document.pdf
Format: PDF 1.7
Pages: 42
Encrypted: No
Title: My Document
Author: John Doe
Creator: Microsoft Word
Producer: Adobe PDF Library
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z

Content Statistics
────────────────────────────────────────
Words: 12500
Characters: 75000
Images: 15

Extract Images

# Extract to current directory
unpdf extract document.pdf

# Extract to specific directory
unpdf extract document.pdf -o ./images

# Extract specific pages
unpdf extract document.pdf --pages 1-5 -o ./images

Self-Update

# Check for updates
unpdf update --check

# Update to latest version
unpdf update

# Force reinstall
unpdf update --force

Examples

# Convert PDF to Markdown with frontmatter
unpdf md report.pdf --frontmatter -o report.md

# Extract text from a scanned PDF (relies on OCR, which is listed as "Planned" under Supported PDF Features)
unpdf text scanned.pdf --ocr -o output.txt

# Convert with aggressive cleanup for AI training
unpdf md document.pdf --cleanup aggressive -o cleaned.md

# Batch conversion (shell)
for f in *.pdf; do unpdf md "$f" -o "${f%.pdf}.md"; done

# Batch conversion (PowerShell)
Get-ChildItem *.pdf | ForEach-Object { unpdf md $_.FullName -o "$($_.BaseName).md" }
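
The batch loops above can also be driven from Rust by spawning the CLI once per file. The following is a minimal sketch under a few assumptions: the `unpdf` binary is on `PATH`, only the `md` subcommand and `-o` flag shown in this README are used, and `markdown_path`, `unpdf_md_command`, and `convert_dir` are hypothetical helper names.

```rust
use std::path::{Path, PathBuf};
use std::process::Command;

/// Output path for a converted file: the same path with an `.md` extension.
pub fn markdown_path(input: &Path) -> PathBuf {
    input.with_extension("md")
}

/// Build (but do not run) `unpdf md <input> -o <output>`.
pub fn unpdf_md_command(input: &Path) -> Command {
    let mut cmd = Command::new("unpdf");
    cmd.arg("md").arg(input).arg("-o").arg(markdown_path(input));
    cmd
}

/// Convert every `*.pdf` directly inside `dir` (no recursion) and return
/// the inputs for which the CLI exited with a non-zero status.
pub fn convert_dir(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut failed = Vec::new();
    for entry in std::fs::read_dir(dir)? {
        let path = entry?.path();
        let is_pdf = path
            .extension()
            .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"));
        if !is_pdf {
            continue;
        }
        // Failing to spawn the binary at all propagates as an io::Error.
        if !unpdf_md_command(&path).status()?.success() {
            failed.push(path);
        }
    }
    Ok(failed)
}
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before anything is run.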

Rust Library Usage

Quick Start

use unpdf::{parse_file, render};

fn main() -> unpdf::Result<()> {
    // Parse PDF document
    let doc = parse_file("document.pdf")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    // Get plain text
    let text = render::to_text(&doc, &options)?;

    // Get JSON
    let json = render::to_json(&doc, render::JsonFormat::Pretty)?;

    Ok(())
}

Render Options

use unpdf::render::{self, RenderOptions, CleanupPreset, TableFallback};

let options = RenderOptions::new()
    .with_frontmatter(true)
    .with_table_fallback(TableFallback::Html)
    .with_cleanup_preset(CleanupPreset::Aggressive)
    .with_max_heading(3)
    .with_page_range(1..=10);

let markdown = render::to_markdown(&doc, &options)?;

Working with Document Structure

use unpdf::parse_file;

let doc = parse_file("document.pdf")?;

// Access metadata
println!("Title: {:?}", doc.metadata.title);
println!("Author: {:?}", doc.metadata.author);
println!("Pages: {}", doc.page_count());
println!("PDF Version: {}", doc.metadata.pdf_version);

// Iterate pages
for (page_num, page) in doc.pages.iter().enumerate() {
    println!("Page {}: {} blocks", page_num + 1, page.elements.len());
    for element in &page.elements {
        // Process paragraphs, tables, images, etc.
    }
}

// Extract images
for (id, resource) in &doc.resources {
    if resource.is_image() {
        let filename = resource.suggested_filename(id);
        std::fs::write(&filename, &resource.data)?;
    }
}

Page Range Selection

use unpdf::render::{self, RenderOptions};
use unpdf::{parse_file, parse_file_with_options, PageSelection, ParseOptions};

// Parse only specific pages
let doc = parse_file_with_options("large.pdf", ParseOptions {
    pages: PageSelection::Range(1..=10),
    ..Default::default()
})?;

// Or parse all and render specific pages
let doc = parse_file("document.pdf")?;
let options = RenderOptions::new()
    .with_pages(vec![1, 3, 5, 7]);
let markdown = render::to_markdown(&doc, &options)?;

Handling Encrypted PDFs

unpdf automatically decrypts PDFs that have an empty user password (owner-password-only protection). For PDFs protected by a non-empty user password, supply the password:

use unpdf::{parse_file, parse_file_with_options, ParseOptions};

// Auto-decrypts owner-password-only PDFs
let doc = parse_file("restricted.pdf")?;

// Provide a password for user-password-protected PDFs
let options = ParseOptions::new().with_password("secret");
let doc = parse_file_with_options("protected.pdf", options)?;

// Check extraction quality
if let Some(warning) = doc.extraction_quality.warning_message() {
    eprintln!("{}", warning);
}

Working with Form Fields

use unpdf::parse_file;

let doc = parse_file("form.pdf")?;
for field in &doc.form_fields {
    println!("{}: {}", field.name, field.display_value());
}

Python Integration

Install the Python package:

pip install unpdf-markdown

Basic Usage

from unpdf import to_markdown, to_text, to_json, get_info

# Convert PDF to Markdown
markdown = to_markdown("document.pdf")

# Convert to plain text
text = to_text("document.pdf")

# Convert to JSON
json_data = to_json("document.pdf")

# Get document information
info = get_info("document.pdf")
print(f"Title: {info.get('title')}")
print(f"Pages: {info.get('page_count')}")

Working with Bytes

from unpdf import to_markdown

# Read PDF from file
with open("document.pdf", "rb") as f:
    pdf_bytes = f.read()

# Convert from bytes
markdown = to_markdown(pdf_bytes)

Check PDF Validity

from unpdf import is_pdf, to_markdown

# Check if file is a valid PDF
if is_pdf("document.pdf"):
    markdown = to_markdown("document.pdf")

C# / .NET Integration

unpdf provides C-ABI compatible bindings for integration with C# and .NET applications.

Installation via NuGet

dotnet add package Unpdf

Or via Package Manager Console:

Install-Package Unpdf

Getting the Native Library (Manual)

Alternatively, download from GitHub Releases:

Platform     Library File
Windows x64  unpdf.dll
Linux x64    libunpdf.so
macOS        libunpdf.dylib

Or build from source:

cargo build --release --features ffi

C# Wrapper Usage

using Unpdf;

// Convert PDF to Markdown
string markdown = Pdf.ToMarkdown("document.pdf");

// Convert PDF to plain text
string text = Pdf.ToText("document.pdf");

// Convert PDF to JSON
string json = Pdf.ToJson("document.pdf", pretty: true);

// Get document information
var info = Pdf.GetInfo("document.pdf");
Console.WriteLine($"Title: {info.Title}, Pages: {info.PageCount}");

// Convert with options
var options = new PdfOptions
{
    IncludeFrontmatter = true,
    ExtractImages = true,
    ImageOutputDir = "./images",
    Lenient = true
};
string markdownWithImages = Pdf.ToMarkdown("document.pdf", options);

// Extract images only
var images = Pdf.ExtractImages("document.pdf", "./output/images");
foreach (var img in images)
{
    Console.WriteLine($"{img.Filename}: {img.Width}x{img.Height}");
}

ASP.NET Core Example

[ApiController]
[Route("api/[controller]")]
public class PdfController : ControllerBase
{
    [HttpPost("convert")]
    public async Task<IActionResult> ConvertPdf(IFormFile file)
    {
        if (file == null) return BadRequest("No file");

        // Save to temp file for processing
        var tempPath = Path.GetTempFileName();
        try
        {
            using (var stream = System.IO.File.Create(tempPath))
            {
                await file.CopyToAsync(stream);
            }

            var options = new PdfOptions { IncludeFrontmatter = true };
            var markdown = Pdf.ToMarkdown(tempPath, options);
            return Ok(new { markdown });
        }
        catch (UnpdfException ex)
        {
            return BadRequest(new { error = ex.Message });
        }
        finally
        {
            if (System.IO.File.Exists(tempPath))
                System.IO.File.Delete(tempPath);
        }
    }

    [HttpPost("extract-images")]
    public async Task<IActionResult> ExtractImages(IFormFile file)
    {
        if (file == null) return BadRequest("No file");

        var tempPath = Path.GetTempFileName();
        var outputDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
        try
        {
            using (var stream = System.IO.File.Create(tempPath))
            {
                await file.CopyToAsync(stream);
            }

            var images = Pdf.ExtractImages(tempPath, outputDir);
            return Ok(new { count = images.Count, images });
        }
        catch (UnpdfException ex)
        {
            return BadRequest(new { error = ex.Message });
        }
        finally
        {
            if (System.IO.File.Exists(tempPath))
                System.IO.File.Delete(tempPath);
        }
    }
}

Output Formats

Markdown

Structured Markdown with preserved formatting:

  • Headings: Document headings (detected from font size/style) -> #, ##, ###
  • Paragraphs: Text blocks with proper spacing
  • Lists: Detected ordered and unordered lists
  • Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
  • Inline styles: Bold (**), italic (*)
  • Hyperlinks: Preserved as Markdown links
  • Images: Reference-style image links
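
How font size might map to `#`, `##`, `###` can be sketched as a rank lookup: the largest heading size becomes level 1, the next level 2, and so on, capped by a maximum level such as the one set with `--max-heading`. unpdf's actual heuristic is not described here, so everything below is an assumption: the `heading_prefix` name, the 0.25 pt tolerance, and the choice to clamp deeper levels rather than demote them to body text.

```rust
/// Map a text run's font size to a Markdown heading prefix.
///
/// `heading_sizes` lists the distinct font sizes treated as headings,
/// largest first. Returns `None` when the size matches none of them,
/// i.e. the run is treated as body text.
pub fn heading_prefix(
    font_size: f32,
    heading_sizes: &[f32],
    max_heading: usize,
) -> Option<String> {
    // Rank 0 is the largest heading size; allow a small tolerance because
    // extracted font sizes are rarely exact.
    let rank = heading_sizes
        .iter()
        .position(|&size| (size - font_size).abs() < 0.25)?;
    // Markdown supports levels 1 through 6; clamp anything deeper.
    let level = (rank + 1).min(max_heading.clamp(1, 6));
    Some("#".repeat(level))
}
```

With heading sizes of 24, 18, and 14 pt, a 14 pt run would become `###`, or `##` if the maximum level is 2.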

Plain Text

Pure text content without formatting markers.

JSON

Complete document structure with metadata:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "creator": "Application Name",
    "producer": "PDF Library",
    "pdf_version": "1.7",
    "page_count": 42,
    "created": "2025-01-15T10:30:00Z",
    "modified": "2025-01-20T14:45:00Z"
  },
  "pages": [...],
  "resources": [...]
}

Supported PDF Features

Feature                                 Status
PDF 1.0 - 2.0                           Supported
Compressed object streams (ObjStm)      Supported
Cross-reference streams (XRef streams)  Supported
Linearized PDFs                         Supported
Encrypted PDFs (RC4, AES-128)           Supported
Text extraction                         Supported
CJK text (Korean, Chinese, Japanese)    Supported (Adobe CMap)
RTL text (Arabic, Hebrew)               Supported (BiDi)
CIDFont / ToUnicode CMap decoding       Supported
Embedded TrueType font decoding         Supported
Multi-column layout detection           Supported (XY-Cut)
Table detection                         Supported
Form fields (AcroForms)                 Supported
Image extraction (JPEG, JP2)            Supported
Bookmarks/Outlines                      Supported
Extraction quality diagnostics          Supported
AES-256 encryption (R5-R6)              Not yet supported
Digital signatures                      Metadata only
OCR (image-based PDFs)                  Planned
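
For readers unfamiliar with the XY-Cut technique named in the multi-column row, the idea is to project text bounding boxes onto the x and y axes, split the page at the widest empty gap, and recurse into each half. The code below is a generic textbook-style sketch, not unpdf's implementation. `Rect`, `widest_gap`, `xy_cut`, and the `min_gap` threshold are hypothetical, and the coordinates assume a top-left origin with y growing downward (native PDF coordinates grow upward).

```rust
/// Axis-aligned bounding box of a text fragment (assumes x0 <= x1, y0 <= y1).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Rect { pub x0: f64, pub y0: f64, pub x1: f64, pub y1: f64 }

/// Find the widest empty gap along one axis. `span` maps a box to its
/// (start, end) on that axis. Returns (gap_width, cut_position).
fn widest_gap(boxes: &[Rect], span: fn(&Rect) -> (f64, f64)) -> Option<(f64, f64)> {
    let mut spans: Vec<(f64, f64)> = boxes.iter().map(span).collect();
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f64, f64)> = None;
    let mut covered_to = spans.first()?.1; // right edge of the merged projection so far
    for &(start, end) in &spans[1..] {
        let gap = start - covered_to;
        if gap > 0.0 && best.map_or(true, |(g, _)| gap > g) {
            best = Some((gap, covered_to + gap / 2.0));
        }
        covered_to = covered_to.max(end);
    }
    best
}

/// Recursive XY-cut: split at the widest gap of at least `min_gap`
/// (columns left-to-right, rows top-to-bottom) and return the leaf
/// blocks in reading order.
pub fn xy_cut(mut boxes: Vec<Rect>, min_gap: f64) -> Vec<Vec<Rect>> {
    if boxes.is_empty() {
        return Vec::new();
    }
    let x_gap = widest_gap(&boxes, |r: &Rect| (r.x0, r.x1)).filter(|g| g.0 >= min_gap);
    let y_gap = widest_gap(&boxes, |r: &Rect| (r.y0, r.y1)).filter(|g| g.0 >= min_gap);
    let (vertical, cut) = match (x_gap, y_gap) {
        (Some(x), Some(y)) if y.0 > x.0 => (false, y.1),
        (Some(x), _) => (true, x.1),
        (None, Some(y)) => (false, y.1),
        (None, None) => {
            // No usable gap: this is a leaf block; order it top-to-bottom.
            boxes.sort_by(|a, b| a.y0.total_cmp(&b.y0).then(a.x0.total_cmp(&b.x0)));
            return vec![boxes];
        }
    };
    let (first, second): (Vec<Rect>, Vec<Rect>) = boxes
        .into_iter()
        .partition(|r| if vertical { r.x1 <= cut } else { r.y1 <= cut });
    let mut out = xy_cut(first, min_gap);
    out.extend(xy_cut(second, min_gap));
    out
}
```

On a page with a full-width header above two columns, the first cut is horizontal (below the header) and the second is vertical (between the columns), so the output order is header, left column, right column. Because each cut is followed by recursion, the same routine handles any number of columns.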

Feature Flags

Feature  Description                       Default
default  Core PDF parsing and rendering    Yes
ffi      C-ABI foreign function interface  No
async    Async I/O with Tokio              No

# Cargo.toml - enable features
[dependencies]
unpdf = { version = "0.4", features = ["ffi", "async"] }

Performance

  • Custom PDF parser with no external C library dependencies
  • Parallel page processing with Rayon
  • Bounded peak memory on large documents: the streaming pipeline (PdfParser::for_each_page, 0.4.0+) holds only a window of pages at a time

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

  • unhwp - Korean HWP document extraction
  • undoc - Microsoft Office document extraction (DOCX, XLSX, PPTX)