unpdf
A high-performance Rust library for extracting content from PDF documents to structured Markdown, plain text, and JSON.
Features
- Comprehensive PDF support: PDF 1.0-2.0, including compressed object streams
- Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
- Structure preservation: Headings, paragraphs, lists, tables, inline formatting
- CJK text support: Smart spacing for Korean, Chinese, Japanese content
- Asset extraction: Images, fonts, and embedded resources
- Text cleanup: Multiple presets for LLM training data preparation
- Self-update: Built-in update mechanism via GitHub releases
- C-ABI FFI: Native library for C#, Python, and other languages
- Parallel processing: Uses Rayon for multi-page documents
Table of Contents
- Installation
- CLI Usage
- Rust Library Usage
- Python Integration
- C# / .NET Integration
- Output Formats
- Feature Flags
- License
Installation
Pre-built Binaries (Recommended)
Download the latest release from GitHub Releases.
Windows (x64)
# Download and extract
Invoke-WebRequest -Uri "https://github.com/iyulab/unpdf/releases/latest/download/unpdf-cli-x86_64-pc-windows-msvc.zip" -OutFile "unpdf.zip"
Expand-Archive -Path "unpdf.zip" -DestinationPath "."
# Move to a directory in PATH (optional)
Move-Item -Path "unpdf.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"
# Verify installation
unpdf version
Linux (x64)
# Download and extract
# Install to /usr/local/bin (requires sudo)
# Or install to user directory
# Verify installation
macOS
# Intel Mac
# Apple Silicon (M1/M2/M3/M4)
# Install
# Verify
Available Binaries
| Platform | Architecture | File |
|---|---|---|
| Windows | x64 | unpdf-cli-x86_64-pc-windows-msvc.zip |
| Linux | x64 | unpdf-cli-x86_64-unknown-linux-gnu.tar.gz |
| macOS | Intel | unpdf-cli-x86_64-apple-darwin.tar.gz |
| macOS | Apple Silicon | unpdf-cli-aarch64-apple-darwin.tar.gz |
Updating
unpdf includes a built-in self-update mechanism:
# Check for updates
# Update to latest version
# Force reinstall (even if on latest)
Install via Cargo
If you have Rust installed:
# Install CLI
# Add library to your project
CLI Usage
Quick Start
# Extract all formats (Markdown, text, JSON) + images to output directory
# Specify output directory
# With text cleanup for LLM training
Output Structure
document_output/
├── extract.md       # Markdown output with frontmatter
├── extract.txt      # Plain text output
├── content.json     # Full structured JSON
└── images/          # Extracted images
    ├── page1_img1.png
    └── page2_img1.jpg
Commands
Convert to Markdown
# Basic conversion (output to stdout)
# Save to file
# With YAML frontmatter
# With text cleanup for LLM training
# Table rendering options
# Specify page range
Markdown Options
| Option | Description | Default |
|---|---|---|
| `-o, --output` | Output file path | stdout |
| `-f, --frontmatter` | Include YAML frontmatter | false |
| `--table-mode` | Table rendering: markdown, html, ascii | `markdown` |
| `--cleanup` | Text cleanup: minimal, standard, aggressive | `none` |
| `--max-heading` | Maximum heading level (1-6) | 6 |
| `--pages` | Page range (e.g., 1-10, 1,3,5) | `all` |
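To make the `--pages` syntax concrete, here is a minimal, dependency-free sketch of how a specification such as `1-10` or `1,3,5` could be expanded into page numbers. It is an illustration only, not unpdf's parser: the function names `parse_page_spec` and `parse_page`, the 1-based numbering, and the choice to reject out-of-range pages are all assumptions.

```rust
use std::collections::BTreeSet;

/// Expands a page specification such as "1-10" or "1,3,5" into sorted,
/// de-duplicated, 1-based page numbers. `page_count` is the document length.
/// Hypothetical helper for illustration; not part of the unpdf API.
pub fn parse_page_spec(spec: &str, page_count: u32) -> Result<Vec<u32>, String> {
    let mut pages = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err(format!("empty entry in page spec '{}'", spec));
        }
        // "a-b" is an inclusive range; a bare number is a single page.
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let page = parse_page(part)?;
                (page, page)
            }
        };
        if start > end {
            return Err(format!("descending range '{}'", part));
        }
        if end > page_count {
            return Err(format!("page {} exceeds page count {}", end, page_count));
        }
        pages.extend(start..=end);
    }
    Ok(pages.into_iter().collect())
}

/// Parses one 1-based page number; zero and non-numeric input are rejected.
fn parse_page(text: &str) -> Result<u32, String> {
    match text.trim().parse::<u32>() {
        Ok(0) | Err(_) => Err(format!("invalid page number '{}'", text.trim())),
        Ok(n) => Ok(n),
    }
}
```

Under these assumptions, `parse_page_spec("1,3,5", 10)` yields pages 1, 3, and 5, while a specification that runs past the end of the document is reported as an error rather than silently truncated.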
Convert to Plain Text
# Basic extraction
# With cleanup
# Specific pages
Convert to JSON
# Pretty-printed JSON
# Compact JSON
Show Document Information
Output:
Document Information
────────────────────────────────────────
File: document.pdf
Format: PDF 1.7
Pages: 42
Encrypted: No
Title: My Document
Author: John Doe
Creator: Microsoft Word
Producer: Adobe PDF Library
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z
Content Statistics
────────────────────────────────────────
Words: 12500
Characters: 75000
Images: 15
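The Format line above reports the PDF version. As an illustration of where such a value can come from, the sketch below reads the version from the `%PDF-x.y` header at the start of a file. It is not unpdf's implementation: `pdf_header_version` is a hypothetical name, the sketch expects the header at byte 0, and a complete reader would also honor a `/Version` entry in the document catalog, which may override the header.

```rust
/// Returns the (major, minor) version declared in a PDF header such as
/// "%PDF-1.7", or None when the bytes do not begin with a PDF header.
/// Hypothetical helper for illustration; not part of the unpdf API.
pub fn pdf_header_version(bytes: &[u8]) -> Option<(u8, u8)> {
    // Every PDF file starts with the marker "%PDF-".
    let rest = bytes.strip_prefix(b"%PDF-")?;
    // The version follows as one digit, a dot, and one digit.
    match rest {
        [major, b'.', minor, ..] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((major - b'0', minor - b'0'))
        }
        _ => None,
    }
}
```

The same check doubles as a cheap first test of whether a file is a PDF at all, although passing it says nothing about whether the rest of the file is well-formed.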
Extract Images
# Extract to current directory
# Extract to specific directory
# Extract specific pages
Self-Update
# Check for updates
# Update to latest version
# Force reinstall
Examples
# Convert PDF to Markdown with frontmatter
# Extract text from scanned PDF (requires OCR feature)
# Convert with aggressive cleanup for AI training
# Batch conversion (shell)
for; do ; done
# Batch conversion (PowerShell)
|
Rust Library Usage
Quick Start
use ;
Render Options
use ;
let options = new
.with_frontmatter
.with_table_fallback
.with_cleanup_preset
.with_max_heading
.with_page_range;
let markdown = to_markdown?;
Working with Document Structure
use parse_file;
let doc = parse_file?;
// Access metadata
println!;
println!;
println!;
println!;
// Iterate pages
for in doc.pages.iter.enumerate
// Extract images
for in &doc.resources
Page Range Selection
use ;
// Parse only specific pages
let doc = parse_file_with_options?;
// Or parse all and render specific pages
let doc = parse_file?;
let options = new
.with_pages;
let markdown = to_markdown?;
Handling Encrypted PDFs
use ;
match parse_file
Python Integration
Install the Python package:
Basic Usage
# Convert PDF to Markdown
=
# Convert to plain text
=
# Convert to JSON
=
# Get document information
=
Working with Bytes
# Read PDF from file
=
# Convert from bytes
=
Check PDF Validity
# Check if file is a valid PDF
=
C# / .NET Integration
unpdf provides C-ABI compatible bindings for integration with C# and .NET applications.
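For readers unfamiliar with C-ABI bindings, the sketch below shows the usual Rust-side pattern for handing a string to a foreign caller: one function returns a heap-allocated, NUL-terminated string, and a second function frees it. The names `demo_to_markdown` and `demo_free_string` are hypothetical and the conversion is a placeholder; unpdf's actual exported symbols and signatures are not shown in this document.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical export: takes a NUL-terminated path and returns a newly
/// allocated NUL-terminated string, or null on invalid input.
/// A real cdylib export would also carry a no-mangle attribute so that the
/// symbol name stays stable for foreign callers.
pub unsafe extern "C" fn demo_to_markdown(path: *const c_char) -> *mut c_char {
    if path.is_null() {
        return ptr::null_mut();
    }
    // SAFETY: the caller promises `path` points to a NUL-terminated string.
    let path = unsafe { CStr::from_ptr(path) };
    let path = match path.to_str() {
        Ok(p) => p,
        Err(_) => return ptr::null_mut(), // not valid UTF-8
    };
    // Placeholder for the real conversion work.
    let markdown = format!("# {}\n", path);
    match CString::new(markdown) {
        Ok(s) => s.into_raw(), // ownership passes to the caller
        Err(_) => ptr::null_mut(),
    }
}

/// Hypothetical export: releases a string returned by `demo_to_markdown`.
pub unsafe extern "C" fn demo_free_string(s: *mut c_char) {
    if !s.is_null() {
        // SAFETY: `s` came from `CString::into_raw` and has not been freed.
        unsafe { drop(CString::from_raw(s)) };
    }
}
```

The paired free function matters because the string was allocated by Rust's allocator; a C# or Python caller must return it to the library instead of releasing it with its own runtime.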
Installation via NuGet
Or via Package Manager Console:
Install-Package Unpdf
Getting the Native Library (Manual)
Alternatively, download from GitHub Releases:
| Platform | Library File |
|---|---|
| Windows x64 | unpdf.dll |
| Linux x64 | libunpdf.so |
| macOS | libunpdf.dylib |
Or build from source:
C# Wrapper Usage
using Unpdf;
// Convert PDF to Markdown
string markdown = Pdf.ToMarkdown("document.pdf");
// Convert PDF to plain text
string text = Pdf.ToText("document.pdf");
// Convert PDF to JSON
string json = Pdf.ToJson("document.pdf", pretty: true);
// Get document information
var info = Pdf.GetInfo("document.pdf");
Console.WriteLine($"Title: {info.Title}, Pages: {info.PageCount}");
// Convert with options
var options = new PdfOptions
{
    IncludeFrontmatter = true,
    ExtractImages = true,
    ImageOutputDir = "./images",
    Lenient = true
};
string markdownWithImages = Pdf.ToMarkdown("document.pdf", options);
// Extract images only
var images = Pdf.ExtractImages("document.pdf", "./output/images");
foreach (var img in images)
{
    Console.WriteLine($"{img.Filename}: {img.Width}x{img.Height}");
}
ASP.NET Core Example
[ApiController]
[Route("api/[controller]")]
public class PdfController : ControllerBase
{
    [HttpPost("convert")]
    public async Task<IActionResult> ConvertPdf(IFormFile file)
    {
        if (file == null) return BadRequest("No file");

        // Save to temp file for processing
        var tempPath = Path.GetTempFileName();
        try
        {
            using (var stream = System.IO.File.Create(tempPath))
            {
                await file.CopyToAsync(stream);
            }

            var options = new PdfOptions { IncludeFrontmatter = true };
            var markdown = Pdf.ToMarkdown(tempPath, options);
            return Ok(new { markdown });
        }
        catch (UnpdfException ex)
        {
            return BadRequest(new { error = ex.Message });
        }
        finally
        {
            if (System.IO.File.Exists(tempPath))
                System.IO.File.Delete(tempPath);
        }
    }

    [HttpPost("extract-images")]
    public async Task<IActionResult> ExtractImages(IFormFile file)
    {
        if (file == null) return BadRequest("No file");

        var tempPath = Path.GetTempFileName();
        var outputDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString());
        try
        {
            using (var stream = System.IO.File.Create(tempPath))
            {
                await file.CopyToAsync(stream);
            }

            var images = Pdf.ExtractImages(tempPath, outputDir);
            return Ok(new { count = images.Count, images });
        }
        catch (UnpdfException ex)
        {
            return BadRequest(new { error = ex.Message });
        }
        finally
        {
            if (System.IO.File.Exists(tempPath))
                System.IO.File.Delete(tempPath);
        }
    }
}
Output Formats
Markdown
Structured Markdown with preserved formatting:
- Headings: Document headings (detected from font size/style) -> `#`, `##`, `###`
- Paragraphs: Text blocks with proper spacing
- Lists: Detected ordered and unordered lists
- Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
- Inline styles: Bold (`**`), italic (`*`)
- Hyperlinks: Preserved as Markdown links
- Images: Reference-style image links
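As a rough illustration of heading detection from font size, the sketch below maps a text run's size, relative to the body size, to a heading level and caps it at a maximum level in the spirit of `--max-heading`. Everything here is assumed for the example: the function names, the ratio thresholds (2.0, 1.5, 1.2), and the choice to clamp deeper headings to the maximum are not taken from unpdf.

```rust
/// Maps a text run's font size to a Markdown heading level, or None for
/// body text. The thresholds are illustrative assumptions, not unpdf's.
pub fn heading_level(font_size: f32, body_size: f32, max_heading: u8) -> Option<u8> {
    if body_size <= 0.0 {
        return None;
    }
    let ratio = font_size / body_size;
    let level: u8 = if ratio >= 2.0 {
        1
    } else if ratio >= 1.5 {
        2
    } else if ratio >= 1.2 {
        3
    } else {
        return None; // close to body size: treat as a paragraph
    };
    // Never emit a heading deeper than the configured maximum (1-6).
    Some(level.min(max_heading.clamp(1, 6)))
}

/// Renders a heading line such as "## Title" for the given level.
pub fn render_heading(level: u8, text: &str) -> String {
    format!("{} {}", "#".repeat(level as usize), text.trim())
}
```

A real detector would likely weigh more signals than size alone, such as weight, spacing, and position on the page, but the size ratio conveys the basic idea.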
Plain Text
Pure text content without formatting markers.
JSON
Complete document structure with metadata:
Supported PDF Features
| Feature | Status |
|---|---|
| PDF 1.0 - 2.0 | Supported |
| Compressed object streams | Supported |
| Text extraction | Supported |
| Table detection | Supported |
| Image extraction | Supported |
| Hyperlinks | Supported |
| Bookmarks/Outlines | Supported |
| Password-protected (user password) | Supported |
| Password-protected (owner password) | Supported |
| AES-256 encryption | Supported |
| Digital signatures | Metadata only |
| Form fields (AcroForms) | Planned |
| XFA forms | Planned |
| OCR (image-based PDFs) | Planned (feature flag) |
Feature Flags
| Feature | Description | Default |
|---|---|---|
| `default` | Core PDF parsing and rendering | Yes |
| `ffi` | C-ABI foreign function interface | No |
| `async` | Async I/O with Tokio | No |
| `ocr` | OCR support via Tesseract | No |
# Cargo.toml - enable features
[dependencies]
unpdf = { version = "0.1", features = ["ffi", "async"] }
Performance
- Parallel page processing with Rayon
- Efficient PDF parsing with lopdf
- Memory-efficient handling of large documents
- Streaming support for very large files
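The idea behind parallel page processing can be sketched without any dependencies. The example below splits pages into chunks, processes each chunk on its own thread, and joins the results in page order. Note the substitution: it uses `std::thread::scope` from the standard library as a stand-in for Rayon so that it compiles on its own, and `process_pages` and the `extract` callback are hypothetical names, not unpdf's internals.

```rust
use std::thread;

/// Runs `extract` over every page using up to `workers` threads and returns
/// the results in page order. Standard-library stand-in for a Rayon pipeline.
pub fn process_pages<F>(pages: &[String], workers: usize, extract: F) -> Vec<String>
where
    F: Fn(&str) -> String + Sync,
{
    let workers = workers.max(1);
    // Ceiling division so every page lands in exactly one chunk.
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);
    let extract = &extract;

    thread::scope(|scope| {
        // Spawn one scoped thread per chunk; scoped threads may borrow `pages`.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|page| extract(page.as_str()))
                        .collect::<Vec<String>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the output in page order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}
```

With Rayon, the chunking and joining would typically collapse into a parallel iterator over the pages, with work stealing balancing uneven page costs; the fixed chunks here are the simpler trade-off.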
License
MIT License - see LICENSE for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.