edgequake-pdf2md is a Rust CLI and library that converts PDF files (local or URL) into well-structured Markdown using vision-capable LLMs. It rasterises each page with pdfium, sends the image to a VLM (GPT-4.1, Claude, Gemini, etc.), and post-processes the result into clean Markdown.
Inspired by pyzerox, rebuilt in Rust for speed and reliability.
Features
- Multi-provider — OpenAI, Anthropic, Google Gemini, Azure, Ollama, or any OpenAI-compatible endpoint
- Fast — concurrent page processing with configurable parallelism
- Accurate — 10-rule post-processing pipeline fixes tables, removes hallucinations, normalises output
- Flexible — page selection, fidelity tiers, custom system prompts, streaming API
- Cross-platform — macOS (ARM/x64), Linux (x64/ARM64/musl), Windows
- Library + CLI — use as a Rust crate or standalone command-line tool
Quick Start
1. Install pdfium
# Auto-detect OS & architecture
# macOS: set library path
# Linux: set library path
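A sketch of those steps, assuming pdfium comes from the pdfium-binaries releases and is unpacked into `~/.local/lib` (paths and asset names are illustrative):

```shell
# Auto-detect OS & architecture to choose the matching pdfium-binaries asset
OS=$(uname -s | tr '[:upper:]' '[:lower:]')     # "darwin" or "linux"
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)        PDFIUM_ARCH=x64   ;;
  arm64|aarch64) PDFIUM_ARCH=arm64 ;;
esac
echo "asset: pdfium-${OS}-${PDFIUM_ARCH}"

# macOS: set library path (assuming the dylib was unpacked into ~/.local/lib)
export DYLD_LIBRARY_PATH="$HOME/.local/lib:${DYLD_LIBRARY_PATH:-}"
# Linux: set library path
export LD_LIBRARY_PATH="$HOME/.local/lib:${LD_LIBRARY_PATH:-}"
```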
See docs/installation.md for other install options (Homebrew, apt, manual download).
2. Set an API key
# OpenAI (recommended)
# or
# Anthropic
# or
# Google Gemini
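For example (the variable names below are the conventional ones for each provider's SDK; confirm the exact names with `pdf2md --help`):

```shell
# OpenAI (recommended)
export OPENAI_API_KEY="sk-..."
# or Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
# or Google Gemini
export GEMINI_API_KEY="AIza..."
```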
3. Build & run
# Convert a PDF
# Convert from URL
# Inspect metadata (no API key needed)
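For example, assuming the binary is named `pdf2md` as elsewhere in this README (the `-o` flag and the `inspect` subcommand are assumptions):

```shell
cargo build --release

# Convert a PDF
./target/release/pdf2md document.pdf -o document.md
# Convert from URL
./target/release/pdf2md https://example.com/paper.pdf -o paper.md
# Inspect metadata (no API key needed)
./target/release/pdf2md inspect document.pdf
```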
Or install globally:
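From a checkout of the repository:

```shell
# Installs the pdf2md binary into ~/.cargo/bin
cargo install --path .
```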
How It Works
PDF ──▶ pdfium (render) ──▶ per-page PNG images ──▶ base64 encode ──▶ VLM API (concurrent) ──▶ post-process (10 rules) ──▶ assembled Markdown
- Input — resolve local file or download from URL
- Render — rasterise pages to images via pdfium-render
- Encode — base64-encode each page image
- VLM — send images to a vision LLM with a structured system prompt
- Post-process — strip fences, fix tables, remove hallucinated images, normalise whitespace
- Assemble — join pages with optional separators and YAML front-matter
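The Encode stage, for instance, amounts to turning each rendered page into a base64 data URL before it is sent to the VLM. In shell terms, with a stand-in file (the real pipeline does this in Rust):

```shell
# Stand-in for a pdfium-rendered page image
printf 'fake-png-bytes' > page-1.png
# Base64-encode and wrap as a data URL, the form vision APIs accept
B64=$(base64 < page-1.png | tr -d '\n')
echo "data:image/png;base64,${B64}"
```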
See docs/how-it-works.md for the full pipeline walkthrough with diagrams.
Usage
# Basic conversion
# Specific pages
# High fidelity with a better model
# Consistent formatting across pages (sequential mode)
# JSON output with metadata
# Use Anthropic
# Use local Ollama
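Concretely, with flag names taken from the Configuration table below (`-o` and `--json` are assumptions):

```shell
# Basic conversion
pdf2md report.pdf -o report.md
# Specific pages
pdf2md report.pdf --pages 1-5,12 -o excerpt.md
# High fidelity with a better model
pdf2md report.pdf --fidelity tier3 --model gpt-4.1
# Consistent formatting across pages (sequential mode)
pdf2md report.pdf --maintain-format
# JSON output with metadata
pdf2md report.pdf --json
# Use Anthropic
pdf2md report.pdf --provider anthropic --model claude-haiku-4-20250514
# Use local Ollama
pdf2md report.pdf --provider ollama --model llava
```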
Run pdf2md --help for the full reference, including supported models and cost estimates.
See docs/examples.md for more usage patterns.
Supported Providers & Models
| Provider | Model | Input $/1M | Output $/1M | Vision |
|---|---|---|---|---|
| OpenAI | gpt-4.1-nano (default) | $0.10 | $0.40 | ✓ |
| OpenAI | gpt-4.1-mini | $0.40 | $1.60 | ✓ |
| OpenAI | gpt-4.1 | $2.00 | $8.00 | ✓ |
| Anthropic | claude-sonnet-4-20250514 | $3.00 | $15.00 | ✓ |
| Anthropic | claude-haiku-4-20250514 | $0.80 | $4.00 | ✓ |
| Gemini | gemini-2.0-flash | $0.10 | $0.40 | ✓ |
| Gemini | gemini-2.5-pro | $1.25 | $10.00 | ✓ |
| Ollama | llava, llama3.2-vision | free | free | ✓ |
Cost estimate: a 50-page document costs ~$0.02 with gpt-4.1-nano and ~$0.09 with gpt-4.1-mini.
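That figure is easy to reproduce as a back-of-envelope check, assuming roughly 1,500 input tokens per page image and 500 output tokens per page (both assumptions; actual token counts vary with DPI and page density):

```shell
# 50 pages at gpt-4.1-nano pricing: $0.10/1M input, $0.40/1M output
COST=$(awk 'BEGIN {
  pages = 50; in_tok = 1500; out_tok = 500
  in_rate = 0.10 / 1000000; out_rate = 0.40 / 1000000
  printf "%.4f", pages * (in_tok * in_rate + out_tok * out_rate)
}')
echo "estimated cost: \$${COST}"
```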
See docs/providers.md for detailed comparisons, cost calculators, and selection guide.
Library Usage
Add to your Cargo.toml:
[dependencies]
edgequake-pdf2md = "0.1"
tokio = { version = "1", features = ["full"] }
A minimal async example (the exact function name and signature are assumptions; see the API docs on docs.rs):

use edgequake_pdf2md::convert;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let markdown = convert("document.pdf").await?;
    println!("{markdown}");
    Ok(())
}
Also available: streaming API (convert_stream), sync wrapper (convert_sync), metadata inspection (inspect).
See API docs on docs.rs for the full API reference.
Configuration
All options can be set via CLI flags, environment variables, or the builder API:
| Flag | Env Variable | Default | Description |
|---|---|---|---|
| --model | EDGEQUAKE_MODEL | gpt-4.1-nano | VLM model |
| --provider | EDGEQUAKE_PROVIDER | auto-detect | LLM provider |
| --dpi | PDF2MD_DPI | 150 | Rendering resolution (72–400) |
| --pages | PDF2MD_PAGES | all | Page selection |
| --fidelity | PDF2MD_FIDELITY | tier2 | Quality tier (tier1/tier2/tier3) |
| -c, --concurrency | PDF2MD_CONCURRENCY | 10 | Parallel API calls |
| --maintain-format | PDF2MD_MAINTAIN_FORMAT | false | Sequential mode |
| --separator | PDF2MD_SEPARATOR | none | Page separator |
| --temperature | PDF2MD_TEMPERATURE | 0.1 | LLM temperature |
See docs/configuration.md for the complete reference.
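For example, the environment-variable route, using the names from the table above (values illustrative):

```shell
export EDGEQUAKE_MODEL=gpt-4.1-mini
export PDF2MD_DPI=200
export PDF2MD_CONCURRENCY=4
export PDF2MD_FIDELITY=tier3
```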
Development
# Setup
# Build
# Test
# Quality
# Try it
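With a standard Cargo workflow those steps are typically (the sample file name is illustrative):

```shell
# Build
cargo build
# Test
cargo test
# Quality: formatting and lints
cargo fmt --check && cargo clippy --all-targets -- -D warnings
# Try it
cargo run --release -- sample.pdf
```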
Documentation
| Document | Description |
|---|---|
| docs/how-it-works.md | Pipeline architecture with ASCII diagrams |
| docs/installation.md | Setup guide for all platforms |
| docs/providers.md | Supported models, pricing, selection guide |
| docs/configuration.md | All CLI flags and environment variables |
| docs/examples.md | Real-world usage examples |
Dependencies
| Crate | Purpose |
|---|---|
| pdfium-render | PDF rasterisation via Google's pdfium C++ library |
| edgequake-llm | Multi-provider LLM abstraction (OpenAI, Anthropic, Gemini, etc.) |
| tokio | Async runtime |
| image | Image encoding (PNG/JPEG) |
| clap | CLI argument parsing |
External References
- pdfium — Google's open-source PDF rendering engine
- pdfium-binaries — Pre-built pdfium binaries for all platforms
- pyzerox — The Python project that inspired this tool
- OpenAI Vision API — Image understanding with GPT-4.1
- Anthropic Vision — Image understanding with Claude
- Google Gemini — Vision capabilities
License
Copyright 2026 Raphaël MANSUY
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
See LICENSE for the full text.