§unpdf
High-performance PDF content extraction library for Rust.
This library extracts content from PDF documents and converts it to structured formats like Markdown, plain text, and JSON.
§Quick Start
use unpdf::{parse_file, render};

fn main() -> unpdf::Result<()> {
    // Parse a PDF file
    let doc = parse_file("document.pdf")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    Ok(())
}
§Features
- Multiple output formats: Markdown, plain text, JSON
- Structure preservation: Headings, paragraphs, tables, lists
- Asset extraction: Images and embedded resources
- CJK support: Korean, Chinese, Japanese text handling
- Parallel processing: Uses Rayon for multi-page documents
- Cleanup pipeline: Text normalization for LLM training data
Re-exports§
pub use convert::ConvertOptions;
pub use convert::ConvertResult;
pub use convert::ConverterRegistry;
pub use convert::DocumentConverter;
pub use convert::OutputFormat;
pub use detect::detect_format_from_bytes;
pub use detect::detect_format_from_path;
pub use detect::is_pdf;
pub use detect::PdfFormat;
pub use error::Error;
pub use error::Result;
pub use model::Alignment;
pub use model::Block;
pub use model::Document;
pub use model::InlineContent;
pub use model::ListInfo;
pub use model::Metadata;
pub use model::Outline;
pub use model::Page;
pub use model::Paragraph;
pub use model::ParagraphStyle;
pub use model::Resource;
pub use model::ResourceType;
pub use model::Table;
pub use model::TableCell;
pub use model::TableRow;
pub use model::TextRun;
pub use model::TextStyle;
pub use parser::ParseOptions;
pub use parser::PdfParser;
pub use render::CleanupOptions;
pub use render::CleanupPreset;
pub use render::HeadingConfig;
pub use render::JsonFormat;
pub use render::PageSelection;
pub use render::RenderOptions;
pub use render::TableFallback;
Modules§
- convert
- Document converter module providing a plugin architecture for multiple formats.
- detect
- PDF format detection and validation.
- error
- Error types for the unpdf library.
- model
- Document model types for PDF content representation.
- parser
- PDF parsing module.
- render
- Rendering module for converting documents to various output formats.
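As a rough illustration of what the format detection in the detect module involves: a conforming PDF file begins with the header `%PDF-` followed by a version number. The sketch below checks for that header using only the standard library. The names `pdf_header_version` and `looks_like_pdf` are hypothetical helpers invented for this example. They are not part of unpdf, and the crate's own `is_pdf` and `detect_format_from_bytes` may work differently.

```rust
/// Hypothetical helper (not part of unpdf): returns the version string from a
/// PDF header such as "%PDF-1.7", or `None` if the bytes do not start with one.
fn pdf_header_version(bytes: &[u8]) -> Option<&str> {
    const MAGIC: &[u8] = b"%PDF-";

    // A conforming PDF begins with the magic bytes "%PDF-".
    if !bytes.starts_with(MAGIC) {
        return None;
    }
    let rest = &bytes[MAGIC.len()..];

    // The version runs up to the first whitespace byte (usually the end of line).
    let end = rest
        .iter()
        .position(|b| b.is_ascii_whitespace())
        .unwrap_or(rest.len());

    // Reject a header with no version or with non-UTF-8 bytes in the version.
    let version = std::str::from_utf8(&rest[..end]).ok()?;
    if version.is_empty() {
        None
    } else {
        Some(version)
    }
}

/// Hypothetical helper (not part of unpdf): true if the bytes carry a PDF header.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    pdf_header_version(bytes).is_some()
}
```

A header check like this is cheap enough to run before handing the bytes to a full parser. It only inspects the first line, so it says nothing about whether the rest of the file is well-formed.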
Structs§
- Unpdf
- Builder for parsing and converting PDF documents.
- UnpdfResult
- Result of parsing a PDF document.
Functions§
- extract_text
- Extract plain text from a PDF file.
- parse_bytes
- Parse a PDF from bytes.
- parse_bytes_with_options
- Parse a PDF from bytes with custom options.
- parse_file
- Parse a PDF file and return a structured document.
- parse_file_with_options
- Parse a PDF file with custom options.
- parse_file_with_password
- Parse a password-protected PDF file.
- parse_reader
- Parse a PDF from a reader.
- parse_reader_with_options
- Parse a PDF from a reader with custom options.
- to_json
- Convert a PDF to JSON.
- to_markdown
- Convert a PDF to Markdown.
- to_markdown_with_options
- Convert a PDF to Markdown with custom options.
- to_text
- Convert a PDF to plain text with cleanup.