
Crate unpdf


§unpdf

High-performance PDF content extraction library for Rust.

This library extracts content from PDF documents and converts it to Markdown, plain text, or JSON.

§Quick Start

use unpdf::{parse_file, render};

fn main() -> unpdf::Result<()> {
    // Parse a PDF file
    let doc = parse_file("document.pdf")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    Ok(())
}

§Features

  • Multiple output formats: Markdown, plain text, JSON
  • Structure preservation: Headings, paragraphs, tables, lists
  • Asset extraction: Images and embedded resources
  • CJK support: Korean, Chinese, Japanese text handling
  • Parallel processing: Uses Rayon for multi-page documents
  • Cleanup pipeline: Text normalization for LLM training data
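
To make the "text normalization" bullet concrete, here is a minimal sketch of the kind of cleanup such a pipeline might apply to extracted text. It uses only the standard library and is not taken from unpdf: the function name `normalize_text` and the two rules it applies (rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions chosen for illustration. The crate's actual behaviour is configured through `CleanupOptions` and `CleanupPreset`.

```rust
/// Hypothetical helper, NOT part of the unpdf API.
///
/// Applies two common cleanup rules to text extracted from a PDF:
/// 1. a line ending in `-` is glued to the following line (de-hyphenation);
/// 2. every run of whitespace, including newlines, becomes a single space.
///
/// Rule 1 is deliberately naive: it also joins genuine compounds that happen
/// to break at the hyphen (for example "well-" / "known").
fn normalize_text(input: &str) -> String {
    let mut joined = String::with_capacity(input.len());
    let mut lines = input.lines().map(str::trim).peekable();

    while let Some(line) = lines.next() {
        let has_next = lines.peek().is_some();
        match line.strip_suffix('-') {
            // Trailing hyphen with more text to come: drop the hyphen and
            // do not insert a separator, so the two halves form one word.
            Some(stem) if has_next => joined.push_str(stem),
            // Ordinary line: keep it and separate it from the next one.
            _ => {
                joined.push_str(line);
                joined.push(' ');
            }
        }
    }

    // Collapse any remaining whitespace runs and trim both ends.
    joined.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Called as `normalize_text("High-perfor-\nmance   PDF\n\textraction ")`, this sketch yields `"High-performance PDF extraction"`. A production cleanup step would likely need a dictionary or language model to decide when a trailing hyphen is real.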

Re-exports§

pub use convert::ConvertOptions;
pub use convert::ConvertResult;
pub use convert::ConverterRegistry;
pub use convert::DocumentConverter;
pub use convert::OutputFormat;
pub use detect::detect_format_from_bytes;
pub use detect::detect_format_from_path;
pub use detect::is_pdf;
pub use detect::PdfFormat;
pub use error::Error;
pub use error::Result;
pub use model::Alignment;
pub use model::Block;
pub use model::Document;
pub use model::InlineContent;
pub use model::ListInfo;
pub use model::Metadata;
pub use model::Outline;
pub use model::Page;
pub use model::Paragraph;
pub use model::ParagraphStyle;
pub use model::Resource;
pub use model::ResourceType;
pub use model::Table;
pub use model::TableCell;
pub use model::TableRow;
pub use model::TextRun;
pub use model::TextStyle;
pub use parser::ParseOptions;
pub use parser::PdfParser;
pub use render::CleanupOptions;
pub use render::CleanupPreset;
pub use render::HeadingConfig;
pub use render::JsonFormat;
pub use render::PageSelection;
pub use render::RenderOptions;
pub use render::TableFallback;

Modules§

convert
Document converter module providing a plugin architecture for multiple formats.
detect
PDF format detection and validation.
error
Error types for the unpdf library.
model
Document model types for PDF content representation.
parser
PDF parsing module.
render
Rendering module for converting documents to various output formats.
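
As an illustration of what format detection in the spirit of the `detect` module can involve, the sketch below checks for the `%PDF-` header that PDF files begin with and reads the version digits that follow it. This is a standard-library sketch under stated assumptions, not the crate's implementation: `looks_like_pdf` and `pdf_header_version` are hypothetical names, and the real `is_pdf`, `detect_format_from_bytes`, and `PdfFormat` may behave differently.

```rust
/// Hypothetical stand-in, NOT unpdf's `is_pdf`.
/// Returns true when the buffer starts with the PDF header marker `%PDF-`.
/// The check is strict: the marker must sit at offset zero.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Hypothetical helper, NOT part of the unpdf API.
/// Parses a single-digit `major.minor` version from a header such as
/// `%PDF-1.7`, returning `None` if the header is missing or malformed.
fn pdf_header_version(bytes: &[u8]) -> Option<(u8, u8)> {
    let rest = bytes.strip_prefix(b"%PDF-")?;
    match rest {
        // Expect exactly: ASCII digit, '.', ASCII digit, then anything.
        [major, b'.', minor, ..] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((*major - b'0', *minor - b'0'))
        }
        _ => None,
    }
}
```

For example, `pdf_header_version(b"%PDF-1.7\n")` returns `Some((1, 7))`, while a ZIP signature such as `b"PK\x03\x04"` fails `looks_like_pdf`. Checking the header first is a cheap way to reject non-PDF input before attempting a full parse.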

Structs§

Unpdf
Builder for parsing and converting PDF documents.
UnpdfResult
Result of parsing a PDF document.

Functions§

extract_text
Extract plain text from a PDF file.
parse_bytes
Parse a PDF from bytes.
parse_bytes_with_options
Parse a PDF from bytes with custom options.
parse_file
Parse a PDF file and return a structured document.
parse_file_with_options
Parse a PDF file with custom options.
parse_file_with_password
Parse a password-protected PDF file.
parse_reader
Parse a PDF from a reader.
parse_reader_with_options
Parse a PDF from a reader with custom options.
to_json
Convert a PDF to JSON.
to_markdown
Convert a PDF to Markdown.
to_markdown_with_options
Convert a PDF to Markdown with custom options.
to_text
Convert a PDF to plain text with cleanup.