Crate crw_pdf

Smart PDF detection and text extraction using lopdf.

§Quick start

// Full processing (detect + extract + markdown) with defaults
let result = crw_pdf::process_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", result.pdf_type, result.page_count);
if let Some(md) = &result.markdown {
    println!("{md}");
}

// Fast metadata-only detection (no text extraction)
let info = crw_pdf::detect_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", info.pdf_type, info.page_count);

// Custom options via builder
use crw_pdf::{PdfOptions, ProcessMode};
let result = crw_pdf::process_pdf_with_options(
    "document.pdf",
    PdfOptions::new().mode(ProcessMode::Analyze),
).unwrap();

Re-exports§

pub use detector::detect_pdf_type;
pub use detector::detect_pdf_type_mem;
pub use detector::detect_pdf_type_mem_with_config;
pub use detector::detect_pdf_type_with_config;
pub use detector::DetectionConfig;
pub use detector::PdfType;
pub use detector::PdfTypeResult;
pub use detector::ScanStrategy;
pub use extractor::extract_text;
pub use extractor::extract_text_with_positions;
pub use extractor::extract_text_with_positions_pages;
pub use markdown::to_markdown;
pub use markdown::to_markdown_from_items;
pub use markdown::to_markdown_from_items_with_rects;
pub use markdown::MarkdownOptions;
pub use process_mode::ProcessMode;
pub use types::LayoutComplexity;
pub use types::PdfLine;
pub use types::PdfRect;
pub use types::TextItem;

Modules§

adobe_korea1: Adobe-Korea1 CID-to-Unicode mapping table.
detector: Smart PDF type detection without full document load
extractor: Text extraction from PDF using lopdf
glyph_names: Glyph name to Unicode mapping
markdown: Markdown conversion with structure detection.
process_mode
structure_tree: Tagged PDF structure tree parser.
tables: Table detection and formatting.
text_utils: Character classification and text utility functions.
tounicode: ToUnicode CMap parsing for PDF text extraction
types: Shared types used across the extraction and markdown pipelines.
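
To give a feel for the kind of work the tounicode module description refers to, the following is a minimal sketch of parsing the `beginbfchar` sections of a ToUnicode CMap, where each entry maps a character code to a UTF-16BE string. It is not the crate's implementation: `hex_to_bytes` and `parse_bfchar` are hypothetical helpers written only for this illustration, they use the standard library alone, and `beginbfrange` sections are deliberately ignored.

```rust
use std::collections::HashMap;

/// Decode an even-length hex string such as "0041" into bytes.
/// Returns None for empty or malformed input.
/// (Hypothetical helper, not part of crw_pdf.)
fn hex_to_bytes(hex: &str) -> Option<Vec<u8>> {
    if hex.is_empty() || hex.len() % 2 != 0 || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    (0..hex.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&hex[i..i + 2], 16).ok())
        .collect()
}

/// Collect `<src> <dst>` pairs found between `beginbfchar` and `endbfchar`.
/// `src` is read as a big-endian character code, `dst` as UTF-16BE text.
/// (Hypothetical helper, not part of crw_pdf.)
fn parse_bfchar(cmap: &str) -> HashMap<u32, String> {
    let mut map = HashMap::new();
    let mut in_section = false;
    for line in cmap.lines().map(str::trim) {
        if line.ends_with("beginbfchar") {
            in_section = true;
            continue;
        }
        if line == "endbfchar" {
            in_section = false;
            continue;
        }
        if !in_section {
            continue;
        }
        // Split on the angle brackets and keep the two hex tokens.
        let tokens: Vec<&str> = line
            .split(|c: char| c == '<' || c == '>')
            .map(str::trim)
            .filter(|t| !t.is_empty())
            .collect();
        if tokens.len() != 2 {
            continue;
        }
        let (src, dst) = match (hex_to_bytes(tokens[0]), hex_to_bytes(tokens[1])) {
            (Some(s), Some(d)) => (s, d),
            _ => continue,
        };
        // Skip codes wider than 32 bits and odd-length UTF-16 payloads.
        if src.len() > 4 || dst.len() % 2 != 0 {
            continue;
        }
        let code = src.iter().fold(0u32, |acc, b| (acc << 8) | u32::from(*b));
        let units: Vec<u16> = dst
            .chunks_exact(2)
            .map(|c| u16::from_be_bytes([c[0], c[1]]))
            .collect();
        if let Ok(text) = String::from_utf16(&units) {
            map.insert(code, text);
        }
    }
    map
}
```

Calling `parse_bfchar` on a CMap fragment containing the entry `<0024> <0041>` inside a `beginbfchar` section yields a map in which code `0x0024` resolves to `"A"`. Multi-unit destinations such as `<00660069>` decode to multi-character strings, which is how ligature glyphs are usually mapped back to text.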

Structs§

PdfOptions: Configuration for process_pdf_with_options and friends.
PdfProcessResult: High-level PDF processing result.

Enums§

PdfError

Functions§

detect_pdf: Fast metadata-only detection — no text extraction or markdown generation.
detect_pdf_mem: Fast metadata-only detection from a memory buffer.
process_pdf: Process a PDF file with full extraction (detect → extract → markdown).
process_pdf_mem: Process a PDF from a memory buffer with full extraction.
process_pdf_mem_with_config (Deprecated): Process a PDF from a memory buffer with custom detection and markdown configuration.
process_pdf_mem_with_options: Process a PDF from a memory buffer with custom options.
process_pdf_with_config (Deprecated): Process a PDF file with custom detection and markdown configuration.
process_pdf_with_config_pages (Deprecated): Process a PDF file with custom configuration and optional page filter.
process_pdf_with_options: Process a PDF file with custom options.
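
Several of the functions above accept a memory buffer instead of a file path. Before handing arbitrary bytes to such an entry point, a caller might want a cheap sanity check that the buffer looks like a PDF at all. The sketch below is one way to do that with the standard library only. `pdf_header_version` is a hypothetical helper, not part of crw_pdf, and it only accepts a `%PDF-` header at offset 0, which is stricter than some readers are.

```rust
/// Return the version text (for example "1.7") if `buf` begins with a
/// `%PDF-<version>` header, or None otherwise.
/// (Hypothetical helper, not part of crw_pdf.)
fn pdf_header_version(buf: &[u8]) -> Option<&str> {
    // The header must start at the very first byte in this sketch.
    let rest = buf.strip_prefix(b"%PDF-")?;
    // The version runs until the first byte that is not a digit or a dot.
    let end = rest
        .iter()
        .position(|b| !(b.is_ascii_digit() || *b == b'.'))
        .unwrap_or(rest.len());
    if end == 0 {
        return None;
    }
    std::str::from_utf8(&rest[..end]).ok()
}
```

A buffer that fails this check can be rejected without invoking the PDF parser. A buffer that passes it is not guaranteed to be a valid PDF, so errors from the processing functions still need to be handled.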
