Crate crw_pdf

Smart PDF detection and text extraction using lopdf

§Quick start

// Full processing (detect + extract + markdown) with defaults
let result = crw_pdf::process_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", result.pdf_type, result.page_count);
if let Some(md) = &result.markdown {
    println!("{md}");
}

// Fast metadata-only detection (no text extraction)
let info = crw_pdf::detect_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", info.pdf_type, info.page_count);

// Custom options via builder
use crw_pdf::{PdfOptions, ProcessMode};
let result = crw_pdf::process_pdf_with_options(
    "document.pdf",
    PdfOptions::new().mode(ProcessMode::Analyze),
).unwrap();
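
The `PdfOptions::new().mode(...)` chain above has the shape of a consuming builder, where each setter takes `self` by value and returns it. The stand-alone sketch below illustrates that general pattern only. `Options`, `Mode`, the `Full` variant, and `max_pages` are hypothetical names invented for this example and are not taken from crw_pdf; consult the `PdfOptions` and `ProcessMode` pages for the real fields and variants.

```rust
// Hypothetical stand-ins for illustration only; these are NOT crw_pdf's types.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
enum Mode {
    #[default]
    Full,
    Analyze,
}

#[derive(Debug, Clone, PartialEq, Default)]
struct Options {
    mode: Mode,
    max_pages: Option<usize>,
}

impl Options {
    /// Start from the defaults (Mode::Full, no page limit).
    fn new() -> Self {
        Self::default()
    }

    /// Each setter consumes `self` and returns it, which is what allows chaining.
    fn mode(mut self, mode: Mode) -> Self {
        self.mode = mode;
        self
    }

    fn max_pages(mut self, n: usize) -> Self {
        self.max_pages = Some(n);
        self
    }
}

fn main() {
    let defaults = Options::new();
    let custom = Options::new().mode(Mode::Analyze).max_pages(10);
    println!("defaults: {defaults:?}");
    println!("custom:   {custom:?}");
}
```

A consuming builder keeps call sites short and lets new options be added later without changing existing function signatures, which may be why the `_with_config` variants listed on this page are marked deprecated in favor of `_with_options`.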

Re-exports§

pub use detector::detect_pdf_type;
pub use detector::detect_pdf_type_mem;
pub use detector::detect_pdf_type_mem_with_config;
pub use detector::detect_pdf_type_with_config;
pub use detector::DetectionConfig;
pub use detector::PdfType;
pub use detector::PdfTypeResult;
pub use detector::ScanStrategy;
pub use extractor::extract_text;
pub use extractor::extract_text_with_positions;
pub use extractor::extract_text_with_positions_pages;
pub use markdown::to_markdown;
pub use markdown::to_markdown_from_items;
pub use markdown::to_markdown_from_items_with_rects;
pub use markdown::MarkdownOptions;
pub use process_mode::ProcessMode;
pub use types::LayoutComplexity;
pub use types::PdfLine;
pub use types::PdfRect;
pub use types::TextItem;

Modules§

adobe_korea1
Adobe-Korea1 CID-to-Unicode mapping table.
detector
Smart PDF type detection without full document load
extractor
Text extraction from PDF using lopdf
glyph_names
Glyph name to Unicode mapping
markdown
Markdown conversion with structure detection.
process_mode
structure_tree
Tagged PDF structure tree parser.
tables
Table detection and formatting.
text_utils
Character classification and text utility functions.
tounicode
ToUnicode CMap parsing for PDF text extraction
types
Shared types used across the extraction and markdown pipelines.

Structs§

PdfOptions
Configuration for process_pdf_with_options and friends.
PdfProcessResult
High-level PDF processing result.

Enums§

PdfError

Functions§

detect_pdf
Fast metadata-only detection — no text extraction or markdown generation.
detect_pdf_mem
Fast metadata-only detection from a memory buffer.
process_pdf
Process a PDF file with full extraction (detect → extract → markdown).
process_pdf_mem
Process a PDF from a memory buffer with full extraction.
process_pdf_mem_with_config (Deprecated)
Process a PDF from a memory buffer with custom detection and markdown configuration.
process_pdf_mem_with_options
Process a PDF from a memory buffer with custom options.
process_pdf_with_config (Deprecated)
Process a PDF file with custom detection and markdown configuration.
process_pdf_with_config_pages (Deprecated)
Process a PDF file with custom configuration and optional page filter.
process_pdf_with_options
Process a PDF file with custom options.
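
When working with the `_mem` variants, a caller may want to reject obviously non-PDF input before handing the buffer to the crate. The sketch below is a std-only pre-check and is not part of crw_pdf: `looks_like_pdf` is a hypothetical helper name, and the 1024-byte search window is an assumption based on how lenient PDF readers commonly treat leading junk before the `%PDF-` header.

```rust
/// Hypothetical helper, not part of crw_pdf: returns true when the `%PDF-`
/// marker appears within the first 1024 bytes of the buffer.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    const MAGIC: &[u8] = b"%PDF-";
    // Only inspect a small prefix so the check stays cheap on large buffers.
    let window = &bytes[..bytes.len().min(1024)];
    // `windows` yields nothing when the buffer is shorter than the marker.
    window.windows(MAGIC.len()).any(|w| w == MAGIC)
}

fn main() {
    let pdf_like: &[u8] = b"%PDF-1.7\n%\xE2\xE3\xCF\xD3\n";
    let html: &[u8] = b"<!DOCTYPE html><html></html>";

    println!("pdf_like: {}", looks_like_pdf(pdf_like));
    println!("html:     {}", looks_like_pdf(html));
}
```

A buffer that passes this check can still be truncated or malformed, so the `PdfError` returned by the crate's own functions remains the authoritative signal.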