Smart PDF detection and text extraction using lopdf
§Quick start
// Full processing (detect + extract + markdown) with defaults
let result = crw_pdf::process_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", result.pdf_type, result.page_count);
if let Some(md) = &result.markdown {
    println!("{md}");
}

// Fast metadata-only detection (no text extraction)
let info = crw_pdf::detect_pdf("document.pdf").unwrap();
println!("type: {:?}, pages: {}", info.pdf_type, info.page_count);

// Custom options via builder
use crw_pdf::{PdfOptions, ProcessMode};
let result = crw_pdf::process_pdf_with_options(
    "document.pdf",
    PdfOptions::new().mode(ProcessMode::Analyze),
).unwrap();

Re-exports§
pub use detector::detect_pdf_type;
pub use detector::detect_pdf_type_mem;
pub use detector::detect_pdf_type_mem_with_config;
pub use detector::detect_pdf_type_with_config;
pub use detector::DetectionConfig;
pub use detector::PdfType;
pub use detector::PdfTypeResult;
pub use detector::ScanStrategy;
pub use extractor::extract_text;
pub use extractor::extract_text_with_positions;
pub use extractor::extract_text_with_positions_pages;
pub use markdown::to_markdown;
pub use markdown::to_markdown_from_items;
pub use markdown::to_markdown_from_items_with_rects;
pub use markdown::MarkdownOptions;
pub use process_mode::ProcessMode;
pub use types::LayoutComplexity;
pub use types::PdfLine;
pub use types::PdfRect;
pub use types::TextItem;
Modules§
- adobe_korea1: Adobe-Korea1 CID-to-Unicode mapping table.
- detector: Smart PDF type detection without full document load.
- extractor: Text extraction from PDF using lopdf.
- glyph_names: Glyph name to Unicode mapping.
- markdown: Markdown conversion with structure detection.
- process_mode
- structure_tree: Tagged PDF structure tree parser.
- tables: Table detection and formatting.
- text_utils: Character classification and text utility functions.
- tounicode: ToUnicode CMap parsing for PDF text extraction.
- types: Shared types used across the extraction and markdown pipelines.
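To give a sense of what a glyph-name-to-Unicode mapping involves, here is a minimal sketch of decoding the algorithmic glyph names `uniXXXX` and `uXXXX` to `uXXXXXX` described in the Adobe Glyph List conventions. It is not the `glyph_names` module's implementation: the function name `glyph_name_to_char` is hypothetical, and the sketch ignores table-based names such as `space` and multi-code-unit `uni` names.

```rust
/// Hypothetical helper (not part of crw_pdf): decode an algorithmic glyph
/// name into a single Unicode scalar value.
///
/// Accepted forms:
///   - `uni` followed by exactly four uppercase hex digits
///   - `u` followed by four to six uppercase hex digits
fn glyph_name_to_char(name: &str) -> Option<char> {
    // Pick out the hex part, checking the length allowed for each prefix.
    let hex = if let Some(rest) = name.strip_prefix("uni") {
        if rest.len() != 4 {
            return None;
        }
        rest
    } else if let Some(rest) = name.strip_prefix('u') {
        if !(4..=6).contains(&rest.len()) {
            return None;
        }
        rest
    } else {
        return None;
    };

    // Only digits and uppercase A-F are accepted in this sketch.
    let all_upper_hex = hex
        .bytes()
        .all(|b| b.is_ascii_digit() || (b'A'..=b'F').contains(&b));
    if !all_upper_hex {
        return None;
    }

    // `char::from_u32` rejects surrogates and out-of-range values.
    let code = u32::from_str_radix(hex, 16).ok()?;
    char::from_u32(code)
}
```

A real mapping would typically consult a name table first and fall back to this kind of algorithmic decoding only for names the table does not contain.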
Structs§
- PdfOptions: Configuration for process_pdf_with_options and friends.
- PdfProcessResult: High-level PDF processing result.
Enums§
Functions§
- detect_pdf: Fast metadata-only detection, with no text extraction or markdown generation.
- detect_pdf_mem: Fast metadata-only detection from a memory buffer.
- process_pdf: Process a PDF file with full extraction (detect → extract → markdown).
- process_pdf_mem: Process a PDF from a memory buffer with full extraction.
- process_pdf_mem_with_config (Deprecated): Process a PDF from a memory buffer with custom detection and markdown configuration.
- process_pdf_mem_with_options: Process a PDF from a memory buffer with custom options.
- process_pdf_with_config (Deprecated): Process a PDF file with custom detection and markdown configuration.
- process_pdf_with_config_pages (Deprecated): Process a PDF file with custom configuration and optional page filter.
- process_pdf_with_options: Process a PDF file with custom options.
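
When working with the memory-buffer variants, it can help to reject obviously non-PDF input before handing the bytes to the crate. The sketch below uses only the standard library and checks for the `%PDF-` file header. The helper names `looks_like_pdf` and `read_pdf_bytes` are hypothetical and not part of crw_pdf, and the exact parameter types of the `_mem` functions are not shown on this page, so no call to them is made here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: true if the buffer starts with the `%PDF-` header.
/// This is a strict check; some PDF readers also tolerate leading junk
/// before the header, which this sketch does not.
fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Hypothetical helper: read a file into memory and fail early with
/// `InvalidData` if it does not look like a PDF.
fn read_pdf_bytes(path: &Path) -> io::Result<Vec<u8>> {
    let bytes = fs::read(path)?;
    if looks_like_pdf(&bytes) {
        Ok(bytes)
    } else {
        Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "missing %PDF- header",
        ))
    }
}
```

The returned buffer could then be passed to one of the `_mem` functions listed above, subject to whatever argument type they actually accept.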