oxidize-pdf 2.5.0

//! Plain text extraction focused on simplicity
//!
//! This module provides simplified text extraction without position metadata.
//! It is designed for use cases that need plain text content without detailed
//! layout information, and it offers a simpler API and output format than the
//! standard `TextExtractor`.
//!
//! # Overview
//!
//! The plain text extractor returns `String` and `Vec<String>` instead of
//! `Vec<TextFragment>`, making it easier to use for text analysis and search
//! indexing where position data is not required.
//!
//! # Use Cases
//!
//! - **Full-text search**: Extract text for indexing without position data
//! - **Content analysis**: Analyze document content without layout
//! - **Text classification**: Feed text to ML models for categorization
//! - **Simple grep operations**: Extract line-by-line for pattern matching
//! - **Copy-paste text extraction**: Get readable text for clipboard use
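/*!
As a rough illustration of the full-text search use case, the sketch below builds a
small inverted index from page text that has already been extracted. It uses only the
standard library. `build_index` is a hypothetical helper, not part of oxidize-pdf, and
it assumes one `String` per page (for example, the `text` field of each result).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical helper (not an oxidize-pdf API): maps each lowercased word
/// to the set of 1-based page numbers on which it appears.
fn build_index(pages: &[String]) -> BTreeMap<String, BTreeSet<usize>> {
    let mut index: BTreeMap<String, BTreeSet<usize>> = BTreeMap::new();
    for (i, text) in pages.iter().enumerate() {
        // Split on anything that is not a letter or digit and skip empty pieces.
        let tokens = text
            .split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty());
        for token in tokens {
            index.entry(token.to_lowercase()).or_default().insert(i + 1);
        }
    }
    index
}
```

Lowercasing the tokens keeps lookups case-insensitive. A production search index would
likely add stemming and stop-word handling on top of this.
*/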
//!
//! # API Design
//!
//! - **Simple output**: Returns `String` and `Vec<String>`, not `Vec<TextFragment>`
//! - **Comparable performance**: Uses the same content stream parser as `TextExtractor`
//! - **Memory efficient**: No position data is stored
//! - **Configurable**: Space and newline detection thresholds can be tuned
//!
//! # Quick Start
//!
//! ```ignore
//! use oxidize_pdf::Document;
//! use oxidize_pdf::text::plaintext::PlainTextExtractor;
//!
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! // Open PDF document
//! let doc = Document::open("document.pdf")?;
//! let page = doc.get_page(1)?;
//!
//! // Extract plain text (default configuration)
//! let extractor = PlainTextExtractor::new();
//! let result = extractor.extract(&doc, page)?;
//!
//! println!(
//!     "Extracted {} characters in {} lines",
//!     result.char_count, result.line_count
//! );
//! println!("{}", result.text);
//! # Ok(())
//! # }
//! ```
//!
//! # Configuration
//!
//! ```ignore
//! use oxidize_pdf::text::plaintext::{PlainTextExtractor, PlainTextConfig, LineBreakMode};
//!
//! let extractor = PlainTextExtractor::with_config(PlainTextConfig {
//!     space_threshold: 0.3,                      // More sensitive space detection
//!     newline_threshold: 12.0,                   // Higher threshold for line breaks
//!     preserve_layout: true,                     // Keep original whitespace
//!     line_break_mode: LineBreakMode::Normalize, // Join hyphenated words
//! });
//! ```
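/*!
To make the two thresholds more concrete, here is one plausible way gap-based
detection can work. This is a sketch under stated assumptions, not oxidize-pdf's
actual implementation: the `Run` type and the `join_runs` function are hypothetical,
the space threshold is assumed to be a multiple of the font size, and the newline
threshold is assumed to be an absolute vertical distance in PDF units.

```rust
/// Hypothetical positioned text run, for illustration only.
struct Run<'a> {
    text: &'a str,
    x: f64,         // left edge of the run
    y: f64,         // baseline position
    width: f64,     // horizontal extent of the run
    font_size: f64, // font size in the same units
}

/// Joins runs in reading order, inserting a newline when the baseline moves
/// by more than `newline_threshold`, or a space when the horizontal gap
/// exceeds `space_threshold` times the previous run's font size.
fn join_runs(runs: &[Run<'_>], space_threshold: f64, newline_threshold: f64) -> String {
    let mut out = String::new();
    let mut prev: Option<&Run<'_>> = None;
    for run in runs {
        if let Some(p) = prev {
            let dy = (run.y - p.y).abs();
            let gap = run.x - (p.x + p.width);
            if dy > newline_threshold {
                out.push('\n');
            } else if gap > space_threshold * p.font_size {
                out.push(' ');
            }
        }
        out.push_str(run.text);
        prev = Some(run);
    }
    out
}
```

Under these assumptions, lowering `space_threshold` inserts spaces at smaller gaps,
and raising `newline_threshold` tolerates larger baseline shifts before a new line
is started.
*/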
//!
//! # Line-by-Line Extraction
//!
//! For grep-like operations or line-based processing:
//!
//! ```ignore
//! use oxidize_pdf::Document;
//! use oxidize_pdf::text::plaintext::PlainTextExtractor;
//!
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! let doc = Document::open("document.pdf")?;
//! let page = doc.get_page(1)?;
//!
//! let extractor = PlainTextExtractor::new();
//! let lines = extractor.extract_lines(&doc, page)?;
//!
//! for line in lines {
//!     println!("{}", line);
//! }
//! # Ok(())
//! # }
//! ```
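/*!
With the lines available as `Vec<String>`, a grep-like filter needs only the standard
library. A minimal sketch, where `grep_lines` is a hypothetical helper (not an
oxidize-pdf API) that does a case-insensitive substring match and reports 1-based
line numbers:

```rust
/// Hypothetical helper: returns `(line_number, line)` pairs for every line
/// that contains `needle`, ignoring case. Line numbers start at 1.
fn grep_lines<'a>(lines: &'a [String], needle: &str) -> Vec<(usize, &'a str)> {
    let needle = needle.to_lowercase();
    lines
        .iter()
        .enumerate()
        .filter(|(_, line)| line.to_lowercase().contains(&needle))
        .map(|(i, line)| (i + 1, line.as_str()))
        .collect()
}
```

The helper borrows from the extracted lines instead of cloning them, so the result
stays cheap even for long pages.
*/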
//!
//! # Comparison with TextExtractor
//!
//! | Feature | PlainTextExtractor | TextExtractor |
//! |---------|-------------------|---------------|
//! | Output Type | String, Vec\<String\> | Vec\<TextFragment\> |
//! | Position Data | ❌ No | ✅ Yes (x, y coordinates) |
//! | Font Metadata | ❌ No | ✅ Yes (size, family) |
//! | API Complexity | **Simple** | More detailed |
//! | Performance | Comparable | Comparable |
//! | Use Case | Text search, indexing | Layout analysis, tables |
//!
//! # Limitations
//!
//! - No precise position information (x, y coordinates)
//! - No font metadata (size, family)
//! - Basic whitespace detection (configurable thresholds)
//! - No multi-column layout detection
//!
//! For layout-aware extraction, use `TextExtractor` instead.

mod extractor;
mod types;

pub use extractor::PlainTextExtractor;
pub use types::{LineBreakMode, PlainTextConfig, PlainTextResult};