//! Plain text extraction focused on simplicity
//!
//! This module provides simplified text extraction without position metadata.
//! It's designed for use cases where you need plain text content without
//! detailed layout information, offering a simpler API and output format
//! compared to the standard TextExtractor.
//!
//! # Overview
//!
//! The plain text extractor returns `String` and `Vec<String>` instead of
//! `Vec<TextFragment>`, making it easier to use for text analysis and search
//! indexing where position data is not required.
//!
//! # Use Cases
//!
//! - **Full-text search**: Extract text for indexing without position data
//! - **Content analysis**: Analyze document content without layout
//! - **Text classification**: Feed text to ML models for categorization
//! - **Simple grep operations**: Extract line-by-line for pattern matching
//! - **Copy-paste text extraction**: Get readable text for clipboard use
//!
//! # API Design
//!
//! - **Simple output**: Returns `String` and `Vec<String>`, not `Vec<TextFragment>`
//! - **Comparable performance**: Uses the same content stream parser as `TextExtractor`
//! - **Memory efficient**: No position data is stored
//! - **Configurable**: Space and newline detection thresholds can be tuned
//!
//! # Quick Start
//!
//! ```ignore
//! use oxidize_pdf::Document;
//! use oxidize_pdf::text::plaintext::PlainTextExtractor;
//!
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! // Open PDF document
//! let doc = Document::open("document.pdf")?;
//! let page = doc.get_page(1)?;
//!
//! // Extract plain text (default configuration)
//! let extractor = PlainTextExtractor::new();
//! let result = extractor.extract(&doc, page)?;
//!
//! println!("Extracted {} characters in {} lines",
//!     result.char_count,
//!     result.line_count
//! );
//! println!("{}", result.text);
//! # Ok(())
//! # }
//! ```
//!
//! # Configuration
//!
//! ```ignore
//! use oxidize_pdf::text::plaintext::{PlainTextExtractor, PlainTextConfig, LineBreakMode};
//!
//! let extractor = PlainTextExtractor::with_config(PlainTextConfig {
//!     space_threshold: 0.3,                      // More sensitive space detection
//!     newline_threshold: 12.0,                   // Higher threshold for line breaks
//!     preserve_layout: true,                     // Keep original whitespace
//!     line_break_mode: LineBreakMode::Normalize, // Join hyphenated words
//! });
//! ```
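//!
//! As a rough illustration of what a threshold such as `space_threshold` could
//! control, the sketch below inserts a space when the horizontal gap between two
//! glyph runs exceeds a fraction of the font size. This is an assumption made for
//! illustration only: the `Run` type, the `join_runs` helper, and the
//! gap-relative-to-font-size rule are hypothetical and do not describe this
//! module's actual implementation.
//!
//! ```rust
//! /// Hypothetical glyph run: text plus its horizontal extent (illustration only).
//! struct Run {
//!     text: &'static str,
//!     x_start: f64,
//!     x_end: f64,
//! }
//!
//! /// Joins runs left to right, inserting a single space whenever the gap to the
//! /// previous run is wider than `space_threshold * font_size`.
//! fn join_runs(runs: &[Run], font_size: f64, space_threshold: f64) -> String {
//!     let mut out = String::new();
//!     let mut prev_end: Option<f64> = None;
//!     for run in runs {
//!         if let Some(end) = prev_end {
//!             if run.x_start - end > space_threshold * font_size {
//!                 out.push(' ');
//!             }
//!         }
//!         out.push_str(run.text);
//!         prev_end = Some(run.x_end);
//!     }
//!     out
//! }
//! ```
//!
//! Under this assumed rule, a lower threshold splits text into words more
//! eagerly, while a higher one tolerates wider gaps inside a word.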
//!
//! # Line-by-Line Extraction
//!
//! For grep-like operations or line-based processing:
//!
//! ```ignore
//! use oxidize_pdf::Document;
//! use oxidize_pdf::text::plaintext::PlainTextExtractor;
//!
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! let doc = Document::open("document.pdf")?;
//! let page = doc.get_page(1)?;
//!
//! let extractor = PlainTextExtractor::new();
//! let lines = extractor.extract_lines(&doc, page)?;
//!
//! for line in lines {
//!     println!("{}", line);
//! }
//! # Ok(())
//! # }
//! ```
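//!
//! Once the lines are available as a `Vec<String>`, grep-like filtering needs
//! nothing beyond the standard library. A minimal sketch, assuming a
//! hypothetical `grep_lines` helper (not part of this module) that pairs each
//! matching line with its 1-based line number:
//!
//! ```rust
//! /// Hypothetical helper: returns `(line_number, line)` for every line that
//! /// contains `needle` as a plain substring.
//! fn grep_lines<'a>(lines: &'a [String], needle: &str) -> Vec<(usize, &'a str)> {
//!     lines
//!         .iter()
//!         .enumerate()
//!         .filter(|(_, line)| line.contains(needle))
//!         .map(|(idx, line)| (idx + 1, line.as_str()))
//!         .collect()
//! }
//! ```
//!
//! The helper borrows from the extracted lines rather than cloning them, so the
//! `Vec<String>` must outlive the returned matches.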
//!
//! # Comparison with TextExtractor
//!
//! | Feature | PlainTextExtractor | TextExtractor |
//! |---------|-------------------|---------------|
//! | Output Type | String, Vec\<String\> | Vec\<TextFragment\> |
//! | Position Data | ❌ No | ✅ Yes (x, y coordinates) |
//! | Font Metadata | ❌ No | ✅ Yes (size, family) |
//! | API Complexity | **Simple** | More detailed |
//! | Performance | Comparable | Comparable |
//! | Use Case | Text search, indexing | Layout analysis, tables |
//!
//! # Limitations
//!
//! - No precise position information (x, y coordinates)
//! - No font metadata (size, family)
//! - Basic whitespace detection (configurable thresholds)
//! - No multi-column layout detection
//!
//! For layout-aware extraction, use `TextExtractor` instead.
pub use PlainTextExtractor;
pub use ;