//! Invoice data extraction from PDF documents
//!
//! This module provides a complete system for extracting structured invoice data from PDF
//! documents using pattern matching and confidence scoring.
//!
//! # Overview
//!
//! The invoice extraction system automatically identifies and extracts common invoice fields
//! such as invoice numbers, dates, amounts, and tax information from PDF pages. It supports
//! multiple languages and provides confidence scores for each extracted field.
//!
//! # Supported Languages
//!
//! - Spanish (ES) - "Factura", "CIF", "Base Imponible", "IVA"
//! - English (EN) - "Invoice", "VAT Number", "Subtotal", "Total"
//! - German (DE) - "Rechnung", "USt-IdNr.", "Nettobetrag", "MwSt."
//! - Italian (IT) - "Fattura", "Partita IVA", "Imponibile", "IVA"
//!
//! # Supported Fields
//!
//! - Invoice Number
//! - Invoice Date & Due Date
//! - Total Amount, Tax Amount, Net Amount
//! - VAT/Tax ID Numbers
//! - Supplier & Customer Names
//! - Currency
//! - Line Items (Article Number, Description, Quantity, Unit Price)
//!
//! # Quick Start
//!
//! ```ignore
//! use oxidize_pdf::Document;
//! use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
//! use oxidize_pdf::text::invoice::InvoiceExtractor;
//!
//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
//! // 1. Open PDF document
//! let doc = Document::open("invoice.pdf")?;
//! let page = doc.get_page(1)?;
//!
//! // 2. Extract text from page
//! let text_extractor = TextExtractor::new();
//! let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;
//!
//! // 3. Extract invoice data
//! let invoice_extractor = InvoiceExtractor::builder()
//!     .with_language("es")          // Spanish invoices
//!     .confidence_threshold(0.7)    // 70% minimum confidence
//!     .build();
//!
//! let invoice = invoice_extractor.extract(&extracted.fragments)?;
//!
//! // 4. Access extracted fields
//! println!("Found {} fields with {:.0}% confidence",
//!     invoice.field_count(),
//!     invoice.metadata.extraction_confidence * 100.0
//! );
//!
//! for field in &invoice.fields {
//!     println!("{}: {:?} ({:.0}% confidence)",
//!         field.field_type.name(),
//!         field.field_type,
//!         field.confidence * 100.0
//!     );
//! }
//! # Ok(())
//! # }
//! ```
//!
//! # Architecture
//!
//! The extraction process follows a pipeline:
//!
//! ```text
//! PDF Page → Text Extraction → Pattern Matching → Type Conversion → Confidence Scoring → Filtered Results
//! ```
//!
//! 1. **Text Extraction**: Extract raw text fragments from the PDF page
//! 2. **Pattern Matching**: Apply language-specific regex patterns
//! 3. **Type Conversion**: Convert strings to typed data (amounts, dates)
//! 4. **Confidence Scoring**: Calculate confidence for each match
//! 5. **Filtering**: Remove low-confidence results
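//!
//! The stages after text extraction can be sketched roughly as follows. This is a minimal,
//! dependency-free illustration and not this module's implementation: `Value`, `Match`,
//! `run_pipeline`, and the two English rules are hypothetical names, and a plain label-prefix
//! search stands in for the language-specific regex patterns.
//!
//! ```rust
//! /// Hypothetical typed value produced by the type-conversion stage.
//! #[derive(Debug, PartialEq)]
//! enum Value {
//!     Text(String),
//!     Amount(f64),
//! }
//!
//! /// Hypothetical match record; not the crate's actual field type.
//! #[derive(Debug, PartialEq)]
//! struct Match {
//!     label: &'static str,
//!     value: Value,
//!     confidence: f64,
//! }
//!
//! /// Runs stages 2 to 5 over already-extracted text lines.
//! fn run_pipeline(lines: &[&str], threshold: f64) -> Vec<Match> {
//!     // (label, is_amount, base confidence): illustrative English rules only.
//!     const RULES: [(&str, bool, f64); 2] = [("Invoice", false, 0.9), ("Total", true, 0.9)];
//!     let mut matches = Vec::new();
//!     for line in lines {
//!         for (label, is_amount, confidence) in RULES {
//!             // Pattern matching: a label prefix stands in for a regex.
//!             let Some(rest) = line.strip_prefix(label) else { continue };
//!             let raw = rest.trim_start_matches(':').trim();
//!             // Type conversion: amounts become f64, everything else stays text.
//!             let value = if is_amount {
//!                 match raw.parse::<f64>() {
//!                     Ok(amount) => Value::Amount(amount),
//!                     Err(_) => continue, // unparseable amount: drop the match
//!                 }
//!             } else {
//!                 Value::Text(raw.to_string())
//!             };
//!             // Confidence scoring and filtering: keep matches at or above the threshold.
//!             if confidence >= threshold {
//!                 matches.push(Match { label, value, confidence });
//!             }
//!         }
//!     }
//!     matches
//! }
//! ```
//!
//! Calling `run_pipeline(&["Invoice: INV-001", "Total: 1234.56"], 0.7)` on this sketch yields
//! one text match and one amount match. Lines that carry no known label are ignored.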
//!
//! # Confidence Scores
//!
//! Each extracted field carries a confidence score from 0.0 to 1.0. The score depends on the field category:
//!
//! - **0.9**: Critical fields (invoice number, total amount)
//! - **0.8**: Important fields (dates, tax amounts)
//! - **0.7**: Standard fields (VAT numbers, names)
//!
//! Fields below the configured threshold are automatically filtered out.
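//!
//! As a rough illustration of how the threshold interacts with the scores listed above, the
//! sketch below assigns each category its listed score and drops anything under the threshold.
//! `Category`, `base_confidence`, and `filter_by_threshold` are hypothetical names chosen for
//! this example; the extractor's real scoring logic may differ.
//!
//! ```rust
//! /// Hypothetical field categories mirroring the three tiers listed above.
//! #[derive(Debug, Clone, Copy, PartialEq)]
//! enum Category {
//!     Critical,  // invoice number, total amount
//!     Important, // dates, tax amounts
//!     Standard,  // VAT numbers, names
//! }
//!
//! /// Maps a category to the score listed for its tier.
//! fn base_confidence(category: Category) -> f64 {
//!     match category {
//!         Category::Critical => 0.9,
//!         Category::Important => 0.8,
//!         Category::Standard => 0.7,
//!     }
//! }
//!
//! /// Keeps only fields whose score is at or above `threshold`.
//! fn filter_by_threshold<'a>(
//!     fields: &[(&'a str, Category)],
//!     threshold: f64,
//! ) -> Vec<(&'a str, f64)> {
//!     fields
//!         .iter()
//!         .map(|&(name, category)| (name, base_confidence(category)))
//!         .filter(|&(_, confidence)| confidence >= threshold)
//!         .collect()
//! }
//! ```
//!
//! With a threshold of 0.8, this sketch keeps critical and important fields and drops standard
//! ones. A field that scores exactly at the threshold is kept.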
//!
//! # Multi-Language Support
//!
//! Language selection affects:
//! - Pattern matching (field labels and formats)
//! - Number parsing (1.234,56 vs 1,234.56)
//! - Date formats (DD/MM/YYYY vs DD.MM.YYYY)
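//!
//! To make the number-parsing difference concrete, here is a minimal sketch of
//! language-dependent amount parsing. `parse_amount` is a hypothetical helper, and the
//! mapping of `"es"`, `"de"`, and `"it"` to the `1.234,56` convention is an assumption based on
//! common usage in those locales, not a description of this module's internals.
//!
//! ```rust
//! /// Parses an amount string using the separators assumed for `lang`.
//! fn parse_amount(raw: &str, lang: &str) -> Option<f64> {
//!     // Assumption: Spanish, German, and Italian invoices write 1.234,56;
//!     // every other language code falls back to the English 1,234.56 form.
//!     let (thousands, decimal) = match lang {
//!         "es" | "de" | "it" => ('.', ','),
//!         _ => (',', '.'),
//!     };
//!     let normalized: String = raw
//!         .trim()
//!         .chars()
//!         .filter(|&c| c != thousands) // drop grouping separators
//!         .map(|c| if c == decimal { '.' } else { c }) // normalize the decimal mark
//!         .collect();
//!     normalized.parse().ok()
//! }
//! ```
//!
//! Under these assumptions, `parse_amount("1.234,56", "es")` and
//! `parse_amount("1,234.56", "en")` both return `Some(1234.56)`, while input that is not a
//! number returns `None`.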
//!
//! # Examples
//!
//! See the `examples/` directory for complete examples:
//! - `invoice_extraction_basic.rs` - Simple extraction with field display
//! - `invoice_extraction_advanced.rs` - Batch processing and JSON export
//!
//! # Limitations (MVP)
//!
//! - Single-page invoices only
//! - No line item extraction (coming soon)
//! - No validation or calculation verification
//! - Dates stored as strings (not parsed to Date type)
//!
//! # Performance
//!
//! Extraction typically completes in under 100 ms for a standard invoice.
//! The extractor is thread-safe and can be reused across multiple pages.
pub use ;
pub use ;
pub use ;
pub use ;