pub struct InvoiceExtractor { /* private fields */ }
Invoice data extractor with configurable pattern matching
This is the main entry point for invoice extraction. Use the builder pattern to configure language, confidence thresholds, and other options.
§Examples
use oxidize_pdf::text::invoice::InvoiceExtractor;
// Spanish invoices with high confidence threshold and kerning-aware spacing
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.85)
    .use_kerning(true) // Enables font-aware spacing in text reconstruction
    .build();
§Thread Safety
InvoiceExtractor is immutable after construction and can be safely shared
across threads. Consider creating one extractor per language and reusing it.
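A minimal sketch of that reuse pattern, using only the standard library. `StubExtractor` and the body of its `extract_from_text` are hypothetical stand-ins so the block compiles on its own; in real code the shared value would be the `InvoiceExtractor` returned by `build()`.

```rust
use std::thread;

/// Hypothetical stand-in for an immutable, `Sync` extractor.
struct StubExtractor;

impl StubExtractor {
    /// Placeholder logic: counts non-empty lines instead of extracting fields.
    fn extract_from_text(&self, text: &str) -> usize {
        text.lines().filter(|line| !line.trim().is_empty()).count()
    }
}

/// Runs one scoped thread per page, all borrowing the same extractor.
fn extract_all(extractor: &StubExtractor, pages: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn every worker first so the pages are processed concurrently.
        let handles: Vec<_> = pages
            .iter()
            .map(|page| scope.spawn(move || extractor.extract_from_text(page)))
            .collect();

        // Join in spawn order so results line up with the input pages.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because the workers only borrow the extractor, no `Arc` or cloning is needed here. An `Arc` would be the usual choice if the threads had to outlive the caller.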
Implementations§
impl InvoiceExtractor
pub fn builder() -> InvoiceExtractorBuilder
Create a new builder for configuring the extractor
This is the recommended way to create an InvoiceExtractor.
§Examples
use oxidize_pdf::text::invoice::InvoiceExtractor;
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.8)
    .build();
pub fn extract(&self, text_fragments: &[TextFragment]) -> Result<InvoiceData>
Extract structured invoice data from text fragments
This is the main extraction method. It processes text fragments from a PDF page and returns structured invoice data with confidence scores.
§Process
- Text fragments are reconstructed into full text
- Language-specific patterns are applied
- Matches are converted to typed fields
- Confidence scores are calculated
- Low-confidence fields are filtered out
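The five steps above can be sketched with the standard library alone. This is an illustration of the flow, not the crate's implementation: `Value`, `Field`, `LabelPattern`, `toy_patterns`, the label strings, and the half-confidence penalty are all assumptions made for the example, and plain prefix matching stands in for the real language-specific patterns.

```rust
/// Hypothetical typed value; the crate's real field types are not shown here.
#[derive(Debug, PartialEq)]
enum Value {
    Text(String),
    Amount(f64),
}

#[derive(Debug, PartialEq)]
struct Field {
    name: &'static str,
    value: Value,
    confidence: f64,
}

/// Toy pattern: a label prefix, whether the value is numeric, and a base score.
struct LabelPattern {
    label: &'static str,
    name: &'static str,
    numeric: bool,
    base_confidence: f64,
}

/// Illustrative patterns only; real pattern sets are language-specific.
fn toy_patterns() -> Vec<LabelPattern> {
    vec![
        LabelPattern { label: "Factura:", name: "invoice_number", numeric: false, base_confidence: 0.9 },
        LabelPattern { label: "Total:", name: "total", numeric: true, base_confidence: 0.8 },
    ]
}

fn extract_sketch(fragments: &[&str], patterns: &[LabelPattern], threshold: f64) -> Vec<Field> {
    // 1. Reconstruct full text from the fragments (here: one fragment per line).
    let text = fragments.join("\n");

    let mut fields = Vec::new();
    for line in text.lines() {
        for pattern in patterns {
            // 2. Apply the pattern (here: a plain label prefix, not a regex).
            let Some(rest) = line.trim().strip_prefix(pattern.label) else {
                continue;
            };
            let raw = rest.trim();

            // 3. Convert the match to a typed value, and 4. score it.
            // Assumed heuristic: a numeric field that fails to parse is kept
            // as text at half the base confidence.
            let (value, confidence) = if pattern.numeric {
                match raw.parse::<f64>() {
                    Ok(amount) => (Value::Amount(amount), pattern.base_confidence),
                    Err(_) => (Value::Text(raw.to_string()), pattern.base_confidence * 0.5),
                }
            } else {
                (Value::Text(raw.to_string()), pattern.base_confidence)
            };
            fields.push(Field { name: pattern.name, value, confidence });
        }
    }

    // 5. Filter out fields below the confidence threshold.
    fields.retain(|field| field.confidence >= threshold);
    fields
}
```

With a threshold of 0.7, a line such as `Total: n/a` is matched but then dropped in step 5, which mirrors how a higher `confidence_threshold` trades recall for precision.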
§Arguments
text_fragments - Text fragments extracted from a PDF page (from TextExtractor)
§Returns
Returns Ok(InvoiceData) with extracted fields, or Err if:
- No text fragments provided
- PDF page is empty
- Text extraction failed
§Examples
use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
use oxidize_pdf::text::invoice::InvoiceExtractor;
use oxidize_pdf::Document;
let doc = Document::open("invoice.pdf")?;
let page = doc.get_page(1)?;
// Extract text
let text_extractor = TextExtractor::new();
let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;
// Extract invoice data
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .build();
let invoice = extractor.extract(&extracted.fragments)?;
// Access extracted fields
for field in &invoice.fields {
    println!("{}: {:?} (confidence: {:.2})",
        field.field_type.name(),
        field.field_type,
        field.confidence
    );
}
§Performance
Extraction is CPU-bound and typically completes in <100ms for standard invoices. The extractor can be safely reused across multiple pages and threads.
pub fn extract_from_text(&self, text: &str) -> Result<InvoiceData>
Extract invoice data from plain text (convenience method for testing)
This is a convenience wrapper around extract() that creates synthetic
TextFragment objects from plain text input. Primarily useful for testing
and simple scenarios where you don’t have actual PDF text fragments.
Note: This method creates fragments without position information, so proximity-based scoring may be less accurate than with real PDF fragments.
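To see why missing positions weaken proximity-based scoring, consider this sketch. It assumes that "without position information" means every synthetic fragment sits at the same coordinates; `SyntheticFragment`, `fragments_from_text`, and `distance` are hypothetical names for illustration, not the crate's types.

```rust
/// Hypothetical fragment with a 2D position, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct SyntheticFragment {
    text: String,
    x: f64,
    y: f64,
}

/// Builds one fragment per non-empty line, all placed at the origin
/// (assumed behavior for fragments created from plain text).
fn fragments_from_text(text: &str) -> Vec<SyntheticFragment> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| SyntheticFragment { text: line.to_string(), x: 0.0, y: 0.0 })
        .collect()
}

/// Euclidean distance between two fragments.
fn distance(a: &SyntheticFragment, b: &SyntheticFragment) -> f64 {
    ((a.x - b.x).powi(2) + (a.y - b.y).powi(2)).sqrt()
}
```

Under that assumption every pair of fragments is at distance zero, so a score that rewards a value for being close to its label cannot tell candidates apart and only the text patterns contribute.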
§Arguments
text - Plain text string to extract invoice data from
§Returns
Returns Ok(InvoiceData) with extracted fields, or Err if text is empty.
§Examples
use oxidize_pdf::text::invoice::InvoiceExtractor;
let extractor = InvoiceExtractor::builder()
    .with_language("en")
    .confidence_threshold(0.7)
    .build();
let invoice_text = "Invoice Number: INV-001\nTotal: £100.00";
let result = extractor.extract_from_text(invoice_text)?;
assert!(!result.fields.is_empty());
Auto Trait Implementations§
impl Freeze for InvoiceExtractor
impl RefUnwindSafe for InvoiceExtractor
impl Send for InvoiceExtractor
impl Sync for InvoiceExtractor
impl Unpin for InvoiceExtractor
impl UnwindSafe for InvoiceExtractor
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.