Struct InvoiceExtractor

pub struct InvoiceExtractor { /* private fields */ }

Invoice data extractor with configurable pattern matching

This is the main entry point for invoice extraction. Use the builder pattern to configure language, confidence thresholds, and other options.

§Examples

use oxidize_pdf::text::invoice::InvoiceExtractor;

// Spanish invoices with high confidence threshold and kerning-aware spacing
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.85)
    .use_kerning(true)  // Enables font-aware spacing in text reconstruction
    .build();

§Thread Safety

InvoiceExtractor is immutable after construction and can be safely shared across threads. Consider creating one extractor per language and reusing it.
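The sharing pattern described above can be sketched with the standard library alone. `StubExtractor` and `run_shared` are hypothetical stand-ins, not part of oxidize_pdf; the sketch assumes the real extractor is `Send + Sync`, as the paragraph implies, so the same `Arc` shape should carry over.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an immutable, already-built extractor.
struct StubExtractor {
    language: String,
}

impl StubExtractor {
    /// Read-only method: it takes `&self`, so concurrent calls need no locking.
    fn describe(&self, page: usize) -> String {
        format!("{}:page-{}", self.language, page)
    }
}

/// Builds one extractor, shares it across `workers` threads,
/// and collects the results in spawn order.
fn run_shared(workers: usize) -> Vec<String> {
    let extractor = Arc::new(StubExtractor {
        language: "es".to_string(),
    });

    let handles: Vec<_> = (0..workers)
        .map(|page| {
            // Each thread gets its own reference-counted handle to the same value.
            let shared = Arc::clone(&extractor);
            thread::spawn(move || shared.describe(page))
        })
        .collect();

    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Because nothing is mutated after construction, no `Mutex` is needed; the `Arc` only keeps the value alive until the last thread finishes.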

Implementations§

impl InvoiceExtractor

pub fn builder() -> InvoiceExtractorBuilder

Create a new builder for configuring the extractor

This is the recommended way to create an InvoiceExtractor.

§Examples

use oxidize_pdf::text::invoice::InvoiceExtractor;

let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.8)
    .build();

pub fn extract(&self, text_fragments: &[TextFragment]) -> Result<InvoiceData>

Extract structured invoice data from text fragments

This is the main extraction method. It processes text fragments from a PDF page and returns structured invoice data with confidence scores.

§Process

1. Text fragments are reconstructed into full text
2. Language-specific patterns are applied
3. Matches are converted to typed fields
4. Confidence scores are calculated
5. Low-confidence fields are filtered out
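The steps above can be illustrated with a toy pipeline that uses only the standard library. Everything here is an assumption made for illustration: `Fragment`, `Field`, `PATTERNS`, and `extract_sketch` are hypothetical names, and the label-prefix match stands in for whatever language-specific patterns the library actually applies.

```rust
/// Hypothetical stand-in for a text fragment (position data omitted).
struct Fragment {
    text: String,
}

/// Hypothetical extracted field with a confidence score in 0.0..=1.0.
#[derive(Debug, PartialEq)]
struct Field {
    name: &'static str,
    value: String,
    confidence: f64,
}

/// Toy patterns: (field name, label to look for, base confidence for a hit).
const PATTERNS: [(&str, &str, f64); 2] = [
    ("invoice_number", "Invoice Number:", 0.9),
    ("total", "Total:", 0.6),
];

fn extract_sketch(fragments: &[Fragment], threshold: f64) -> Vec<Field> {
    // Step 1: reconstruct full text (one fragment per line in this sketch).
    let text = fragments
        .iter()
        .map(|f| f.text.as_str())
        .collect::<Vec<_>>()
        .join("\n");

    let mut fields = Vec::new();
    for line in text.lines() {
        // Step 2: apply each pattern to the line.
        for (name, label, base) in PATTERNS {
            if let Some(rest) = line.trim().strip_prefix(label) {
                // Step 3: convert the match into a typed field.
                let value = rest.trim().to_string();
                // Step 4: score the match; an empty value halves the confidence.
                let confidence = if value.is_empty() { base * 0.5 } else { base };
                fields.push(Field { name, value, confidence });
            }
        }
    }

    // Step 5: drop fields below the configured threshold.
    fields.retain(|f| f.confidence >= threshold);
    fields
}
```

With a threshold of 0.7, the `total` pattern's base score of 0.6 is filtered out in the last step, which mirrors how a higher `confidence_threshold` trades recall for precision.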

§Arguments

text_fragments - Text fragments extracted from a PDF page (as produced by TextExtractor)

§Returns

Returns Ok(InvoiceData) with extracted fields, or Err if:

No text fragments provided
PDF page is empty
Text extraction failed

§Examples

use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
use oxidize_pdf::text::invoice::InvoiceExtractor;
use oxidize_pdf::Document;

let doc = Document::open("invoice.pdf")?;
let page = doc.get_page(1)?;

// Extract text
let text_extractor = TextExtractor::new();
let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;

// Extract invoice data
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .build();

let invoice = extractor.extract(&extracted.fragments)?;

// Access extracted fields
for field in &invoice.fields {
    println!("{}: {:?} (confidence: {:.2})",
        field.field_type.name(),
        field.field_type,
        field.confidence
    );
}

§Performance

Extraction is CPU-bound and typically completes in under 100 ms for standard invoices. The extractor can be safely reused across multiple pages and threads.

pub fn extract_from_text(&self, text: &str) -> Result<InvoiceData>

Extract invoice data from plain text (convenience method for testing)

This is a convenience wrapper around extract() that creates synthetic TextFragment objects from plain text input. Primarily useful for testing and simple scenarios where you don’t have actual PDF text fragments.

Note: This method creates fragments without position information, so proximity-based scoring may be less accurate than with real PDF fragments.
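To see why missing positions weaken proximity-based scoring, consider the following sketch. `SyntheticFragment`, `fragments_from_text`, and `distance` are hypothetical; the real `TextFragment` layout and the way this method splits text are not documented here, so one fragment per non-empty line at the origin is an assumption.

```rust
/// Hypothetical fragment type; the real `TextFragment` has its own fields.
#[derive(Debug, Clone, PartialEq)]
struct SyntheticFragment {
    text: String,
    x: f64,
    y: f64,
}

/// Turns plain text into one fragment per non-empty line, all placed at (0, 0).
fn fragments_from_text(text: &str) -> Vec<SyntheticFragment> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| SyntheticFragment {
            text: line.to_string(),
            x: 0.0,
            y: 0.0,
        })
        .collect()
}

/// Euclidean distance between two fragments. For synthetic input every
/// pair is 0.0 apart, so distance cannot rank one candidate over another.
fn distance(a: &SyntheticFragment, b: &SyntheticFragment) -> f64 {
    ((a.x - b.x).powi(2) + (a.y - b.y).powi(2)).sqrt()
}
```

Since every pair of synthetic fragments is equally "close", any score component derived from layout contributes no signal, and only the textual match remains.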

§Arguments

text - Plain text string to extract invoice data from

§Returns

Returns Ok(InvoiceData) with extracted fields, or Err if the text is empty.

§Examples

use oxidize_pdf::text::invoice::InvoiceExtractor;

let extractor = InvoiceExtractor::builder()
    .with_language("en")
    .confidence_threshold(0.7)
    .build();

let invoice_text = "Invoice Number: INV-001\nTotal: £100.00";
let result = extractor.extract_from_text(invoice_text)?;

assert!(!result.fields.is_empty());

Auto Trait Implementations§

impl UnwindSafe for InvoiceExtractor

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.
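As a small illustration of that rule, implementing `From` once makes `into()` available through the blanket impl. `Confidence` and `to_confidence` are hypothetical names chosen for this example and are not types from the crate.

```rust
/// Hypothetical newtype used only for this illustration.
#[derive(Debug, PartialEq)]
struct Confidence(f64);

impl From<f64> for Confidence {
    /// Clamps the raw score into the range 0.0..=1.0.
    fn from(raw: f64) -> Self {
        Confidence(raw.clamp(0.0, 1.0))
    }
}

fn to_confidence(raw: f64) -> Confidence {
    // `into()` resolves to `Confidence::from(raw)` via the blanket
    // `impl<T, U> Into<U> for T where U: From<T>`.
    raw.into()
}
```

The clamping lives only in the `From` impl, yet `to_confidence(1.5)` yields `Confidence(1.0)`, because `into()` does nothing beyond delegating to `from`.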

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.