Struct InvoiceExtractor 

pub struct InvoiceExtractor { /* private fields */ }

Invoice data extractor with configurable pattern matching

This is the main entry point for invoice extraction. Use the builder pattern to configure language, confidence thresholds, and other options.

§Examples

use oxidize_pdf::text::invoice::InvoiceExtractor;

// Spanish invoices with high confidence threshold and kerning-aware spacing
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.85)
    .use_kerning(true)  // Enables font-aware spacing in text reconstruction
    .build();

§Thread Safety

InvoiceExtractor is immutable after construction and can be safely shared across threads. Consider creating one extractor per language and reusing it.
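A minimal sketch of the sharing pattern described above, using only the standard library. `StubExtractor` and its `extract_from_text` method are hypothetical stand-ins for an immutable extractor, not the oxidize_pdf types; the sketch only illustrates that a single instance behind an `Arc` can serve several worker threads.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for an immutable extractor (not the oxidize_pdf type).
struct StubExtractor {
    language: String,
}

impl StubExtractor {
    /// Placeholder "extraction": tags the non-empty line count with the language.
    /// It takes `&self` only, so concurrent calls never mutate shared state.
    fn extract_from_text(&self, text: &str) -> String {
        let lines = text.lines().filter(|line| !line.trim().is_empty()).count();
        format!("{}:{}", self.language, lines)
    }
}

/// Processes every document on its own thread while sharing one extractor.
fn process_in_parallel(extractor: Arc<StubExtractor>, docs: Vec<String>) -> Vec<String> {
    let handles: Vec<_> = docs
        .into_iter()
        .map(|doc| {
            let shared = Arc::clone(&extractor);
            thread::spawn(move || shared.extract_from_text(&doc))
        })
        .collect();

    // Joining in spawn order keeps results aligned with the input order.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Under this assumption, one extractor per language could be built at startup, wrapped in an `Arc`, and cloned cheaply into each worker.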

Implementations§

impl InvoiceExtractor

pub fn builder() -> InvoiceExtractorBuilder

Create a new builder for configuring the extractor

This is the recommended way to create an InvoiceExtractor.

§Examples
use oxidize_pdf::text::invoice::InvoiceExtractor;

let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .confidence_threshold(0.8)
    .build();

pub fn extract(&self, text_fragments: &[TextFragment]) -> Result<InvoiceData>

Extract structured invoice data from text fragments

This is the main extraction method. It processes text fragments from a PDF page and returns structured invoice data with confidence scores.

§Process
  1. Text fragments are reconstructed into full text
  2. Language-specific patterns are applied
  3. Matches are converted to typed fields
  4. Confidence scores are calculated
  5. Low-confidence fields are filtered out
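The five steps above can be sketched with a toy pipeline. Everything in this block is an assumption made for illustration: the `Fragment`, `Field`, and `FieldValue` types, the label-prefix "patterns", and the constant confidence scores are hypothetical and do not reflect how the crate implements extraction.

```rust
/// Hypothetical positioned fragment (not the crate's TextFragment).
#[derive(Debug, Clone)]
struct Fragment {
    text: String,
    x: f64,
    y: f64,
}

/// Hypothetical typed field value.
#[derive(Debug, Clone, PartialEq)]
enum FieldValue {
    Text(String),
    Amount(f64),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: &'static str,
    value: FieldValue,
    confidence: f64,
}

/// Step 1: order fragments top-to-bottom, left-to-right, and join them into lines.
/// Assumes PDF-style coordinates where a larger `y` is higher on the page.
fn reconstruct(fragments: &[Fragment]) -> String {
    let mut sorted: Vec<&Fragment> = fragments.iter().collect();
    sorted.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));

    let mut text = String::new();
    let mut last_y: Option<f64> = None;
    for frag in sorted {
        match last_y {
            Some(y) if y == frag.y => text.push(' '),
            Some(_) => text.push('\n'),
            None => {}
        }
        text.push_str(&frag.text);
        last_y = Some(frag.y);
    }
    text
}

/// Steps 2-5: match labels, convert values, score, and filter by threshold.
fn extract(fragments: &[Fragment], threshold: f64) -> Vec<Field> {
    // Step 2: toy "language-specific patterns" as (label, field name) pairs.
    let patterns: [(&str, &'static str); 2] =
        [("Factura:", "invoice_number"), ("Total:", "total")];

    let text = reconstruct(fragments);
    let mut fields = Vec::new();
    for line in text.lines() {
        for &(label, name) in &patterns {
            let Some(raw) = line.strip_prefix(label) else { continue };
            let raw = raw.trim();
            // Steps 3-4: typed conversion; the scores are made-up constants.
            let (value, confidence) = match (name, raw.parse::<f64>()) {
                ("total", Ok(amount)) => (FieldValue::Amount(amount), 0.9),
                ("total", Err(_)) => (FieldValue::Text(raw.to_string()), 0.4),
                _ => (FieldValue::Text(raw.to_string()), 0.8),
            };
            fields.push(Field { name, value, confidence });
        }
    }
    // Step 5: drop fields below the confidence threshold.
    fields.retain(|field| field.confidence >= threshold);
    fields
}
```

In this sketch, raising `threshold` from 0.5 to 0.85 drops the text-valued invoice number and keeps only the parsed total, which mirrors the effect a stricter `confidence_threshold` would be expected to have.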
§Arguments
  • text_fragments - Text fragments extracted from PDF page (from TextExtractor)
§Returns

Returns Ok(InvoiceData) with extracted fields, or Err if:

  • No text fragments provided
  • PDF page is empty
  • Text extraction failed
§Examples
use oxidize_pdf::text::extraction::{TextExtractor, ExtractionOptions};
use oxidize_pdf::text::invoice::InvoiceExtractor;
use oxidize_pdf::Document;

let doc = Document::open("invoice.pdf")?;
let page = doc.get_page(1)?;

// Extract text
let text_extractor = TextExtractor::new();
let extracted = text_extractor.extract_text(&doc, page, &ExtractionOptions::default())?;

// Extract invoice data
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .build();

let invoice = extractor.extract(&extracted.fragments)?;

// Access extracted fields
for field in &invoice.fields {
    println!("{}: {:?} (confidence: {:.2})",
        field.field_type.name(),
        field.field_type,
        field.confidence
    );
}
§Performance

Extraction is CPU-bound and typically completes in <100ms for standard invoices. The extractor can be safely reused across multiple pages and threads.
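To check the latency figure against your own documents, a small timing helper built on `std::time::Instant` may be enough. The helpers `timed` and `mean_duration` below are hypothetical names introduced for this sketch and are not part of the crate.

```rust
use std::time::{Duration, Instant};

/// Runs `work` once and returns its result together with the elapsed wall-clock time.
fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = work();
    (result, start.elapsed())
}

/// Runs `work` `runs` times and returns the mean duration, or `None` if `runs` is 0.
fn mean_duration(runs: u32, mut work: impl FnMut()) -> Option<Duration> {
    if runs == 0 {
        return None;
    }
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        let ((), elapsed) = timed(&mut work);
        total += elapsed;
    }
    Some(total / runs)
}
```

With the names from the example above, a single measurement could look like `let (invoice, elapsed) = timed(|| extractor.extract(&extracted.fragments));`. Averaging over several runs smooths out scheduling noise.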

pub fn extract_from_text(&self, text: &str) -> Result<InvoiceData>

Extract invoice data from plain text (convenience method for testing)

This is a convenience wrapper around extract() that creates synthetic TextFragment objects from plain text input. Primarily useful for testing and simple scenarios where you don’t have actual PDF text fragments.

Note: This method creates fragments without position information, so proximity-based scoring may be less accurate than with real PDF fragments.
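One way to picture the note above: if every synthetic fragment sits at the same coordinates, any distance-based measure between fragments collapses to zero and carries no signal. The `Fragment` type and both functions below are hypothetical; how the crate actually builds its synthetic fragments is not shown in this documentation.

```rust
/// Hypothetical fragment type; the real TextFragment fields are not shown here.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    x: f64,
    y: f64,
}

/// Builds one fragment per non-empty line, all placed at the origin.
fn synthetic_fragments(text: &str) -> Vec<Fragment> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| Fragment { text: line.to_string(), x: 0.0, y: 0.0 })
        .collect()
}

/// Euclidean distance between the positions of two fragments.
fn distance(a: &Fragment, b: &Fragment) -> f64 {
    ((a.x - b.x).powi(2) + (a.y - b.y).powi(2)).sqrt()
}
```

Because `distance` is 0.0 for every pair here, a label and its value look equally close to everything else, which is why real PDF fragments are preferable when proximity matters.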

§Arguments
  • text - Plain text string to extract invoice data from
§Returns

Returns Ok(InvoiceData) with the extracted fields, or Err if the text is empty.

§Examples
use oxidize_pdf::text::invoice::InvoiceExtractor;

let extractor = InvoiceExtractor::builder()
    .with_language("en")
    .confidence_threshold(0.7)
    .build();

let invoice_text = "Invoice Number: INV-001\nTotal: £100.00";
let result = extractor.extract_from_text(invoice_text)?;

assert!(!result.fields.is_empty());

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> Same for T

type Output = T

Should always be Self

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.