br-invoice-parser 0.1.8

A Rust library for parsing invoices and bills from PDF and XLSX files
# AGENTS.md - br-invoice-parser

> AI Agent Knowledge Base for the br-invoice-parser Rust library

## Project Overview

**br-invoice-parser** is a Rust library for parsing cloud provider invoices (PDF/XLSX) into structured data. It supports multiple vendors including AWS, Alibaba Cloud, UCloud, and their resellers (eCloudValley, Microfusion).

- **Package name:** `br-invoice-parser`
- **Library name:** `invoice_parser` (users import with `use invoice_parser::*`)
- **Edition:** Rust 2021
- **Total:** ~2,400 lines of Rust code

## Architecture

### Two-Layer Design

```
┌─────────────────────────────────────────────────────────┐
│                    Public API (lib.rs)                  │
│  parse_file() | parse_pdf() | parse_xlsx() | parse_text()│
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                  Extractors Layer                        │
│         PDF → raw text    |    XLSX → cell data         │
│         (pdf.rs)          |    (xlsx.rs)                │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                   Parsers Layer                          │
│  format_detector.rs → routes to vendor-specific parser  │
│  ┌────────────┬──────────────┬─────────────────┬───────┐│
│  │ aws_direct │ ecloudvalley │ microfusion     │ aliyun││
│  │            │(AWS reseller)│(Aliyun reseller)│       ││
│  └────────────┴──────────────┴─────────────────┴───────┘│
│                    common.rs (shared utilities)          │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│                   Models (models.rs)                     │
│  Invoice | LineItem | Party | Currency | ParseResult    │
└─────────────────────────────────────────────────────────┘
```

### Module Structure

```
src/
├── lib.rs              # Public API + re-exports + unit tests
├── error.rs            # InvoiceParserError (thiserror)
├── models.rs           # Data structures (598 lines)
├── extractors/
│   ├── mod.rs
│   ├── pdf.rs          # PdfExtractor
│   └── xlsx.rs         # XlsxExtractor
└── parsers/
    ├── mod.rs
    ├── invoice.rs      # InvoiceParser orchestrator
    ├── format_detector.rs  # Vendor detection
    ├── common.rs       # Shared parsing utilities
    ├── aws_direct.rs   # AWS Direct invoices
    ├── ecloudvalley.rs # eCloudValley (AWS reseller)
    ├── microfusion.rs  # Microfusion (Aliyun reseller)
    ├── aliyun_direct.rs
    └── ucloud.rs       # UCloud XLSX
```

## Critical Warnings

### 1. Format Detection Order is CRITICAL

**Location:** `src/parsers/format_detector.rs`

```rust
// NEVER reorder these checks!
// Microfusion invoices may contain "Microsoft" in the footer.
// eCloudValley invoices contain "AWS" in their content.
//
// 1. Check Microfusion BEFORE Microsoft/Azure
// 2. Check eCloudValley BEFORE AWS
// 3. Check Aliyun BEFORE generic patterns
```

**Why:** Reseller invoices contain their upstream provider's name, so checking in the wrong order misidentifies the format.
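
The ordering rule can be sketched as an ordered list of checks in which the first match wins. This is a minimal illustration, not the contents of `format_detector.rs`: the `DocumentFormat` variants shown here, the marker strings, and the `detect_format` name are all assumptions made for the example.

```rust
/// Hypothetical subset of the crate's `DocumentFormat` enum (variant names assumed).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DocumentFormat {
    Microfusion,
    ECloudValley,
    Aliyun,
    AwsDirect,
    Unknown,
}

/// Returns the first matching format. Resellers are listed before the
/// upstream providers whose names also appear in their invoices.
pub fn detect_format(text: &str) -> DocumentFormat {
    let text_lower = text.to_lowercase();
    // Ordered most-specific first; the marker strings are illustrative only.
    let checks: [(&str, DocumentFormat); 4] = [
        ("microfusion", DocumentFormat::Microfusion),
        ("ecloudvalley", DocumentFormat::ECloudValley),
        ("alibaba cloud", DocumentFormat::Aliyun),
        ("amazon web services", DocumentFormat::AwsDirect),
    ];
    checks
        .iter()
        .find(|(marker, _)| text_lower.contains(*marker))
        .map(|&(_, format)| format)
        .unwrap_or(DocumentFormat::Unknown)
}
```

With this layout, moving the `"amazon web services"` entry above `"ecloudvalley"` would classify every eCloudValley invoice that mentions its upstream provider as a direct AWS invoice.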

### 2. PDF Extraction Produces Malformed Data

**Location:** `src/parsers/ecloudvalley.rs`

PDF text extraction is unreliable. Numbers can be:
- **Concatenated:** `"$ 0.105650.0000000184"` (unit_price + usage merged)
- **Split across lines:** `"$\n\n0.00000024\n usage"`
- **Missing whitespace:** `"per GB$0.023"`

**Solution:** The parser uses eight or more regex patterns to cover these variations. All of them are necessary, so do not remove any.
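
To make the concatenation case concrete, here is a regex-free sketch that splits a token containing two decimal points. It is not the crate's approach (which relies on regex patterns). The function name `split_merged_decimals` is hypothetical, and the heuristic assumes the second number has exactly one integer digit, as in `0.0000000184`.

```rust
/// Splits a token such as "0.105650.0000000184" into two numbers.
/// Assumption: the second number has a single integer digit, so the digit
/// immediately before the second '.' starts the second number.
pub fn split_merged_decimals(token: &str) -> Option<(f64, f64)> {
    let first_dot = token.find('.')?;
    // Locate the second decimal point; a token with only one dot is not merged.
    let second_dot = first_dot + 1 + token[first_dot + 1..].find('.')?;
    let split_at = second_dot - 1;
    // The split point must lie after the first dot and be an ASCII digit.
    if split_at <= first_dot || !token.as_bytes()[split_at].is_ascii_digit() {
        return None;
    }
    let (left, right) = token.split_at(split_at);
    Some((left.parse().ok()?, right.parse().ok()?))
}
```

Calling it on `"0.105650.0000000184"` yields the pair `0.10565` and `0.0000000184`. A heuristic like this cannot tell where the first number really ends if the second number has more than one integer digit, which is one reason several overlapping patterns are needed in practice.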

### 3. Deduplication is Required

Multiple regex patterns may match the same line item. Always use composite keys:

```rust
let key = format!("{}:{}:{}", description, amount_str, text_position);
```

Include the text position so that legitimate duplicates with the same description and amount are kept as separate line items.
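
A minimal sketch of this deduplication step, using a `HashSet` of composite keys. The `Candidate` struct and `dedup_candidates` function are hypothetical names introduced for illustration; only the key format comes from the snippet above.

```rust
use std::collections::HashSet;

/// Hypothetical line-item candidate produced by one regex pattern.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub description: String,
    pub amount_str: String,
    /// Offset of the match within the extracted text.
    pub text_position: usize,
}

/// Keeps the first candidate for each description:amount:position key.
pub fn dedup_candidates(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        // `insert` returns false when the key was already present.
        .filter(|c| {
            seen.insert(format!(
                "{}:{}:{}",
                c.description, c.amount_str, c.text_position
            ))
        })
        .collect()
}
```

Two patterns that match the same text at the same position collapse into one entry, while two identical charges at different positions in the document both survive.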

## Coding Conventions

### Pattern: lazy_static for Regex

```rust
use lazy_static::lazy_static;
use regex::Regex;

lazy_static! {
    static ref PATTERN_NAME: Regex = Regex::new(r"...").unwrap();
    static ref PATTERNS: Vec<Regex> = vec![...];
}
```

Each regex pattern is compiled once, on first use, and reused for every later call.

### Pattern: Option<T> for Nullable Fields

```rust
pub struct Invoice {
    pub invoice_number: Option<String>,  // May not be found
    pub total_amount: f64,               // Always required
}
```

### Pattern: Extraction Functions

```rust
pub fn extract_something(text: &str) -> Option<String> {
    for pattern in PATTERNS.iter() {
        if let Some(caps) = pattern.captures(text) {
            if let Some(m) = caps.get(1) {
                let value = m.as_str().trim();
                if !value.is_empty() {
                    return Some(value.to_string());
                }
            }
        }
    }
    None
}
```

### Pattern: Multi-Language Support

Match both the ASCII colon and the full-width Chinese colon:
```rust
Regex::new(r"(?i)account\s*no[:：]\s*(\d+)")  // [:：] matches ':' and full-width '：' (U+FF1A)
```
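
The same idea can be shown without a regex. The sketch below splits a labelled line on either colon; the helper name `value_after_colon` is hypothetical and is not part of the crate.

```rust
/// Full-width colon used in Chinese text (U+FF1A).
const FULLWIDTH_COLON: char = '\u{FF1A}';

/// Returns the trimmed value after the first ASCII or full-width colon,
/// or `None` when the line has no colon or nothing follows it.
pub fn value_after_colon(line: &str) -> Option<&str> {
    let (_, value) = line.split_once([':', FULLWIDTH_COLON])?;
    let value = value.trim();
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}
```

Writing the full-width colon as an escape (`\u{FF1A}`) keeps it from being silently normalized to an ASCII colon by editors or text-extraction tools.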

### Error Handling

```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum InvoiceParserError {
    #[error("Failed to read file: {0}")]
    FileReadError(#[from] std::io::Error),
}

pub type Result<T> = std::result::Result<T, InvoiceParserError>;
```

## Testing

### Test Organization

- **Integration tests:** `tests/parser_tests.rs` (42 tests)
- **Unit tests:** Embedded in `src/lib.rs` (23 tests)
- **Format detection tests:** `src/parsers/format_detector.rs` (6 tests)

### Test Data

Real invoice files in `examples/pdf/`:
- 13 PDF files from various vendors
- 2 XLSX files (UCloud)

### Test Naming Convention

```
test_<vendor>_<feature>_<aspect>

Examples:
- test_aws_direct_invoice_parsing
- test_ecloudvalley_invoice_line_items
- test_microfusion_invoice_customer
```

### Running Tests

```bash
cargo test                    # All tests
cargo test ecloudvalley       # Filter by name
cargo clippy --all-targets    # Lint check
```

## Adding a New Vendor Parser

1. **Create parser file:** `src/parsers/newvendor.rs`

2. **Add to format_detector.rs** (ORDER MATTERS!):
   ```rust
   // Add BEFORE any vendor whose name might appear in this invoice
   if text_lower.contains("newvendor") {
       return DocumentFormat::NewVendor;
   }
   ```

3. **Add DocumentFormat variant:** `src/models.rs`

4. **Register in invoice.rs:**
   ```rust
   DocumentFormat::NewVendor => newvendor::parse(text),
   ```

5. **Add test files:** `examples/pdf/`

6. **Write tests:** `tests/parser_tests.rs`
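
The steps above can be sketched end to end in a single file. Everything here is a stand-in: the reduced `DocumentFormat` and `Invoice` types, the `String` error type, the `parse_newvendor` function (which would be `newvendor::parse` in its own module), and the placeholder parsing logic are assumptions for illustration, not the crate's real definitions.

```rust
/// Step 3: hypothetical, reduced version of the format enum with the new variant.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DocumentFormat {
    NewVendor,
    Unknown,
}

/// Hypothetical, reduced invoice model.
#[derive(Debug, Default, PartialEq)]
pub struct Invoice {
    pub invoice_number: Option<String>,
    pub total_amount: f64,
}

/// Step 1: the vendor parser (placeholder logic only).
pub fn parse_newvendor(text: &str) -> Result<Invoice, String> {
    // Take the token that follows "Invoice #", if any.
    let invoice_number = text
        .split_once("Invoice #")
        .and_then(|(_, rest)| rest.split_whitespace().next())
        .map(str::to_string);
    Ok(Invoice { invoice_number, ..Invoice::default() })
}

/// Step 2: detection, placed before any vendor named inside these invoices.
pub fn detect_format(text: &str) -> DocumentFormat {
    let text_lower = text.to_lowercase();
    if text_lower.contains("newvendor") {
        return DocumentFormat::NewVendor;
    }
    DocumentFormat::Unknown
}

/// Step 4: dispatch from the detected format to the vendor parser.
pub fn parse_by_format(text: &str) -> Result<Invoice, String> {
    match detect_format(text) {
        DocumentFormat::NewVendor => parse_newvendor(text),
        DocumentFormat::Unknown => Err("unsupported invoice format".to_string()),
    }
}
```

Because the `match` is exhaustive, adding the enum variant without registering a parser fails to compile, which helps catch a missed step 4.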

## Common Gotchas

### Date Extraction

Take the invoice date from the **Billing Cycle end date**, not the Payment Due Date:
```
Billing Cycle: December 1 - December 31, 2025  →  invoice_date = 2025-12-31
```
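
A std-only sketch of pulling the end date out of a line in that shape. The function name `billing_cycle_end` and the tuple return type are assumptions for the example. It also assumes English month names and the exact `Month D - Month D, YYYY` layout shown above.

```rust
const MONTHS: [&str; 12] = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
];

/// Parses "Billing Cycle: December 1 - December 31, 2025" into
/// (year, month, day) of the cycle's end date.
pub fn billing_cycle_end(line: &str) -> Option<(i32, u32, u32)> {
    let (_, range) = line.split_once(':')?;
    // Everything after the last '-' is the end date, e.g. " December 31, 2025".
    let (_, end) = range.rsplit_once('-')?;
    let (month_day, year) = end.trim().rsplit_once(',')?;
    let (month_name, day) = month_day.trim().split_once(' ')?;
    let month = MONTHS
        .iter()
        .position(|m| m.eq_ignore_ascii_case(month_name))? as u32
        + 1;
    Some((year.trim().parse().ok()?, month, day.trim().parse().ok()?))
}
```

For the example line above this returns `(2025, 12, 31)`. In the crate itself the result would more likely be a `chrono` date, which would also validate the day against the month.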

### Line Item Filtering

Skip summary headers that look like line items:
```rust
if description == "AWS Service Charge" || description == "AWS Services Pricing" {
    continue;
}
```

### Customer Name Extraction

Avoid false positives:
```rust
// Lowercase once so both keyword checks are case-insensitive.
let lower = name.to_lowercase();
if !lower.starts_with("billing")
    && !lower.contains("invoice")
    && !name.contains(':')
{
    // Valid customer name
}
```

### Zero Division Protection

```rust
if subtotal > 0.0 {
    discount_rate = (discount.abs() / subtotal) * 100.0;
}
```

## Dependencies

| Crate | Purpose |
|-------|---------|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date/time handling |
| serde, serde_json | Serialization |
| thiserror | Error handling |
| rust_decimal | Financial precision |

## Quick Reference

### Parse an Invoice

```rust
use invoice_parser::{parse_file, parse_pdf, parse_text};

// Auto-detect format
let result = parse_file("invoice.pdf")?;

// Or specific format
let result = parse_pdf("invoice.pdf")?;

// Or from text
let invoice = parse_text("Invoice #123\nTotal: $100.00")?;
```

### Access Results

```rust
for invoice in result.invoices {
    println!("Number: {:?}", invoice.invoice_number);
    println!("Total: {}", invoice.total_amount);
    println!("Currency: {:?}", invoice.currency);
    
    for item in &invoice.line_items {
        println!("  {} - ${}", item.description, item.amount);
    }
}
```

### Validate Invoice

```rust
let validation = invoice.validate();
if !validation.is_valid {
    for warning in &validation.warnings {
        eprintln!("Warning: {}", warning);
    }
}
```