br-invoice-parser 0.1.9

# AGENTS.md - br-invoice-parser

**Generated:** 2026-02-04 | **Commit:** b1d6b84 | **Branch:** main

## Overview

Rust library for parsing cloud provider invoices (PDF/XLSX) into structured JSON. Supports AWS, Alibaba Cloud, UCloud, and resellers (eCloudValley, Microfusion).

- **Package:** `br-invoice-parser` → **Library:** `invoice_parser`
- **~2,600 lines** of Rust across 18 source files

## Structure

```
br-invoice-parser/
├── src/
│   ├── lib.rs              # Public API + re-exports
│   ├── models.rs           # Invoice, LineItem, Party, Currency
│   ├── error.rs            # InvoiceParserError (thiserror)
│   ├── extractors/         # PDF/XLSX text extraction
│   └── parsers/            # Vendor-specific parsing (see AGENTS.md)
├── tests/parser_tests.rs   # Integration tests (43 tests, real PDF fixtures)
└── examples/pdf/           # Test fixtures (PDF/XLSX invoices)
```

## Where to Look

| Task | Location |
|------|----------|
| Add new vendor | `src/parsers/` + `format_detector.rs` |
| Fix line item parsing | `src/parsers/<vendor>.rs` regex patterns |
| Add new field | `src/models.rs` → update parsers |
| Debug format detection | `src/parsers/format_detector.rs` |
| Add test fixture | `examples/pdf/` + `tests/parser_tests.rs` |

## Critical Rules

### 1. Format Detection Order is CRITICAL

**File:** `src/parsers/format_detector.rs`

```
1. Microfusion BEFORE Microsoft/Azure
2. eCloudValley BEFORE AWS
3. Aliyun BEFORE generic
```

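Reseller invoices also contain the upstream provider's name, so checking the provider first misidentifies a reseller invoice as a direct one (for example, an eCloudValley invoice detected as AWS).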
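A minimal sketch of first-match-wins detection, to show why the order matters. The enum variants, marker strings, and `detect_format` signature are assumptions made for illustration; the real checks live in `format_detector.rs` and may look quite different.

```rust
/// Hypothetical mirror of the formats named above (not the crate's real enum).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum DocumentFormat {
    Microfusion,
    Azure,
    ECloudValley,
    Aws,
    AliyunDirect,
    Unknown,
}

/// Ordered (marker, format) pairs; the first matching entry wins.
/// The marker strings are illustrative guesses, not the real patterns.
const DETECTION_ORDER: &[(&str, DocumentFormat)] = &[
    ("microfusion", DocumentFormat::Microfusion), // reseller before Microsoft/Azure
    ("microsoft", DocumentFormat::Azure),
    ("ecloudvalley", DocumentFormat::ECloudValley), // reseller before AWS
    ("amazon web services", DocumentFormat::Aws),
    ("aliyun", DocumentFormat::AliyunDirect), // before the generic fallback
];

/// Returns the first format whose marker appears in the text (case-insensitive).
pub fn detect_format(text: &str) -> DocumentFormat {
    let lower = text.to_lowercase();
    DETECTION_ORDER
        .iter()
        .find(|(marker, _)| lower.contains(*marker))
        .map(|(_, format)| *format)
        .unwrap_or(DocumentFormat::Unknown)
}
```

With this shape, an invoice mentioning both "eCloudValley" and "Amazon Web Services" resolves to the reseller. Swapping the two table entries would silently turn it into `Aws`.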

### 2. PDF Extraction Produces Malformed Data

Extracted numbers can be concatenated (`"$ 0.105650.0000000184"`), split across lines, or run together without whitespace. The eCloudValley parser uses 12+ regex patterns to handle these cases.
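As an illustration of the concatenation problem only, here is a std-only sketch that splits a token containing two fused decimals. It is not the eCloudValley parser's regex approach. The function name is hypothetical, and the split rule rests on a stated assumption: the second number has a single-digit integer part.

```rust
/// Splits a token holding two fused decimals, e.g. "0.105650.0000000184".
/// ASSUMPTION (illustrative only): the second number has a single-digit
/// integer part, so it starts one character before the last '.'.
pub fn split_fused_decimals(token: &str) -> Option<(&str, &str)> {
    // Only ASCII digits and dots are expected in a numeric token.
    if !token.chars().all(|c| c.is_ascii_digit() || c == '.') {
        return None;
    }
    // Exactly two dots suggests two decimals were concatenated.
    if token.matches('.').count() != 2 {
        return None;
    }
    let last_dot = token.rfind('.')?;
    let split_at = last_dot.checked_sub(1)?;
    let (first, second) = token.split_at(split_at);
    // Both halves must parse as numbers on their own.
    if first.parse::<f64>().is_ok() && second.parse::<f64>().is_ok() {
        Some((first, second))
    } else {
        None
    }
}
```

A well-formed token such as `"12.50"` returns `None`, so callers can fall through to the normal parsing path.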

### 3. Field Mapping by Vendor

| Vendor | invoice_number | account_name | customer_id |
|--------|---------------|--------------|-------------|
| eCloudValley | None | Account alias | Account No |
| Microfusion | Invoice No | Customer ID + Name | Customer ID |
| AliyunDirect | Invoice No | Customer Name | Customer Name |
| UCloud | None | From filename | From filename |

### 4. UCloud Extracts customer_id from Filename

`parse_xlsx()` matches the filename against `^(\d+)_` and uses the captured digits as both `customer_id` and `account_name`. This only works via `parse_file()`; `parse_text()` has no filename to inspect.
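A std-only sketch of the same prefix rule, kept free of the `regex` crate so it stands alone. The function name and the example path in the usage note are hypothetical, and unlike `\d` in the `regex` crate this version accepts ASCII digits only.

```rust
use std::path::Path;

/// Std-only approximation of matching `^(\d+)_` against the file name.
/// Returns the leading digits if the name starts with "<digits>_".
pub fn customer_id_from_filename(path: &str) -> Option<String> {
    // Look only at the final path component, not the directories.
    let name = Path::new(path).file_name()?.to_str()?;
    // Everything before the first underscore must be one or more digits.
    let (prefix, _rest) = name.split_once('_')?;
    if !prefix.is_empty() && prefix.chars().all(|c| c.is_ascii_digit()) {
        Some(prefix.to_string())
    } else {
        None
    }
}
```

For a path like `examples/pdf/12345_2025-01.xlsx` this yields `Some("12345")`. A name without a numeric prefix yields `None`, which corresponds to the fields staying unset.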

## Conventions

### Regex: lazy_static

```rust
lazy_static! {
    static ref PATTERN: Regex = Regex::new(r"...").unwrap();
}
```

### Multi-language Support

```rust
Regex::new(r"account\s*no[：:]")  // [：:] matches Chinese/English colons
```

### Extraction Pattern

```rust
fn extract_X(text: &str) -> Option<String> {
    PATTERN.captures(text)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str().trim().to_string())
}
```

## Anti-Patterns

- **NEVER** reorder format detection checks without updating tests
- **NEVER** assume PDF text is well-formed (always handle malformed data)
- **NEVER** skip deduplication for line items (use composite keys)
- **ALWAYS** use billing cycle end date as invoice_date (not payment due date)
- **ALWAYS** filter summary headers ("AWS Service Charge", "AWS Services Pricing")
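The composite-key deduplication rule could be sketched as follows. The `LineItem` fields and the choice of key (description, region, amount) are assumptions for illustration; the real struct is defined in `src/models.rs` and the parsers may key on different fields.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified line item (not the crate's real `LineItem`).
#[derive(Debug, Clone, PartialEq)]
pub struct LineItem {
    pub description: String,
    pub region: String,
    pub amount_cents: i64,
}

/// Drops repeated items, keeping the first occurrence and the original order.
/// Two items count as duplicates only if every part of the key matches.
pub fn dedup_line_items(items: Vec<LineItem>) -> Vec<LineItem> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        .filter(|item| {
            // `insert` returns false when the composite key was already seen.
            seen.insert((
                item.description.clone(),
                item.region.clone(),
                item.amount_cents,
            ))
        })
        .collect()
}
```

Keying on several fields rather than the description alone keeps legitimately distinct rows, such as the same service billed in two regions.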

## Commands

```bash
cargo test                           # All 72 tests (29 unit + 43 integration)
cargo test ecloudvalley              # Filter by name
cargo clippy --all-targets -- -D warnings   # Lint (treat warnings as errors)
cargo run --example test_pdf -- "examples/pdf/file.pdf"
```

## Adding a New Vendor

1. Create `src/parsers/newvendor.rs`
2. Add detection in `format_detector.rs` (ORDER MATTERS)
3. Add `DocumentFormat::NewVendor` in `models.rs`
4. Register in `invoice.rs`: `DocumentFormat::NewVendor => newvendor::parse(text)`
5. Add test fixtures to `examples/pdf/`
6. Write tests in `tests/parser_tests.rs`
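Steps 1, 3, and 4 might fit together roughly like this. Everything here is a simplified stand-in: the `Invoice` struct, the `String` error type, the `"Invoice No:"` label, and the flat function name `newvendor_parse` (which in the crate would be `newvendor::parse` in its own module) are assumptions, not the crate's actual definitions.

```rust
/// Hypothetical stand-ins for the types in `src/models.rs`.
#[derive(Debug, PartialEq)]
pub enum DocumentFormat {
    NewVendor, // step 3: the new variant
    Unknown,
}

#[derive(Debug, PartialEq)]
pub struct Invoice {
    pub invoice_number: Option<String>,
}

/// Step 1: the vendor parser (would live in `src/parsers/newvendor.rs`).
/// Illustrative logic: take whatever follows "Invoice No:" on its line.
pub fn newvendor_parse(text: &str) -> Result<Invoice, String> {
    let invoice_number = text
        .lines()
        .find_map(|line| line.trim().strip_prefix("Invoice No:"))
        .map(|rest| rest.trim().to_string());
    Ok(Invoice { invoice_number })
}

/// Step 4: dispatch on the detected format, shaped like the match arm above.
pub fn parse_with_format(format: DocumentFormat, text: &str) -> Result<Invoice, String> {
    match format {
        DocumentFormat::NewVendor => newvendor_parse(text),
        DocumentFormat::Unknown => Err("unsupported format".to_string()),
    }
}
```

Because the `match` is exhaustive, adding the enum variant without registering a parser fails to compile, which makes step 4 hard to forget.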

## Dependencies

| Crate | Purpose |
|-------|---------|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date handling |
| serde, serde_json | Serialization |
| thiserror | Error types |