br-invoice-parser 0.1.13

A Rust library for parsing invoices and bills from PDF and XLSX files
# AGENTS.md - br-invoice-parser

**Version:** 0.1.11 | **Branch:** main

## Overview

Rust library for parsing cloud provider invoices (PDF/XLSX) into structured JSON.
- **Package:** `br-invoice-parser` | **Library crate:** `invoice_parser`
- **~3,800 lines** of Rust, 22 source files, 101 tests (29 unit + 72 integration)

## Commands

```bash
cargo test                                    # All 101 tests
cargo test ecloudvalley                       # Filter by name substring
cargo test test_hpk_azure_csp                 # Run single test
cargo clippy --all-targets --all-features -- -D warnings  # Lint (strict)
cargo run --example test_pdf -- "examples/pdf/file.pdf"   # Manual test
cargo publish                                 # Publish to crates.io
```

## Structure

```
src/
├── lib.rs                    # Public API: parse_file(), parse_pdf(), parse_xlsx()
├── models.rs                 # Invoice, LineItem, Party, Currency, DocumentFormat
├── error.rs                  # InvoiceParserError (thiserror), Result<T> alias
├── extractors/
│   ├── pdf.rs                # PDF → raw text (pdf-extract, lopdf)
│   └── xlsx.rs               # XLSX → SheetData → TSV text (calamine)
└── parsers/
    ├── format_detector.rs    # detect_format(text) → DocumentFormat (ORDER CRITICAL)
    ├── invoice.rs            # InvoiceParser orchestrator, routes to vendor parsers
    ├── common.rs             # Shared: fill_common_fields(), parse_amount(), parse_date_string()
    ├── ecloudvalley.rs       # eCloudValley AWS reseller (most complex, 20+ regex)
    ├── aws_direct.rs         # AWS Direct + Elite billing statements
    ├── microfusion.rs        # Microfusion (宏庭) Aliyun/GCP reseller
    ├── microfusion_gcp_usage.rs  # Microfusion GCP usage detail XLSX
    ├── mlytics_consolidated.rs   # Mlytics consolidated invoices (PDF/XLSX)
    ├── azure_csp.rs          # Azure CSP billing statements
    ├── aliyun_direct.rs      # Alibaba Cloud Direct
    ├── aliyun_usage_detail.rs    # Aliyun usage detail XLSX
    ├── ucloud.rs             # UCloud XLSX
    ├── lokalise.rs           # Lokalise SaaS
    ├── sentry.rs             # Sentry SaaS
    └── mux.rs                # Mux SaaS
tests/parser_tests.rs         # Integration tests (real PDF/XLSX fixtures)
examples/pdf/                  # Test fixture files
```

## Format Detection Order (CRITICAL)

**File:** `src/parsers/format_detector.rs` — Order determines correctness.

```
1. Lokalise, Sentry, Mux           (unique identifiers)
2. MlyticsConsolidated              (contains "eCloudvalley", "AWS", "CSP-AZURE")
3. AliyunUsageDetail                (TSV header match)
4. AzureCsp                         (contains "CSP-AZURE#")
5. MicrofusionGcpUsage              (TSV header match)
6. Microfusion / 宏庭科技            (before "microsoft"/"azure")
7. eCloudValley / 伊雲谷            (before "aws"/"amazon")
8. AliyunDirect                     (before generic)
9. UCloud                           (before generic)
10. AwsDirect, GoogleCloud, Azure   (generic fallbacks)
```

Reseller invoices contain upstream provider names, so a generic provider check that runs too early misidentifies them (an eCloudValley invoice, for example, also mentions AWS).
NEVER reorder without updating tests.
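
As a rough illustration of why the order matters, here is a minimal first-match-wins sketch. The `Format` enum, the function name `detect_format_sketch`, and the marker strings are assumptions made for this example; the real checks live in `format_detector.rs` and cover more formats.

```rust
/// Simplified stand-in for the crate's `DocumentFormat` (only a few variants).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Format {
    MlyticsConsolidated,
    AzureCsp,
    ECloudValleyAws,
    AwsDirect,
    Unknown,
}

/// Checks run top to bottom and the first match wins, so specific
/// reseller markers must be tested before generic provider names.
pub fn detect_format_sketch(text: &str) -> Format {
    let lower = text.to_lowercase();
    // Marker strings below are hypothetical, chosen for illustration only.
    if lower.contains("mlytics") {
        return Format::MlyticsConsolidated;
    }
    if lower.contains("csp-azure#") {
        return Format::AzureCsp;
    }
    if lower.contains("ecloudvalley") || text.contains("伊雲谷") {
        return Format::ECloudValleyAws;
    }
    // Generic fallback: would also match reseller invoices if it ran earlier.
    if lower.contains("aws") || lower.contains("amazon") {
        return Format::AwsDirect;
    }
    Format::Unknown
}
```

If the generic `aws`/`amazon` check were moved to the top, the eCloudValley input in this sketch would come back as `AwsDirect`, which is the kind of misidentification the ordering is meant to prevent.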

## Code Style

### Imports
```rust
use crate::models::{Currency, DocumentFormat, Invoice, InvoiceType, LineItem, Party};
use crate::parsers::common::parse_amount;
use chrono::NaiveDate;          // external crates after crate imports
use lazy_static::lazy_static;
use regex::Regex;
```

### Regex: always lazy_static
```rust
lazy_static! {
    // "$ 22,169.00 5% $ 21,061.00"
    static ref PATTERN: Regex = Regex::new(r"...").unwrap();
}
```
Comments showing the exact input format a regex matches are required.

### Multi-language: match both Chinese and English punctuation
```rust
Regex::new(r"account\s*no[:：]")  // [:：] matches fullwidth/ASCII colons
```
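
The same idea can be shown without the `regex` crate. The sketch below is a std-only stand-in, not the crate's code: `value_after_label` is a hypothetical helper that accepts either an ASCII colon or a fullwidth colon after a label and returns `Option<String>` in the same spirit as the extractors.

```rust
/// Returns the text after `label` when the label is followed by an ASCII
/// colon (':') or a fullwidth colon ('：'); otherwise returns `None`.
/// Matching is case-sensitive to keep the sketch short.
pub fn value_after_label(line: &str, label: &str) -> Option<String> {
    let idx = line.find(label)?;
    // Skip the label itself, then any whitespace before the colon.
    let rest = line[idx + label.len()..].trim_start();
    let value = rest
        .strip_prefix(':')
        .or_else(|| rest.strip_prefix('：'))?;
    let value = value.trim();
    if value.is_empty() {
        None
    } else {
        Some(value.to_string())
    }
}
```

Both `value_after_label("Account No: 1234", "Account No")` and the fullwidth variant `"Account No：1234"` yield the same value, while a line with no colon after the label yields `None`.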

### Extraction pattern
```rust
fn extract_X(text: &str) -> Option<String> {
    PATTERN.captures(text)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str().trim().to_string())
}
```

### Error handling
- `thiserror` for `InvoiceParserError` variants
- `pub type Result<T> = std::result::Result<T, InvoiceParserError>;`
- Parsers return `Invoice` (not Result) — always produce best-effort result
- `Option<T>` for fields that may not be extractable
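
A minimal sketch of the best-effort convention, assuming a cut-down `InvoiceSketch` struct and hypothetical `Invoice No:` / `Customer ID:` labels (the real `Invoice` model in `models.rs` has more fields): the parser never fails, and anything it cannot find stays `None`.

```rust
/// Cut-down stand-in for the crate's `Invoice`; every field is optional.
#[derive(Debug, Default, PartialEq)]
pub struct InvoiceSketch {
    pub invoice_number: Option<String>,
    pub customer_id: Option<String>,
}

/// Always returns an invoice: unrecognised lines are ignored and
/// fields that cannot be extracted are left as `None`.
pub fn parse_best_effort(text: &str) -> InvoiceSketch {
    let mut invoice = InvoiceSketch::default();
    for line in text.lines() {
        let line = line.trim();
        // Labels are illustrative; real vendor parsers use regex patterns.
        if let Some(v) = line.strip_prefix("Invoice No:") {
            invoice.invoice_number = Some(v.trim().to_string());
        } else if let Some(v) = line.strip_prefix("Customer ID:") {
            invoice.customer_id = Some(v.trim().to_string());
        }
    }
    invoice
}
```

Callers can then decide per field how to treat a missing value instead of losing the whole invoice to one unparseable line.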

### Naming
- Parser files: `snake_case` vendor name (`ecloudvalley.rs`, `azure_csp.rs`)
- Each parser exposes `pub fn parse(text: &str) -> Invoice`
- DocumentFormat variants: `PascalCase` (`ECloudValleyAws`, `MlyticsConsolidated`)
- Regex statics: `SCREAMING_SNAKE_CASE` (`BILLING_CYCLE_PATTERN`)

### Line item deduplication
```rust
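// Key on description, amount, and detail position so repeated rows are skipped.
let key = format!("{}:{}:{}", description, amount_str, detail_pos);
// `insert` returns false when the key was already present.
if !seen.insert(key) { continue; }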
```
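
For context, the fragment above might sit inside a loop like the following. This is a self-contained sketch: the `RawItem` tuple and the `dedup_items` function are assumptions for illustration, not the crate's types.

```rust
use std::collections::HashSet;

/// (description, amount string, position of the detail line in the text)
pub type RawItem = (String, String, usize);

/// Keeps the first occurrence of each (description, amount, position) key
/// and drops later repeats, preserving the original order.
pub fn dedup_items(items: Vec<RawItem>) -> Vec<RawItem> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut unique = Vec::new();
    for (description, amount_str, detail_pos) in items {
        let key = format!("{}:{}:{}", description, amount_str, detail_pos);
        // `insert` returns false if the key was already in the set.
        if !seen.insert(key) {
            continue;
        }
        unique.push((description, amount_str, detail_pos));
    }
    unique
}
```

Because the position is part of the key, two genuinely separate charges with the same description and amount survive, while the same row extracted twice does not.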

## Field Mapping by Vendor

| Vendor | invoice_number | account_name | customer_id | billing_period |
|--------|---------------|--------------|-------------|----------------|
| eCloudValley | None | Account alias | Account No | From billing cycle |
| Microfusion | Invoice No | Customer ID + Name | Customer ID | From text |
| AliyunDirect | Invoice No | Customer Name | Customer Name | From text |
| UCloud | None | From filename | From filename | From text |
| MlyticsConsolidated | MLT number | "Mlytics Limited" | Customer ID | From billing period |
| AzureCsp | None | From filename | CSP-AZURE# ID | From filename |
| AwsDirect/Elite | Account Number | None | Account Number | From billing statement |
| AliyunUsageDetail | None | None | User ID | From billing cycle |
| MicrofusionGcpUsage | None | Billing acct name | Billing acct ID | None |

## Adding a New Vendor

1. Create `src/parsers/newvendor.rs` with `pub fn parse(text: &str) -> Invoice`
2. Add `DocumentFormat::NewVendor` in `models.rs` (both enum + Display impl)
3. Add detection in `format_detector.rs`. **ORDER MATTERS**
4. Register in `mod.rs` and `invoice.rs` (import + match arm)
5. Add test fixtures to `examples/pdf/`
6. Write integration tests in `tests/parser_tests.rs`
7. Run `cargo test && cargo clippy --all-targets --all-features -- -D warnings`

## Anti-Patterns

- **NEVER** assume PDF text is well-formed (numbers concatenate, words split)
- **NEVER** skip line item deduplication
- **NEVER** suppress warnings without justification
- **ALWAYS** use billing cycle end date as `invoice_date` (not payment due date)
- **ALWAYS** filter summary headers from line items
- **ALWAYS** handle `$` before and after numbers (`$100` and `100$`)
- **ALWAYS** round monetary amounts to 2 decimal places
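
The last two rules can be sketched together. The function below is a hypothetical, std-only illustration using `f64`; it is not the crate's `parse_amount()`, and code that needs exact decimal arithmetic would likely reach for `rust_decimal` instead.

```rust
/// Parses amounts such as "$100", "100$", or "$ 22,169.00" and rounds the
/// result to 2 decimal places. Returns `None` for text that is not a number.
pub fn parse_amount_sketch(raw: &str) -> Option<f64> {
    // Drop currency signs, thousands separators, and whitespace,
    // regardless of whether the `$` comes before or after the digits.
    let cleaned: String = raw
        .chars()
        .filter(|&c| c != '$' && c != ',' && !c.is_whitespace())
        .collect();
    if cleaned.is_empty() {
        return None;
    }
    let value: f64 = cleaned.parse().ok()?;
    // Reject "inf" and "NaN", which `f64::from_str` would otherwise accept.
    if !value.is_finite() {
        return None;
    }
    Some((value * 100.0).round() / 100.0)
}
```

Stripping the `$` wherever it appears keeps prefix and suffix forms on the same code path, so `$100` and `100$` cannot drift apart as the parser evolves.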

## Dependencies

| Crate | Purpose |
|-------|---------|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date handling |
| serde, serde_json | Serialization |
| thiserror | Error types |
| rust_decimal | Precise decimal arithmetic |