# AGENTS.md - br-invoice-parser
## Overview
Rust library for parsing cloud provider invoices (PDF/XLSX) into structured JSON. Supports AWS, Alibaba Cloud, UCloud, and resellers (eCloudValley, Microfusion).
- **Package:** `br-invoice-parser` → **Library:** `invoice_parser`
- **~2,600 lines** Rust, 18 source files
## Structure
```
br-invoice-parser/
├── src/
│ ├── lib.rs # Public API + re-exports
│ ├── models.rs # Invoice, LineItem, Party, Currency
│ ├── error.rs # InvoiceParserError (thiserror)
│ ├── extractors/ # PDF/XLSX text extraction
│ └── parsers/ # Vendor-specific parsing (see AGENTS.md)
├── tests/parser_tests.rs # Integration tests (43 tests, real PDF fixtures)
└── examples/pdf/ # Test fixtures (PDF/XLSX invoices)
```
## Where to Look
| Add new vendor | `src/parsers/` + `format_detector.rs` |
| Fix line item parsing | `src/parsers/<vendor>.rs` regex patterns |
| Add new field | `src/models.rs` → update parsers |
| Debug format detection | `src/parsers/format_detector.rs` |
| Add test fixture | `examples/pdf/` + `tests/parser_tests.rs` |
## Critical Rules
### 1. Format Detection Order is CRITICAL
**File:** `src/parsers/format_detector.rs`
```
1. Microfusion BEFORE Microsoft/Azure
2. eCloudValley BEFORE AWS
3. Aliyun BEFORE generic
```
Reseller invoices contain upstream provider names. Wrong order = misidentification.
### 2. PDF Extraction Produces Malformed Data
Numbers can be concatenated (`"$ 0.105650.0000000184"`), split across lines, or missing whitespace. eCloudValley parser has 12+ regex patterns to handle this.
### 3. Field Mapping by Vendor
| Vendor | invoice_number | account_name | customer_id |
|--------|---------------|--------------|-------------|
| eCloudValley | None | Account alias | Account No |
| Microfusion | Invoice No | Customer ID + Name | Customer ID |
| AliyunDirect | Invoice No | Customer Name | Customer Name |
| UCloud | None | From filename | From filename |
### 4. UCloud Extracts customer_id from Filename
`parse_xlsx()` extracts `^(\d+)_` from filename → `customer_id` + `account_name`. Only works via `parse_file()`, not `parse_text()`.
## Conventions
### Regex: lazy_static
```rust
lazy_static! {
static ref PATTERN: Regex = Regex::new(r"...").unwrap();
}
```
### Multi-language Support
```rust
Regex::new(r"account\s*no[::]") // [::] matches Chinese/English colons
```
### Extraction Pattern
```rust
fn extract_X(text: &str) -> Option<String> {
PATTERN.captures(text)
.and_then(|caps| caps.get(1))
.map(|m| m.as_str().trim().to_string())
}
```
## Anti-Patterns
- **NEVER** reorder format detection checks without updating tests
- **NEVER** assume PDF text is well-formed (always handle malformed data)
- **NEVER** skip deduplication for line items (use composite keys)
- **ALWAYS** use billing cycle end date as invoice_date (not payment due date)
- **ALWAYS** filter summary headers ("AWS Service Charge", "AWS Services Pricing")
## Commands
```bash
cargo test # All 72 tests (29 unit + 43 integration)
cargo test ecloudvalley # Filter by name
cargo clippy --all-targets # Lint (treat warnings as errors)
cargo run --example test_pdf -- "examples/pdf/file.pdf"
```
## Adding a New Vendor
1. Create `src/parsers/newvendor.rs`
2. Add detection in `format_detector.rs` (ORDER MATTERS)
3. Add `DocumentFormat::NewVendor` in `models.rs`
4. Register in `invoice.rs`: `DocumentFormat::NewVendor => newvendor::parse(text)`
5. Add test fixtures to `examples/pdf/`
6. Write tests in `tests/parser_tests.rs`
## Dependencies
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date handling |
| serde, serde_json | Serialization |
| thiserror | Error types |