# AGENTS.md - br-invoice-parser
## Overview
Rust library for parsing cloud provider invoices (PDF/XLSX) into structured JSON.
- **Package:** `br-invoice-parser` → **Library crate:** `invoice_parser`
- **~3,800 lines** Rust, 22 source files, 101 tests (29 unit + 72 integration)
## Commands
```bash
cargo test # All 101 tests
cargo test ecloudvalley # Filter by name substring
cargo test test_hpk_azure_csp # Run single test
cargo clippy --all-targets --all-features -- -D warnings # Lint (strict)
cargo run --example test_pdf -- "examples/pdf/file.pdf" # Manual test
cargo publish # Publish to crates.io
```
## Structure
```
src/
├── lib.rs  # Public API: parse_file(), parse_pdf(), parse_xlsx()
├── models.rs  # Invoice, LineItem, Party, Currency, DocumentFormat
├── error.rs  # InvoiceParserError (thiserror), Result<T> alias
├── extractors/
│   ├── pdf.rs  # PDF → raw text (pdf-extract, lopdf)
│   └── xlsx.rs  # XLSX → SheetData → TSV text (calamine)
└── parsers/
    ├── format_detector.rs  # detect_format(text) → DocumentFormat (ORDER CRITICAL)
    ├── invoice.rs  # InvoiceParser orchestrator, routes to vendor parsers
    ├── common.rs  # Shared: fill_common_fields(), parse_amount(), parse_date_string()
    ├── ecloudvalley.rs  # eCloudValley AWS reseller (most complex, 20+ regex)
    ├── aws_direct.rs  # AWS Direct + Elite billing statements
    ├── microfusion.rs  # Microfusion (宏庭) Aliyun/GCP reseller
    ├── microfusion_gcp_usage.rs  # Microfusion GCP usage detail XLSX
    ├── mlytics_consolidated.rs  # Mlytics consolidated invoices (PDF/XLSX)
    ├── azure_csp.rs  # Azure CSP billing statements
    ├── aliyun_direct.rs  # Alibaba Cloud Direct
    ├── aliyun_usage_detail.rs  # Aliyun usage detail XLSX
    ├── ucloud.rs  # UCloud XLSX
    ├── lokalise.rs  # Lokalise SaaS
    ├── sentry.rs  # Sentry SaaS
    └── mux.rs  # Mux SaaS
tests/parser_tests.rs  # Integration tests (real PDF/XLSX fixtures)
examples/pdf/  # Test fixture files
```
## Format Detection Order (CRITICAL)
**File:** `src/parsers/format_detector.rs` — Order determines correctness.
```
1. Lokalise, Sentry, Mux (unique identifiers)
2. MlyticsConsolidated (contains "eCloudvalley", "AWS", "CSP-AZURE")
3. AliyunUsageDetail (TSV header match)
4. AzureCsp (contains "CSP-AZURE#")
5. MicrofusionGcpUsage (TSV header match)
6. Microfusion / 宏庭科技 (before "microsoft"/"azure")
7. eCloudValley / 伊雲谷 (before "aws"/"amazon")
8. AliyunDirect (before generic)
9. UCloud (before generic)
10. AwsDirect, GoogleCloud, Azure (generic fallbacks)
```
Reseller invoices contain upstream provider names (an eCloudValley invoice also mentions "AWS"), so a generic check that runs before the reseller check misidentifies them.
NEVER reorder without updating tests.
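
A minimal sketch of why the order matters, using only the standard library. The `Format` enum and `detect` function are hypothetical stand-ins, not the real `DocumentFormat` or `detect_format`, and the matched substrings are assumptions chosen for illustration.

```rust
/// Hypothetical, reduced stand-in for `DocumentFormat`.
#[derive(Debug, PartialEq)]
enum Format {
    ECloudValley,
    AwsDirect,
    Unknown,
}

/// Checks run top to bottom; the first match wins.
fn detect(text: &str) -> Format {
    let lower = text.to_lowercase();

    // Reseller first: its invoices also mention "AWS" / "Amazon".
    if lower.contains("ecloudvalley") || text.contains("伊雲谷") {
        return Format::ECloudValley;
    }
    // Generic fallback last.
    if lower.contains("aws") || lower.contains("amazon") {
        return Format::AwsDirect;
    }
    Format::Unknown
}
```

Swapping the two `if` blocks would classify every eCloudValley invoice that mentions AWS as `AwsDirect`, which is the failure mode the ordering guards against.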
## Code Style
### Imports
```rust
use crate::models::{Currency, DocumentFormat, Invoice, InvoiceType, LineItem, Party};
use crate::parsers::common::parse_amount;
use chrono::NaiveDate; // external crates after crate imports
use lazy_static::lazy_static;
use regex::Regex;
```
### Regex: always lazy_static
```rust
lazy_static! {
    // "$ 22,169.00 5% $ 21,061.00"
    static ref PATTERN: Regex = Regex::new(r"...").unwrap();
}
```
Comments showing the exact input format a regex matches are required.
### Multi-language: match both Chinese and English punctuation
```rust
Regex::new(r"account\s*no[:：]") // [:：] matches ASCII and fullwidth colons
```
### Extraction pattern
```rust
// `extract_X` is a placeholder name; use snake_case, e.g. `extract_account_no`.
fn extract_X(text: &str) -> Option<String> {
    PATTERN
        .captures(text)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str().trim().to_string())
}
```
### Error handling
- `thiserror` for `InvoiceParserError` variants
- `pub type Result<T> = std::result::Result<T, InvoiceParserError>;`
- Parsers return `Invoice` (not Result) — always produce best-effort result
- `Option<T>` for fields that may not be extractable
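
A minimal, std-only sketch of the best-effort contract. `InvoiceSketch`, `value_after`, and the label strings are hypothetical; the real `Invoice` has more fields and the real parsers use regexes.

```rust
/// Hypothetical, reduced stand-in for `Invoice`.
#[derive(Debug, Default, PartialEq)]
struct InvoiceSketch {
    invoice_number: Option<String>,
    customer_id: Option<String>,
}

/// Returns the trimmed text after `label` from the first line that has a
/// non-empty value following it.
fn value_after(text: &str, label: &str) -> Option<String> {
    text.lines().find_map(|line| {
        let (_, rest) = line.split_once(label)?;
        let value = rest.trim();
        if value.is_empty() {
            None
        } else {
            Some(value.to_string())
        }
    })
}

/// Never fails: fields that cannot be extracted stay `None`.
fn parse(text: &str) -> InvoiceSketch {
    InvoiceSketch {
        invoice_number: value_after(text, "Invoice No:"),
        customer_id: value_after(text, "Customer ID:"),
    }
}
```

Calling `parse("")` still returns a value, with every field `None`; callers decide whether a partial result is usable.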
### Naming
- Parser files: `snake_case` vendor name (`ecloudvalley.rs`, `azure_csp.rs`)
- Each parser exposes `pub fn parse(text: &str) -> Invoice`
- DocumentFormat variants: `PascalCase` (`ECloudValleyAws`, `MlyticsConsolidated`)
- Regex statics: `SCREAMING_SNAKE_CASE` (`BILLING_CYCLE_PATTERN`)
### Line item deduplication
```rust
let key = format!("{}:{}:{}", description, amount_str, detail_pos);
if seen.contains(&key) { continue; }
seen.insert(key);
```
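
The fragment above, completed into a self-contained sketch. The `RawItem` tuple and the `dedup_items` function are assumptions for illustration; the real parsers work on `LineItem` values inside their extraction loops.

```rust
use std::collections::HashSet;

/// Hypothetical raw line item: (description, amount as printed, position in the text).
type RawItem = (String, String, usize);

/// Keeps the first occurrence of each (description, amount, position) key.
fn dedup_items(items: Vec<RawItem>) -> Vec<RawItem> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for (description, amount_str, detail_pos) in items {
        let key = format!("{}:{}:{}", description, amount_str, detail_pos);
        // `insert` returns false when the key was already present.
        if !seen.insert(key) {
            continue;
        }
        unique.push((description, amount_str, detail_pos));
    }
    unique
}
```

Because the position is part of the key, two identical charges at different positions are both kept; only a repeat of the same match is dropped.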
## Field Mapping by Vendor
| Vendor | Invoice number | Customer name | Account ID | Invoice date |
|---|---|---|---|---|
| eCloudValley | None | Account alias | Account No | From billing cycle |
| Microfusion | Invoice No | Customer ID + Name | Customer ID | From text |
| AliyunDirect | Invoice No | Customer Name | Customer Name | From text |
| UCloud | None | From filename | From filename | From text |
| MlyticsConsolidated | MLT number | "Mlytics Limited" | Customer ID | From billing period |
| AzureCsp | None | From filename | CSP-AZURE# ID | From filename |
| AwsDirect/Elite | Account Number | None | Account Number | From billing statement |
| AliyunUsageDetail | None | None | User ID | From billing cycle |
| MicrofusionGcpUsage | None | Billing acct name | Billing acct ID | None |
## Adding a New Vendor
1. Create `src/parsers/newvendor.rs` with `pub fn parse(text: &str) -> Invoice`
2. Add `DocumentFormat::NewVendor` in `models.rs` (both enum + Display impl)
3. Add detection in `format_detector.rs` — **ORDER MATTERS**
4. Register in `mod.rs` and `invoice.rs` (import + match arm)
5. Add test fixtures to `examples/pdf/`
6. Write integration tests in `tests/parser_tests.rs`
7. Run `cargo test && cargo clippy --all-targets --all-features -- -D warnings`
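
Steps 1, 2, and 4 can be sketched in one file as follows. All names here (`DocumentFormatSketch`, `InvoiceStub`, `parse_newvendor`, `route`) are hypothetical stand-ins; the real types live in `models.rs` and the real routing in `invoice.rs`.

```rust
use std::fmt;

/// Hypothetical, reduced stand-in for `DocumentFormat`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DocumentFormatSketch {
    NewVendor,
    Unknown,
}

/// Hypothetical, reduced stand-in for `Invoice`.
#[derive(Debug, Default, PartialEq)]
struct InvoiceStub {
    vendor: Option<String>,
}

// Step 2: every new variant also needs a `Display` arm.
impl fmt::Display for DocumentFormatSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            DocumentFormatSketch::NewVendor => "NewVendor",
            DocumentFormatSketch::Unknown => "Unknown",
        };
        write!(f, "{}", name)
    }
}

// Step 1: the vendor parser (in the crate this would be
// `pub fn parse` inside `src/parsers/newvendor.rs`).
fn parse_newvendor(_text: &str) -> InvoiceStub {
    InvoiceStub {
        vendor: Some("NewVendor".to_string()),
    }
}

// Step 4: the orchestrator routes each format to its parser.
fn route(format: DocumentFormatSketch, text: &str) -> InvoiceStub {
    match format {
        DocumentFormatSketch::NewVendor => parse_newvendor(text),
        DocumentFormatSketch::Unknown => InvoiceStub::default(),
    }
}
```

Keeping both `match` expressions exhaustive (no `_` arm) means the compiler flags any place that was missed when a variant is added.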
## Anti-Patterns
- **NEVER** assume PDF text is well-formed (numbers concatenate, words split)
- **NEVER** skip line item deduplication
- **NEVER** suppress warnings without justification
- **ALWAYS** use billing cycle end date as `invoice_date` (not payment due date)
- **ALWAYS** filter summary headers from line items
- **ALWAYS** handle `$` before and after numbers (`$100` and `100$`)
- **ALWAYS** round monetary amounts to 2 decimal places
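
A std-only sketch of the `$` handling and rounding rules. `parse_amount_sketch` is hypothetical and is not the `parse_amount()` in `common.rs`; it uses `f64` for brevity, whereas the crate lists `rust_decimal` for precise arithmetic.

```rust
/// Parses a printed amount such as "$100", "100$" or "$ 22,169.00" and
/// rounds it to 2 decimal places. Returns `None` for non-numeric input.
fn parse_amount_sketch(raw: &str) -> Option<f64> {
    // Drop currency signs, thousands separators and whitespace wherever they appear.
    let cleaned: String = raw
        .chars()
        .filter(|c| !matches!(c, '$' | ',') && !c.is_whitespace())
        .collect();

    let value: f64 = cleaned.parse().ok()?;
    // `str::parse::<f64>` accepts "inf" and "NaN"; reject those explicitly.
    if !value.is_finite() {
        return None;
    }
    Some((value * 100.0).round() / 100.0)
}
```

Stripping `$` by filtering characters, instead of trimming a prefix, is what lets the same code accept the sign on either side of the number.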
## Dependencies
| Crate | Purpose |
|---|---|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date handling |
| serde, serde_json | Serialization |
| thiserror | Error types |
| rust_decimal | Precise decimal arithmetic |