# AGENTS.md - br-invoice-parser
## Overview
Rust library for parsing cloud provider invoices (PDF/XLSX) into structured JSON.
- **Package:** `br-invoice-parser` → **Library crate:** `invoice_parser`
- **~7,900 lines** Rust, 42 source files, 115 tests (29 unit + 86 integration)
## Commands
```bash
cargo test # All 115 tests
cargo test ecloudvalley # Filter by name substring
cargo test test_hpk_azure_csp # Run single test
cargo clippy --all-targets --all-features -- -D warnings # Lint (strict)
cargo run --example test_pdf -- "examples/pdf/file.pdf" # Manual test
cargo publish # Publish to crates.io
```
## Structure
```
src/
├── lib.rs # Public API: parse_file(), parse_pdf(), parse_xlsx()
├── models.rs # Invoice, LineItem, Party, Currency, DocumentFormat
├── error.rs # InvoiceParserError (thiserror), Result<T> alias
├── extractors/
│   ├── pdf.rs # PDF → raw text (pdf-extract, lopdf)
│   └── xlsx.rs # XLSX → SheetData → TSV text (calamine)
└── parsers/
    ├── format_detector.rs # detect_format(text) → DocumentFormat (ORDER CRITICAL)
    ├── invoice.rs # InvoiceParser orchestrator, routes to vendor parsers
    ├── common.rs # Shared: fill_common_fields(), parse_amount(), extract_reseller_line_items()
    ├── ecloudvalley.rs # eCloudValley AWS reseller (most complex, 20+ regex)
    ├── aws_direct.rs # AWS Direct + Elite + reseller billing statements
    ├── microfusion.rs # Microfusion (宏庭) Aliyun/GCP reseller
    ├── microfusion_gcp_usage.rs # Microfusion GCP usage detail XLSX
    ├── mlytics_consolidated.rs # Mlytics consolidated invoices (PDF/XLSX)
    ├── azure_csp.rs # Azure CSP billing statements
    ├── aliyun_direct.rs # Alibaba Cloud Direct
    ├── aliyun_usage_detail.rs # Aliyun usage detail XLSX
    ├── ucloud.rs # UCloud XLSX
    ├── lokalise.rs # Lokalise SaaS
    ├── sentry.rs # Sentry SaaS
    ├── mux.rs # Mux SaaS
    ├── chargebee.rs # Chargebee SaaS (multiple qty/price formats)
    ├── edgenext.rs # Edgenext/mCloud CDN (DD/MM/YYYY dates)
    ├── hubspot.rs # HubSpot receipts
    ├── slack.rs # Slack billing statements
    ├── atlassian.rs # Atlassian (spaced-out PDF text)
    ├── metaage_akamai.rs # MetaAge Akamai (spaced digits)
    ├── vnetwork.rs # VNetwork tax invoices
    ├── tencent_edgeone.rs # Tencent EdgeOne XLSX (6-14 column formats)
    ├── datastar.rs # Data Star Limited
    ├── digicentre_hk.rs # Digicentre HK
    ├── cloudmile.rs # CloudMile
    ├── contentsquare.rs # Contentsquare
    ├── reachtop.rs # Reachtop KSHK
    ├── generic_consultant.rs # ZQ Technology
    ├── vnis_invoice.rs # VNIS/MCDN invoices
    ├── vnis_summary.rs # VNIS summary XLSX
    ├── azure_plan_daily.rs # Azure Plan daily billing XLSX
    ├── google_workspace_billing.rs # Google Workspace billing XLSX
    ├── cdn_overage.rs # CDN overage detail
    └── cdn_traffic.rs # CDN traffic data (non-invoice)
tests/parser_tests.rs # Integration tests (real PDF/XLSX fixtures)
examples/pdf/ # Test fixture files
```
## Format Detection Order (CRITICAL)
**File:** `src/parsers/format_detector.rs` — Order determines correctness.
```
1. Atlassian (strip-spaces detection: "a tla ssia n")
2. Contentsquare (may contain \0 chars)
3. Slack ("slack technologies")
4. Lokalise, Sentry, Mux (unique identifiers)
5. Chargebee ("powered by chargebee")
6. HubSpot ("hubspot")
7. Reachtop ("tcsp licence")
8. CdnTraffic ("edge bytes sum")
9. VnisSummary ("mlytics share" + "amount to pay mlytics")
10. VNetwork ("vnetwork" + "tax invoice")
11. VnisInvoice ("mcdn service fee" strip-spaces)
12. MlyticsConsolidated (invoice + billed to + billing context)
13. CdnOverageDetail ("超量明细")
14. TencentEdgeOne ("边缘安全加速平台" / "edgeone套餐")
15. GoogleWorkspaceBilling (TSV header match)
16. AzurePlanDaily ("訂單單號")
17. AliyunUsageDetail (TSV header match)
18. AzureCsp ("csp-azure#")
19. MicrofusionGcpUsage (TSV header match)
20. MetaageAkamai ("邁達特" / "akamai 繳費通知單")
21. GenericConsultant ("zq technology")
22. Microfusion / 宏庭科技 (before "microsoft"/"azure")
23. eCloudValley / 伊雲谷 (before "aws"/"amazon")
24. Edgenext ("edgenext")
25. DataStar ("data star" / "tencent eo")
26. DigicentreHk ("digicentre")
27. CloudMile ("cloudmile")
28. AliyunDirect (before generic)
29. UCloud (before generic)
30. AwsDirect, GoogleCloud, Azure (generic fallbacks)
```
Reseller invoices contain upstream provider names, so a reseller check placed after its upstream provider's check never runs and the invoice is misidentified.
NEVER reorder without updating tests.
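
The ordering rule can be illustrated with a reduced, dependency-free sketch. `FormatSketch`, `strip_spaces`, `detect_format_sketch`, and the marker substrings are assumptions made for illustration only; the real `detect_format` covers all 30 steps and may match differently.

```rust
/// Hypothetical, reduced stand-in for `DocumentFormat` (the real enum has many more variants).
#[derive(Debug, PartialEq)]
enum FormatSketch {
    Atlassian,
    ECloudValley,
    AwsDirect,
    Unknown,
}

/// Lowercases the text and removes all whitespace, so "A tla ssia n" becomes "atlassian".
fn strip_spaces(text: &str) -> String {
    text.to_lowercase()
        .chars()
        .filter(|c| !c.is_whitespace())
        .collect()
}

/// Checks the most specific markers first and falls through to generic ones.
fn detect_format_sketch(text: &str) -> FormatSketch {
    let lower = text.to_lowercase();
    let compact = strip_spaces(text);

    // Spaced-out PDF text is only recognizable once whitespace is removed.
    if compact.contains("atlassian") {
        return FormatSketch::Atlassian;
    }
    // The reseller check must precede the upstream provider check,
    // because reseller invoices also mention "aws" / "amazon".
    if lower.contains("ecloudvalley") || text.contains("伊雲谷") {
        return FormatSketch::ECloudValley;
    }
    // Generic fallback (assumed marker strings).
    if lower.contains("amazon web services") || lower.contains("aws") {
        return FormatSketch::AwsDirect;
    }
    FormatSketch::Unknown
}
```

Swapping the second and third checks would classify every eCloudValley invoice that mentions Amazon Web Services as `AwsDirect`, which is the failure mode the ordering guards against.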
## Code Style
### Imports
```rust
use crate::models::{Currency, DocumentFormat, Invoice, InvoiceType, LineItem, Party};
use crate::parsers::common::parse_amount;
use chrono::NaiveDate; // external crates after crate imports
use lazy_static::lazy_static;
use regex::Regex;
```
### Regex: always lazy_static
```rust
lazy_static! {
    // "$ 22,169.00 5% $ 21,061.00"
    static ref PATTERN: Regex = Regex::new(r"...").unwrap();
}
```
Comments showing the exact input format a regex matches are required.
### Multi-language: match both Chinese and English punctuation
```rust
Regex::new(r"account\s*no[::]") // [::] matches fullwidth/ASCII colons
```
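
The same idea can be shown without the `regex` crate. The helper below is a hypothetical, std-only sketch (the name `value_after_label` and the sample labels are assumptions); the parsers themselves use `lazy_static` regexes as described above.

```rust
/// Returns the text after `label` when the label is followed by an
/// ASCII ':' or a fullwidth ':' (U+FF1A) colon; otherwise `None`.
fn value_after_label<'a>(line: &'a str, label: &str) -> Option<&'a str> {
    let rest = line.trim_start().strip_prefix(label)?.trim_start();
    let rest = rest
        .strip_prefix(':')
        .or_else(|| rest.strip_prefix(':'))?;
    Some(rest.trim())
}
```

Accepting both colon forms in one place keeps Chinese and English invoice layouts on the same code path.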
### Extraction pattern
```rust
fn extract_X(text: &str) -> Option<String> {
    PATTERN.captures(text)
        .and_then(|caps| caps.get(1))
        .map(|m| m.as_str().trim().to_string())
}
```
### Error handling
- `thiserror` for `InvoiceParserError` variants
- `pub type Result<T> = std::result::Result<T, InvoiceParserError>;`
- Parsers return `Invoice` (not Result) — always produce best-effort result
- `Option<T>` for fields that may not be extractable
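
A minimal sketch of the best-effort convention, assuming a reduced two-field model. `InvoiceSketch`, `field_after`, `parse_sketch`, and the label strings are hypothetical and not part of the crate.

```rust
/// Hypothetical, reduced stand-in for the crate's `Invoice` model.
#[derive(Debug, PartialEq)]
struct InvoiceSketch {
    invoice_number: Option<String>,
    customer_id: Option<String>,
}

/// Finds the first line starting with `label` and returns the trimmed
/// remainder, or `None` when the label is absent or the value is empty.
fn field_after(text: &str, label: &str) -> Option<String> {
    text.lines()
        .find_map(|line| line.trim().strip_prefix(label))
        .map(|rest| rest.trim().to_string())
        .filter(|value| !value.is_empty())
}

/// Never fails: fields that cannot be extracted stay `None`.
fn parse_sketch(text: &str) -> InvoiceSketch {
    InvoiceSketch {
        invoice_number: field_after(text, "Invoice No:"),
        customer_id: field_after(text, "Customer ID:"),
    }
}
```

Returning the struct directly means a partially readable document still yields whatever fields were found, while `Result` stays reserved for failures such as unreadable files.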
### Naming
- Parser files: `snake_case` vendor name (`ecloudvalley.rs`, `azure_csp.rs`)
- Each parser exposes `pub fn parse(text: &str) -> Invoice`
- DocumentFormat variants: `PascalCase` (`ECloudValleyAws`, `MlyticsConsolidated`)
- Regex statics: `SCREAMING_SNAKE_CASE` (`BILLING_CYCLE_PATTERN`)
### Line item deduplication
```rust
let key = format!("{}:{}:{}", description, amount_str, detail_pos);
if seen.contains(&key) { continue; }
seen.insert(key);
```
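
For context, the fragment above could sit in a loop like the following sketch. `RawItem` and `dedup_items` are hypothetical names, and treating `detail_pos` as the item's position within its detail section is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical, reduced line item used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct RawItem {
    description: String,
    amount_str: String,
    detail_pos: usize,
}

/// Keeps the first occurrence of each (description, amount, position) key,
/// preserving the original order.
fn dedup_items(items: Vec<RawItem>) -> Vec<RawItem> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for item in items {
        let key = format!("{}:{}:{}", item.description, item.amount_str, item.detail_pos);
        // `insert` returns false when the key was already present.
        if !seen.insert(key) {
            continue;
        }
        unique.push(item);
    }
    unique
}
```

Including the position in the key keeps two genuinely separate charges with the same description and amount, while dropping the same row extracted twice.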
## Field Mapping by Vendor
| Vendor | Invoice number | Customer name | Account ID | Date source |
|---|---|---|---|---|
| eCloudValley | None | Account alias | Account No | From billing cycle |
| Microfusion | Invoice No | Customer ID + Name | Customer ID | From text |
| AliyunDirect | Invoice No | Customer Name | Customer Name | From text |
| UCloud | None | From filename | From filename | From text |
| MlyticsConsolidated | MLT number | "Mlytics Limited" | Customer ID | From billing period |
| AzureCsp | None | From filename | CSP-AZURE# ID | From filename |
| AwsDirect/Elite | Account Number | None | Account Number | From billing statement |
| AliyunUsageDetail | None | None | User ID | From billing cycle |
| MicrofusionGcpUsage | None | Billing acct name | Billing acct ID | None |
| Chargebee | Invoice No | None | None | From text |
| Edgenext | Invoice No | Bill To name | None | From line item dates |
| HubSpot | Receipt # | None | Hub ID | From payment date |
| Slack | Invoice # | None | None | From invoice date |
| Atlassian | IN-xxx-xxx-xxx | None | None | From billing period text |
| VNetwork | Invoice No | None | None | From invoice date |
| TencentEdgeOne | None | None | UIN | None |
| MetaageAkamai | None | None | None | From text |
## Adding a New Vendor
1. Create `src/parsers/newvendor.rs` with `pub fn parse(text: &str) -> Invoice`
2. Add `DocumentFormat::NewVendor` in `models.rs` (both enum + Display impl)
3. Add detection in `format_detector.rs` — **ORDER MATTERS**
4. Declare the module in `src/parsers/mod.rs` and route it in `invoice.rs` (import + match arm)
5. Add test fixtures to `examples/pdf/`
6. Write integration tests in `tests/parser_tests.rs`
7. Run `cargo test && cargo clippy --all-targets --all-features -- -D warnings`
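
Steps 1 to 4 can be sketched in a single self-contained file. Everything here is a hypothetical, reduced stand-in: the two-variant `DocumentFormat`, the two-field `Invoice`, the marker string `"newvendor inc"`, and the function names `parse_newvendor`, `detect_format`, and `parse_text` are assumptions, not the crate's actual definitions.

```rust
use std::fmt;

/// Reduced stand-in for the enum in `models.rs` (step 2).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DocumentFormat {
    NewVendor,
    Unknown,
}

impl fmt::Display for DocumentFormat {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Every new variant also needs a Display arm.
        f.write_str(match self {
            DocumentFormat::NewVendor => "NewVendor",
            DocumentFormat::Unknown => "Unknown",
        })
    }
}

/// Reduced stand-in for the `Invoice` model.
#[derive(Debug, PartialEq)]
struct Invoice {
    format: DocumentFormat,
    invoice_number: Option<String>,
}

/// Step 1: the vendor parser; in the crate this would be `newvendor::parse`.
fn parse_newvendor(text: &str) -> Invoice {
    let invoice_number = text
        .lines()
        .find_map(|line| line.trim().strip_prefix("Invoice No:"))
        .map(|rest| rest.trim().to_string());
    Invoice { format: DocumentFormat::NewVendor, invoice_number }
}

/// Step 3: detection; in the real detector the position of this check matters.
fn detect_format(text: &str) -> DocumentFormat {
    if text.to_lowercase().contains("newvendor inc") {
        DocumentFormat::NewVendor
    } else {
        DocumentFormat::Unknown
    }
}

/// Step 4: routing match arm, as the orchestrator in `invoice.rs` would do.
fn parse_text(text: &str) -> Invoice {
    match detect_format(text) {
        DocumentFormat::NewVendor => parse_newvendor(text),
        DocumentFormat::Unknown => Invoice {
            format: DocumentFormat::Unknown,
            invoice_number: None,
        },
    }
}
```

Because the routing `match` is exhaustive, the compiler flags a new variant that has not yet been given an arm.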
## Anti-Patterns
- **NEVER** assume PDF text is well-formed (numbers concatenate, words split, chars spaced out)
- **NEVER** skip line item deduplication
- **NEVER** suppress warnings without justification
- **ALWAYS** use billing cycle end date as `invoice_date` (not payment due date)
- **ALWAYS** filter summary headers from line items
- **ALWAYS** handle `$` before and after numbers (`$100` and `100$`)
- **ALWAYS** round monetary amounts to 2 decimal places (except sub-cent amounts like 0.0031)
- **ALWAYS** handle parenthesized amounts as credits/payments, not line items
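
The amount-related rules above can be sketched as follows. This is a dependency-free illustration that uses `f64` for brevity, whereas the crate lists `rust_decimal` for precise arithmetic. `Amount`, `round_money`, and `parse_amount_sketch` are hypothetical names and do not describe the real `parse_amount()` in `common.rs`.

```rust
/// Hypothetical parsed amount: an ordinary charge, or a parenthesized credit/payment.
#[derive(Debug, PartialEq)]
enum Amount {
    Charge(f64),
    Credit(f64),
}

/// Rounds to 2 decimal places, but leaves non-zero sub-cent values untouched.
fn round_money(value: f64) -> f64 {
    if value != 0.0 && value.abs() < 0.01 {
        value
    } else {
        (value * 100.0).round() / 100.0
    }
}

/// Accepts "$100", "100$", "$ 1,234.50" and "(25.00)"; returns `None` for non-numeric text.
fn parse_amount_sketch(raw: &str) -> Option<Amount> {
    let trimmed = raw.trim();
    let is_credit = trimmed.starts_with('(') && trimmed.ends_with(')');
    // Drop currency signs, thousands separators, parentheses, and whitespace.
    let digits: String = trimmed
        .chars()
        .filter(|&c| !matches!(c, '$' | ',' | '(' | ')') && !c.is_whitespace())
        .collect();
    // Reject anything that is not a plain decimal number (e.g. "n/a", "nan").
    if digits.is_empty() || !digits.chars().all(|c| c.is_ascii_digit() || c == '.' || c == '-') {
        return None;
    }
    let value = round_money(digits.parse::<f64>().ok()?);
    Some(if is_credit {
        Amount::Credit(value)
    } else {
        Amount::Charge(value)
    })
}
```

Returning a distinct `Credit` variant lets the caller keep parenthesized amounts out of the line item list instead of silently treating them as charges.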
## Dependencies
| Crate | Purpose |
|---|---|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date handling |
| serde, serde_json | Serialization |
| thiserror | Error types |
| rust_decimal | Precise decimal arithmetic |