# AGENTS.md - br-invoice-parser
> AI Agent Knowledge Base for the br-invoice-parser Rust library
## Project Overview
**br-invoice-parser** is a Rust library for parsing cloud provider invoices (PDF/XLSX) into structured data. It supports multiple vendors including AWS, Alibaba Cloud, UCloud, and their resellers (eCloudValley, Microfusion).
- **Package name:** `br-invoice-parser`
- **Library name:** `invoice_parser` (users import with `use invoice_parser::*`)
- **Edition:** Rust 2021
- **Total:** ~2,400 lines of Rust code
## Architecture
### Two-Layer Design
```
┌─────────────────────────────────────────────────────────┐
│                   Public API (lib.rs)                   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                    Extractors Layer                     │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                      Parsers Layer                      │
│  format_detector.rs → routes to vendor-specific parser  │
│ ┌────────────┬──────────────┬─────────────────┬───────┐ │
│ │ aws_direct │ ecloudvalley │   microfusion   │ aliyun│ │
│ │            │(AWS reseller)│(Aliyun reseller)│       │ │
│ └────────────┴──────────────┴─────────────────┴───────┘ │
│  common.rs (shared utilities)                           │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   Models (models.rs)                    │
└─────────────────────────────────────────────────────────┘
```
### Module Structure
```
src/
├── lib.rs                  # Public API + re-exports + unit tests
├── error.rs                # InvoiceParserError (thiserror)
├── models.rs               # Data structures (598 lines)
├── extractors/
│   ├── mod.rs
│   ├── pdf.rs              # PdfExtractor
│   └── xlsx.rs             # XlsxExtractor
└── parsers/
    ├── mod.rs
    ├── invoice.rs          # InvoiceParser orchestrator
    ├── format_detector.rs  # Vendor detection
    ├── common.rs           # Shared parsing utilities
    ├── aws_direct.rs       # AWS Direct invoices
    ├── ecloudvalley.rs     # eCloudValley (AWS reseller)
    ├── microfusion.rs      # Microfusion (Aliyun reseller)
    ├── aliyun_direct.rs
    └── ucloud.rs           # UCloud XLSX
```
## Critical Warnings
### 1. Format Detection Order is CRITICAL
**Location:** `src/parsers/format_detector.rs`
```rust
// NEVER reorder these checks!
// Microfusion invoices may contain "Microsoft" in the footer.
// eCloudValley invoices contain "AWS" in their content.
//
// Required order:
// 1. Check Microfusion BEFORE Microsoft/Azure
// 2. Check eCloudValley BEFORE AWS
// 3. Check Aliyun BEFORE generic patterns
```
**Why:** Reseller invoices contain their upstream provider's name, so checking in the wrong order misidentifies the format.
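The ordering rule can be illustrated with a minimal, stdlib-only sketch. The `DocumentFormat` variants and the keyword strings below are assumptions made for illustration; the real checks in `format_detector.rs` may use different variants and patterns.

```rust
/// Hypothetical subset of the crate's `DocumentFormat` enum (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocumentFormat {
    Microfusion,
    ECloudValley,
    Aliyun,
    AwsDirect,
    Unknown,
}

/// Checks resellers before the providers whose names their invoices embed.
/// The keywords are assumed, not taken from the real detector.
pub fn detect_format(text: &str) -> DocumentFormat {
    let text_lower = text.to_lowercase();

    // Resellers first: their invoices also mention the upstream provider.
    if text_lower.contains("microfusion") {
        return DocumentFormat::Microfusion;
    }
    if text_lower.contains("ecloudvalley") {
        return DocumentFormat::ECloudValley;
    }

    // Direct providers only after every reseller has been ruled out.
    if text_lower.contains("aliyun") || text_lower.contains("alibaba cloud") {
        return DocumentFormat::Aliyun;
    }
    if text_lower.contains("amazon web services") || text_lower.contains("aws") {
        return DocumentFormat::AwsDirect;
    }

    DocumentFormat::Unknown
}
```

If the AWS check ran first in this sketch, an eCloudValley invoice that mentions "AWS" would be classified as `AwsDirect`, which is the failure the warning describes.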
### 2. PDF Extraction Produces Malformed Data
**Location:** `src/parsers/ecloudvalley.rs`
PDF text extraction is unreliable. Numbers can be:
- **Concatenated:** `"$ 0.105650.0000000184"` (unit_price + usage merged)
- **Split across lines:** `"$\n\n0.00000024\n usage"`
- **Missing whitespace:** `"per GB$0.023"`
**Solution:** The parser uses 8+ regex patterns to cover these variations. All of them are necessary, so do not remove any.
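As a rough illustration of the concatenation case, the sketch below splits two fused decimals without regex. It is not the parser's actual logic. The function name is hypothetical, and the heuristic assumes the second number has a single-digit integer part, which will not hold for every invoice.

```rust
/// Splits two decimals that PDF extraction fused together, for example
/// "$ 0.105650.0000000184" (unit price followed by usage).
///
/// ASSUMPTION: the second number has a single-digit integer part, so the
/// split point is one character before the second decimal point.
pub fn split_merged_decimals(raw: &str) -> Option<(f64, f64)> {
    // Drop the currency symbol and surrounding whitespace.
    let s = raw.trim().trim_start_matches('$').trim();

    // Exactly two decimal points means exactly two fused numbers.
    let dots: Vec<usize> = s.match_indices('.').map(|(i, _)| i).collect();
    if dots.len() != 2 {
        return None;
    }

    // The digit right before the second '.' starts the second number.
    let split_at = dots[1].checked_sub(1)?;
    // The first number needs at least one fractional digit.
    if split_at < dots[0] + 2 {
        return None;
    }

    // `get` returns None instead of panicking on a non-char boundary.
    let first = s.get(..split_at)?.parse::<f64>().ok()?;
    let second = s.get(split_at..)?.parse::<f64>().ok()?;
    Some((first, second))
}
```

The split is inherently ambiguous (the same input could also be read as `0.1056` and `50.0000000184`), which is one reason several patterns and cross-checks against the line total are useful in practice.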
### 3. Deduplication is Required
Multiple regex patterns may match the same line item. Always use composite keys:
```rust
let key = format!("{}:{}:{}", description, amount_str, text_position);
```
Include the text position in the key so that legitimate duplicates (same description and amount at different positions in the document) are kept.
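A minimal sketch of deduplication with that composite key, using a `HashSet`. The `Candidate` struct and `dedup_candidates` function are hypothetical names for illustration and are not taken from the crate.

```rust
use std::collections::HashSet;

/// Hypothetical line-item candidate produced by one regex match.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub description: String,
    pub amount_str: String,
    /// Byte offset of the match in the extracted text.
    pub text_position: usize,
}

/// Keeps the first candidate for each (description, amount, position) key
/// and preserves the original order.
pub fn dedup_candidates(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let mut seen = HashSet::new();
    candidates
        .into_iter()
        .filter(|c| {
            let key = format!("{}:{}:{}", c.description, c.amount_str, c.text_position);
            // `insert` returns false when the key was already present.
            seen.insert(key)
        })
        .collect()
}
```

Two patterns matching the same text at the same offset collapse into one item, while an identical charge that appears later in the document survives because its position differs.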
## Coding Conventions
### Pattern: lazy_static for Regex
```rust
lazy_static! {
    static ref PATTERN_NAME: Regex = Regex::new(r"...").unwrap();
    static ref PATTERNS: Vec<Regex> = vec![...];
}
```
Each regex is compiled only once, lazily on first use, and then reused for the lifetime of the process.
### Pattern: Option<T> for Nullable Fields
```rust
pub struct Invoice {
    pub invoice_number: Option<String>, // May not be found
    pub total_amount: f64,              // Always required
}
```
### Pattern: Extraction Functions
```rust
pub fn extract_something(text: &str) -> Option<String> {
    // Try each pattern in order and return the first non-empty capture.
    for pattern in PATTERNS.iter() {
        if let Some(caps) = pattern.captures(text) {
            if let Some(m) = caps.get(1) {
                let value = m.as_str().trim();
                if !value.is_empty() {
                    return Some(value.to_string());
                }
            }
        }
    }
    None
}
```
### Pattern: Multi-Language Support
Support both English and Chinese colons:
```rust
Regex::new(r"(?i)account\s*no[:：]\s*(\d+)") // [:：] matches ASCII ':' and full-width '：'
```
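The same idea can be shown without the `regex` crate. The sketch below is a stdlib-only alternative, not the library's approach, and `value_after_colon` is a hypothetical helper name. It splits a label from its value on either colon variant.

```rust
/// Returns the trimmed text after the first ASCII ':' or full-width '：'.
/// Hypothetical helper for illustration; the crate itself uses regex patterns.
pub fn value_after_colon(line: &str) -> Option<&str> {
    // A closure over `char` is a valid pattern for `split_once`.
    let (_, value) = line.split_once(|c: char| c == ':' || c == '：')?;
    let value = value.trim();
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}
```

Because the full-width colon is a multi-byte character, splitting by `char` rather than by byte offset avoids slicing in the middle of a code point.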
### Error Handling
```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum InvoiceParserError {
    // `#[from]` lets `?` convert std::io::Error into this variant.
    #[error("Failed to read file: {0}")]
    FileReadError(#[from] std::io::Error),
}

pub type Result<T> = std::result::Result<T, InvoiceParserError>;
```
## Testing
### Test Organization
- **Integration tests:** `tests/parser_tests.rs` (42 tests)
- **Unit tests:** Embedded in `src/lib.rs` (23 tests)
- **Format detection tests:** `src/parsers/format_detector.rs` (6 tests)
### Test Data
Real invoice files in `examples/pdf/`:
- 13 PDF files from various vendors
- 2 XLSX files (UCloud)
### Test Naming Convention
```
test_<vendor>_<feature>_<aspect>
Examples:
- test_aws_direct_invoice_parsing
- test_ecloudvalley_invoice_line_items
- test_microfusion_invoice_customer
```
### Running Tests
```bash
cargo test # All tests
cargo test ecloudvalley # Filter by name
cargo clippy --all-targets # Lint check
```
## Adding a New Vendor Parser
1. **Create parser file:** `src/parsers/newvendor.rs`
2. **Add to format_detector.rs** (ORDER MATTERS!):
```rust
if text_lower.contains("newvendor") {
    return DocumentFormat::NewVendor;
}
```
3. **Add DocumentFormat variant:** `src/models.rs`
4. **Register in invoice.rs:**
```rust
DocumentFormat::NewVendor => newvendor::parse(text),
```
5. **Add test files:** `examples/pdf/`
6. **Write tests:** `tests/parser_tests.rs`
## Common Gotchas
### Date Extraction
Invoice date should come from **Billing Cycle end date**, not Payment Due Date:
```
Billing Cycle: December 1 - December 31, 2025 → invoice_date = 2025-12-31
```
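A stdlib-only sketch of that rule follows. The function name and the accepted input shape (`<label>: <start> - <Month> <day>, <year>`) are assumptions for illustration. The crate lists `chrono` as a dependency and may parse dates differently.

```rust
/// Extracts the billing cycle end date as "YYYY-MM-DD" from a line such as
/// "Billing Cycle: December 1 - December 31, 2025".
/// Hypothetical helper; only English month names are recognized here.
pub fn billing_cycle_end_date(line: &str) -> Option<String> {
    const MONTHS: [&str; 12] = [
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December",
    ];

    // Keep the part after the label, then the part after the range separator.
    let (_, range) = line.split_once(':')?;
    let (_, end) = range.split_once(" - ")?;

    // `end` now looks like "December 31, 2025".
    let (month_day, year) = end.trim().split_once(',')?;
    let mut parts = month_day.split_whitespace();
    let month_name = parts.next()?;
    let day: u32 = parts.next()?.parse().ok()?;
    let year: i32 = year.trim().parse().ok()?;

    let month = MONTHS
        .iter()
        .position(|m| m.eq_ignore_ascii_case(month_name))?
        + 1;
    if !(1..=31).contains(&day) {
        return None;
    }

    Some(format!("{:04}-{:02}-{:02}", year, month, day))
}
```

This sketch only range-checks the day. It does not validate that the day exists in the given month, which a real date type would do.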
### Line Item Filtering
Skip summary headers that look like line items:
```rust
}
```
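One possible shape for such a filter is sketched below. The keyword list and the `is_summary_header` name are assumptions made for illustration and are not the crate's actual rules.

```rust
/// Returns true when a candidate description looks like a summary header
/// rather than a billable line item.
/// ASSUMPTION: the prefixes below are examples only, not the crate's list.
pub fn is_summary_header(description: &str) -> bool {
    const SUMMARY_PREFIXES: [&str; 3] = ["total", "subtotal", "summary"];

    let normalized = description.trim().to_lowercase();
    SUMMARY_PREFIXES
        .iter()
        .any(|prefix| normalized.starts_with(prefix))
}
```

A parser would call this before pushing a candidate into the line-item list and skip any description for which it returns `true`.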
### Customer Name Extraction
Avoid false positives:
```rust
let lower = name.to_lowercase();
if !lower.starts_with("billing")
    && !lower.contains("invoice") // case-insensitive, like the "billing" check
    && !name.contains(':')
{
    // Valid customer name
}
```
### Zero Division Protection
```rust
if subtotal > 0.0 {
    discount_rate = (discount.abs() / subtotal) * 100.0;
}
```
## Dependencies
| Crate | Purpose |
|-------|---------|
| pdf-extract, lopdf | PDF text extraction |
| calamine | Excel/XLSX reading |
| regex, lazy_static | Pattern matching |
| chrono | Date/time handling |
| serde, serde_json | Serialization |
| thiserror | Error handling |
| rust_decimal | Financial precision |
## Quick Reference
### Parse an Invoice
```rust
use invoice_parser::{parse_file, parse_pdf, parse_text};
// Auto-detect format
let result = parse_file("invoice.pdf")?;
// Or specific format
let result = parse_pdf("invoice.pdf")?;
// Or from text
let invoice = parse_text("Invoice #123\nTotal: $100.00")?;
```
### Access Results
```rust
for invoice in result.invoices {
    println!("Number: {:?}", invoice.invoice_number);
    println!("Total: {}", invoice.total_amount);
    println!("Currency: {:?}", invoice.currency);
    for item in &invoice.line_items {
        println!("  {} - ${}", item.description, item.amount);
    }
}
```
### Validate Invoice
```rust
let validation = invoice.validate();
if !validation.is_valid {
    for warning in &validation.warnings {
        eprintln!("Warning: {}", warning);
    }
}
```