oxidize-pdf 2.5.2

# "Contents must be a stream" Error - Root Cause Analysis

**Date**: 2025-10-22
**Status**: ✅ Root Cause Identified

## Summary

The error "Contents must be a stream or array of streams" occurs because the Contents field in both failing PDFs resolves to **Null** instead of a Stream or Array object.

## Affected PDFs

1. **BayWa**: `Factura1-2025-345 FREE MOUNTAIN SYSTEMS, S.L..pdf`
   - Contents field: `Reference(7, 0)` → resolves to `Null`

2. **Greencoat**: `Others - 251007 - Bann Road Solar Project Limited - 12307.pdf`
   - Contents field: `Reference(5, 0)` → resolves to `Null`

## Technical Details

### Current Code Behavior

Located in `oxidize-pdf-core/src/parser/document.rs:920-949`:

```rust
pub fn get_page_content_streams(&self, page: &ParsedPage) -> ParseResult<Vec<Vec<u8>>> {
    let mut streams = Vec::new();
    let options = self.options();

    if let Some(contents) = page.dict.get("Contents") {
        let resolved_contents = self.resolve(contents)?;

        match &resolved_contents {
            PdfObject::Stream(stream) => { ... }      // ✓ Handles Stream
            PdfObject::Array(array) => { ... }        // ✓ Handles Array
            _ => {                                    // ✗ Falls through for Null
                return Err(ParseError::SyntaxError {
                    position: 0,
                    message: "Contents must be a stream or array of streams".to_string(),
                })
            }
        }
    }

    Ok(streams)
}
```

### Why Contents Resolves to Null

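The referenced object (object 7 in BayWa, object 5 in Greencoat) is one of:
1. **Missing** from the PDF file (for example, a corrupted xref table with no usable entry for it)
2. **Explicitly null**: its xref entry is marked free, or the object body is the literal `null`
3. **Malformed**, so the parser falls back to treating it as null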

This is a **real error** in the PDF structure, but the error message is misleading: it reports a type mismatch and never mentions that the referenced object is null or missing.
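To illustrate the first case, here is a minimal sketch of how a reference to a missing object can end up as `Null`. The `Obj` enum, the `Objects` table, and its `resolve` method are simplified stand-ins invented for this example, not the actual oxidize-pdf types. The sketch assumes the resolver maps an unknown object number to null (which is what the PDF specification prescribes for references to undefined objects) rather than raising an error.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the parser's object type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Null,
    Integer(i64),
    Reference(u32, u16),
}

/// Toy object table keyed by (object number, generation number).
struct Objects {
    table: HashMap<(u32, u16), Obj>,
}

impl Objects {
    /// Follow a single reference. A reference with no entry in the table
    /// resolves to `Null` instead of producing an error.
    fn resolve(&self, obj: &Obj) -> Obj {
        match obj {
            Obj::Reference(number, generation) => self
                .table
                .get(&(*number, *generation))
                .cloned()
                .unwrap_or(Obj::Null),
            // Direct objects resolve to themselves.
            other => other.clone(),
        }
    }
}
```

With only object 3 present in the table, `resolve(&Obj::Reference(7, 0))` yields `Obj::Null`, which is the value the `Contents` match then has to handle.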

## Impact

- **Severity**: Medium - PDFs are malformed but could be handled gracefully
- **User Experience**: Poor - error message doesn't indicate the true problem
- **Workaround**: None - parser fails immediately

## Proposed Fix

### Option A: Better Error Message (Conservative)
```rust
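// `obj_id` and `gen_id` are not in scope once `resolve()` has run; capture them
// from the unresolved `contents` value (e.g. `Reference(7, 0)`) before the match.
PdfObject::Null => {
    return Err(ParseError::SyntaxError {
        position: 0,
        message: format!(
            "Contents field references object {} {} R which is null or missing",
            obj_id, gen_id
        ),
    })
}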
```

**Pros**: Clear error, user knows PDF is broken

**Cons**: Still fails, doesn't process other pages
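Option A's message needs `obj_id` and `gen_id`, which are only available on the unresolved reference, not on the resolved `Null`. The sketch below shows one way to build the message from the unresolved value. `Obj` and `null_contents_message` are hypothetical names for illustration, and the fallback wording for a direct (non-reference) null is an assumption.

```rust
/// Simplified stand-in for the parser's object type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Null,
    Reference(u32, u16),
}

/// Build the Option A message from the *unresolved* Contents value, so the
/// object and generation numbers are still available.
fn null_contents_message(unresolved: &Obj) -> String {
    match unresolved {
        Obj::Reference(obj_id, gen_id) => format!(
            "Contents field references object {} {} R which is null or missing",
            obj_id, gen_id
        ),
        // Contents was written inline as `null`, so there is no object id to report.
        _ => "Contents field is a direct null object".to_string(),
    }
}
```

For the BayWa file, calling `null_contents_message(&Obj::Reference(7, 0))` would report `object 7 0 R` in the error text.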

### Option B: Graceful Degradation (Robust)
```rust
PdfObject::Null => {
    eprintln!("Warning: Contents field is null (page may be blank)");
    // Return empty streams - treat as blank page
}
```

**Pros**: Parser continues, other pages can be processed

**Cons**: Silent failure may hide real issues

### Option C: Both (Recommended) ✅
```rust
PdfObject::Null => {
    // Log warning but continue
    eprintln!("Warning: Contents field references null object (page may be blank)");
    return Ok(Vec::new()); // Empty page
}
_ => {
    return Err(ParseError::SyntaxError {
        position: 0,
        message: format!("Contents must be a stream or array, found: {:?}", resolved_contents),
    })
}
```

**Pros**:
- Graceful for null (common in corrupted PDFs)
- Clear error for unexpected types
- Allows multi-page PDF processing to continue

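**Cons**: The warning is only printed to stderr, so Option B's risk of hiding a real issue still applies to null objects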
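The following self-contained sketch shows how the Option C match could behave end to end. It uses simplified stand-in types (`Obj`, `SyntaxError`) and a hypothetical `content_streams` function that takes the already-resolved `Contents` value. The stream and array arms are elided in the snippet above, so their handling here is an assumption, as is the wording of the array error.

```rust
/// Simplified stand-in for the parser's object type (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Obj {
    Null,
    Integer(i64),
    Stream(Vec<u8>),
    Array(Vec<Obj>),
}

/// Simplified stand-in for `ParseError::SyntaxError`.
#[derive(Debug, PartialEq)]
struct SyntaxError {
    position: usize,
    message: String,
}

/// `resolved` is the resolved Contents value, or `None` if the page has no Contents entry.
fn content_streams(resolved: Option<&Obj>) -> Result<Vec<Vec<u8>>, SyntaxError> {
    let mut streams = Vec::new();

    match resolved {
        // No Contents entry: the page is empty.
        None => {}
        Some(Obj::Stream(data)) => streams.push(data.clone()),
        Some(Obj::Array(items)) => {
            for item in items {
                match item {
                    Obj::Stream(data) => streams.push(data.clone()),
                    other => {
                        return Err(SyntaxError {
                            position: 0,
                            message: format!(
                                "Contents array must contain streams, found: {:?}",
                                other
                            ),
                        })
                    }
                }
            }
        }
        // Option C: warn and treat a null Contents as a blank page.
        Some(Obj::Null) => {
            eprintln!("Warning: Contents field references null object (page may be blank)");
        }
        // Any other type is still a hard error, now naming what was found.
        Some(other) => {
            return Err(SyntaxError {
                position: 0,
                message: format!("Contents must be a stream or array, found: {:?}", other),
            })
        }
    }

    Ok(streams)
}
```

In this sketch a null `Contents` returns `Ok` with an empty vector, while an unexpected type such as `Obj::Integer(5)` still fails and includes `Integer(5)` in the message.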

## Implementation Plan

1. Modify `get_page_content_streams()` in `document.rs:920-949`
2. Add explicit `PdfObject::Null` case before the `_` wildcard
3. Return `Ok(Vec::new())` for null (empty page)
4. Improve error message for other unexpected types
5. Add test case for PDFs with null contents

## Verification

After the fix, both PDFs should:
- ✅ Parse successfully (no error)
- ✅ Return empty content streams for page 0
- ✅ Allow table detection to run (will find no tables, which is correct)

## Related Issues

- The REPS PDF fails with a different error, "Invalid xref table" (separate fix needed)
- The ENcome PDF unexpectedly passes (may have been fixed already)