# Git Commit Rules

- **No Co-Authored-By lines.** All commits are authored solely by the repository owner. Never add `Co-Authored-By` trailers.

---

# ambers — Pure Rust Statistical File Reader

Like amber preserving ancient life, ambers safely captures and preserves data from statistical file formats. Native Rust, no C FFI bindings. Framework-agnostic Arrow output.

**Package name:** `ambers` (crates.io + PyPI). Python: `import ambers` (or `import ambers as am`).

## Milestone 1: SPSS Reader (COMPLETE)

### Rust API (`src/lib.rs`)

```rust
// Eager read
read_sav(path) -> Result<(RecordBatch, SpssMetadata)>
read_sav_from_reader(reader) -> Result<(RecordBatch, SpssMetadata)>
read_sav_metadata(path) -> Result<SpssMetadata>

// Streaming scanner with column projection + row limits
scan_sav(path) -> Result<SavScanner<BufReader<File>>>
scan_sav_from_reader(reader, batch_size) -> Result<SavScanner<R>>
```

- **Data output:** Arrow `RecordBatch` — zero-copy to Polars, DataFusion, DuckDB
- **Metadata output:** `SpssMetadata` — IndexMap-based (insertion-ordered), all fields use `variable_` prefix
- **Compression:** Uncompressed, bytecode (.sav), zlib (.zsav)
- **Tests:** 37 unit tests + 2 doc-tests, verified against 7 real-world files (3–22,070 rows, 75–677 columns)

### SpssMetadata fields

| Field | Type | Description |
|-------|------|-------------|
| `file_label` | `String` | File label set in SPSS |
| `file_encoding` | `String` | Character encoding (e.g. "UTF-8") |
| `compression` | `Compression` | None / Bytecode / Zlib |
| `creation_time` | `String` | Date created |
| `modification_time` | `String` | Time created (the header stores only a creation date and time, no modification stamp) |
| `notes` | `Vec<String>` | Document records |
| `number_rows` | `Option<i64>` | Row count from header |
| `number_columns` | `usize` | Visible variable count |
| `file_format` | `String` | "sav" or "zsav" |
| `variable_names` | `Vec<String>` | Ordered column names (defines Arrow schema order) |
| `variable_labels` | `IndexMap<String, String>` | Variable name -> label |
| `spss_variable_types` | `IndexMap<String, String>` | SPSS format strings: "F8.2", "A50" |
| `rust_variable_types` | `IndexMap<String, String>` | Rust types: "f64", "String" |
| `variable_value_labels` | `IndexMap<String, IndexMap<Value, String>>` | Per-variable value->label maps |
| `variable_measure` | `IndexMap<String, Measure>` | Nominal / Ordinal / Scale (from file) |
| `variable_alignment` | `IndexMap<String, Alignment>` | Left / Right / Center |
| `variable_display_width` | `IndexMap<String, u32>` | Display width |
| `variable_storage_width` | `IndexMap<String, usize>` | Storage width in bytes |
| `variable_missing` | `IndexMap<String, Vec<MissingSpec>>` | Missing value specs |
| `mr_sets` | `IndexMap<String, MrSet>` | Multiple response sets |
| `weight_variable` | `Option<String>` | Weight variable name |

Convenience methods: `label()`, `value_labels()`, `format()`, `measure()`
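A minimal eager-read sketch against the API above; the file name is a placeholder and the convenience-method return types are assumed to be `Option`s of string-like values:

```rust
use ambers::read_sav;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (batch, meta) = read_sav("file.sav")?;
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());

    // variable_names defines Arrow schema order; the IndexMap-based fields
    // iterate in the same dictionary order.
    for name in &meta.variable_names {
        if let Some(label) = meta.label(name) {
            println!("{name}: {label}");
        }
    }
    if let Some(fmt) = meta.format("age") {
        println!("age prints as {fmt}"); // e.g. "F8.2"
    }
    Ok(())
}
```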

---

## Milestone 2: PyO3 Python Bindings (COMPLETE)

### Python API

```python
import ambers

# Eager read
df, meta = ambers.read_sav("file.sav")            # Polars DataFrame + SpssMetadata
meta = ambers.read_sav_metadata("file.sav")        # metadata only (fast)

# Lazy read (returns Polars LazyFrame)
lf, meta = ambers.scan_sav("file.sav")
df = lf.select(["Q1", "Q2"]).head(1000).collect()  # projection + row limit pushdown

# SpssMetadata properties (same fields as Rust)
meta.variable_names       # list[str]
meta.variable_labels      # dict[str, str]
meta.variable_value_labels # dict[str, dict[float|str, str]]
meta.compression          # str: "none", "bytecode", "zlib"
meta.variable_measure     # dict[str, str]: "nominal", "ordinal", "scale"

# Convenience methods
meta.label("age")         # str | None
meta.format("age")        # str | None (e.g. "F8.2")
meta.measure("age")       # str | None
```

### Architecture

Single crate with optional `python` feature for PyO3 bindings:

```
ambers/
├── src/                    Pure Rust library (crates.io)
│   └── python/mod.rs       PyO3 bindings (feature-gated)
├── python/ambers/          Python package (PyPI: "ambers")
│   ├── __init__.py         Wraps native module, PyCapsule → Polars
│   └── __init__.pyi        Type stubs
└── pyproject.toml          maturin build config
```

**Data flow (PyCapsule — no PyArrow dependency):**
```
Rust RecordBatch → PyArrowData.__arrow_c_stream__() → PyCapsule(FFI_ArrowArrayStream) → pl.from_arrow() → Polars DataFrame
```

Uses the Arrow PyCapsule Interface (see references below) for zero-copy transfer.
`pyarrow` is NOT a runtime dependency. Polars >= 1.3 consumes PyCapsules natively.
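A condensed sketch of that export path (the `ArrowData` name and single-batch body are stand-ins; the real `PyArrowData` lives in `src/python/mod.rs`):

```rust
use std::ffi::CString;

use arrow::array::{RecordBatch, RecordBatchIterator};
use arrow::ffi_stream::FFI_ArrowArrayStream;
use pyo3::prelude::*;
use pyo3::types::PyCapsule;

#[pyclass]
struct ArrowData {
    batch: RecordBatch, // one eager batch; the real type can hold a stream
}

#[pymethods]
impl ArrowData {
    /// Arrow PyCapsule Interface export. Per the spec, `requested_schema`
    /// must default to None; this sketch ignores it.
    #[pyo3(signature = (requested_schema=None))]
    fn __arrow_c_stream__<'py>(
        &self,
        py: Python<'py>,
        requested_schema: Option<Bound<'py, PyAny>>,
    ) -> PyResult<Bound<'py, PyCapsule>> {
        let _ = requested_schema;
        let schema = self.batch.schema();
        let reader = RecordBatchIterator::new([Ok(self.batch.clone())], schema);
        let stream = FFI_ArrowArrayStream::new(Box::new(reader));
        // The capsule name MUST be "arrow_array_stream" for stream exports.
        let name = CString::new("arrow_array_stream").unwrap(); // static, no NUL
        PyCapsule::new(py, stream, Some(name))
    }
}
```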

**Lazy data flow (scan_sav):**
```
Python scan_sav() → register_io_source(schema, _source) → LazyFrame
  └─ _source() calls _SavBatchReader.next_batch() → PyArrowData (PyCapsule) → pl.from_arrow() → yield DataFrame
```

Supports column projection pushdown, row limit pushdown, and per-batch predicate filtering.

### Build & Install

```bash
# Development build
uv venv --python 3.13 .venv
source .venv/Scripts/activate  # or .venv/bin/activate on Linux/Mac
uv pip install maturin polars
maturin develop --release

# Verify
python -c "import ambers; df, meta = ambers.read_sav('file.sav'); print(df.shape)"
```

---

## Project Structure

```
ambers/                             Repo root (single crate)
  Cargo.toml                        Crate config (python feature = optional PyO3)
  pyproject.toml                    maturin build config
  CLAUDE.md                         This file
  python/
    ambers/
      __init__.py                   Python API wrapper (read_sav, scan_sav, read_sav_metadata)
      __init__.pyi                  Type stubs
  src/
    lib.rs                          Public API + re-exports
    main.rs                         CLI binary for testing
    scanner.rs                      SavScanner: streaming batch reader
    columnar.rs                     ColumnarBatchBuilder: direct-to-Arrow column construction
    arrow_convert.rs                Arrow Schema builder
    error.rs                        SpssError enum (thiserror)
    constants.rs                    SYSMIS, enums (Compression, Measure, Alignment, VarType)
    io_utils.rs                     SavReader<R> with endian-aware reads
    header.rs                       176-byte file header parsing
    encoding.rs                     Code page -> encoding_rs mapping
    variable.rs                     Type 2 variable records, MissingValues enum
    value_labels.rs                 Type 3+4 value label records
    document.rs                     Type 6 document records
    metadata.rs                     SpssMetadata struct, Value enum, MissingSpec, MrSet
    dictionary.rs                   Record dispatch + post-dictionary resolution
    info_records/                   Subtype dispatch (3,4,11,13,14,20,21,22)
    compression/
      bytecode.rs                   Stateful bytecode decompressor
      zlib.rs                       ZSAV zheader/ztrailer + flate2
    python/
      mod.rs                        PyO3 bindings: PyArrowData (PyCapsule), PySavBatchReader, PySpssMetadata
  test_data/                        Real-world .sav files for benchmarking (gitignored)
```

---

## SAV Binary Format Quick Reference

### Record types (dictionary section)
- **Type 2** — Variable record (short name, type, width, label, missing values, format)
- **Type 3+4** — Value labels (type 3 = label pairs, type 4 = variable indices, always paired)
- **Type 6** — Document record (80-byte lines of notes)
- **Type 7** — Info/extension records (dispatched by subtype)
- **Type 999** — Dictionary termination, data follows
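A hypothetical sketch of the dispatch this implies (the real loop, with actual parsers, lives in `src/dictionary.rs`):

```rust
/// Hypothetical stand-in for the record dispatch in dictionary.rs.
#[derive(Debug)]
enum Record {
    Variable,        // type 2
    ValueLabels,     // type 3 (a type 4 index record must follow)
    Documents,       // type 6
    Info,            // type 7, dispatched further by subtype
    EndOfDictionary, // type 999: read the 4-byte filler, then data begins
}

fn classify(record_type: i32) -> Result<Record, String> {
    match record_type {
        2 => Ok(Record::Variable),
        3 => Ok(Record::ValueLabels),
        6 => Ok(Record::Documents),
        7 => Ok(Record::Info),
        999 => Ok(Record::EndOfDictionary),
        t => Err(format!("unexpected record type {t}")),
    }
}

fn main() {
    for t in [2, 3, 6, 7, 999] {
        println!("{t} -> {:?}", classify(t).unwrap());
    }
}
```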

### Info record subtypes
- **3** — Integer info: endianness, character code page
- **4** — Float info: SYSMIS, highest, lowest bit patterns
- **11** — Variable display: measure, display_width, alignment per variable
- **13** — Long variable names: `SHORT=LongName` tab-separated pairs
- **14** — Very long strings: `VARNAME=WIDTH` for strings > 255 bytes
- **20** — Encoding name (overrides subtype 3 code page)
- **21** — Long string value labels (pascal-string format)
- **22** — Long string missing values
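For instance, the subtype 13 payload is plain text; a sketch of splitting it into (short, long) pairs:

```rust
/// Split a subtype-13 payload of tab-separated "SHORT=LongName" pairs.
fn parse_long_names(payload: &str) -> Vec<(&str, &str)> {
    payload
        .split('\t')
        .filter_map(|pair| pair.split_once('='))
        .collect()
}

fn main() {
    let pairs = parse_long_names("Q1=QuestionOne\tSURVE=SurveyResponseText");
    assert_eq!(pairs, vec![("Q1", "QuestionOne"), ("SURVE", "SurveyResponseText")]);
    println!("{pairs:?}");
}
```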

### Data section
After type 999, data is stored as rows of 8-byte slots:
- **Uncompressed:** Raw 8-byte values (f64 for numeric, padded bytes for string)
- **Bytecode compressed:** 8-byte control blocks where each byte is an opcode, followed by raw data for opcode 253 (sketched after this list)
- **Zlib compressed (.zsav):** Block-based zlib, decompressed blocks feed into bytecode decompressor
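A self-contained sketch of the stateful bytecode scheme (opcode meanings as documented for the format; little-endian and the exact struct layout are simplifying assumptions, and the real implementation is `src/compression/bytecode.rs`):

```rust
use std::io::{Cursor, Read, Result};

/// Illustrative stand-in for the stateful decompressor in compression/bytecode.rs.
struct Decompressor<R: Read> {
    inner: R,
    control: [u8; 8],
    idx: usize, // next opcode; 8 forces a control-block refill
    bias: f64,  // compression bias from the file header, usually 100.0
    eof: bool,
}

impl<R: Read> Decompressor<R> {
    /// Yield the next 8-byte slot, or None at opcode 252 (end of data).
    /// State persists across calls: control blocks do not align to rows.
    fn next_slot(&mut self) -> Result<Option<[u8; 8]>> {
        loop {
            if self.eof {
                return Ok(None);
            }
            if self.idx == 8 {
                self.inner.read_exact(&mut self.control)?;
                self.idx = 0;
            }
            let code = self.control[self.idx];
            self.idx += 1;
            match code {
                0 => continue, // padding, ignore
                252 => {
                    self.eof = true;
                    return Ok(None);
                }
                // Literal slot: its 8 raw bytes follow the control block,
                // in opcode order, so a sequential read picks them up.
                253 => {
                    let mut raw = [0u8; 8];
                    self.inner.read_exact(&mut raw)?;
                    return Ok(Some(raw));
                }
                254 => return Ok(Some(*b"        ")), // all-spaces string slot
                255 => return Ok(Some((-f64::MAX).to_le_bytes())), // SYSMIS
                n => return Ok(Some((n as f64 - self.bias).to_le_bytes())),
            }
        }
    }
}

fn main() -> Result<()> {
    // Control block: 105 -> 5.0, literal, SYSMIS, end, then padding.
    let mut stream = vec![105u8, 253, 255, 252, 0, 0, 0, 0];
    stream.extend_from_slice(&7.5f64.to_le_bytes()); // payload for opcode 253
    let mut d = Decompressor {
        inner: Cursor::new(stream),
        control: [0; 8],
        idx: 8,
        bias: 100.0,
        eof: false,
    };
    while let Some(slot) = d.next_slot()? {
        println!("{}", f64::from_le_bytes(slot)); // 5, 7.5, -1.7976931348623157e308
    }
    Ok(())
}
```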

---

## Key Technical Decisions

| Decision | Detail |
|----------|--------|
| **SYSMIS** | `-DBL_MAX` (bits `0xFFEFFFFFFFFFFFFF`), NOT NaN. Maps to Arrow null. |
| **Encoding priority** | Subtype 20 name > subtype 3 code page > default windows-1252 |
| **Bytecode decompressor** | Stateful across rows — control blocks do NOT align to row boundaries |
| **Very long strings** | Subtype 14 declares true width. `n_segments = ceil(width/252)` (sketched after this table). Subsequent named segment variables marked as ghosts. |
| **Arrow types** | Numeric -> Float64 (nullable), String -> Utf8. Date/time stay as Float64 with format in metadata. |
| **Missing values** | SYSMIS -> Arrow null. User-defined missing ranges in metadata only (not nullified), matching pyreadstat behavior. |
| **Endianness** | Detected from header `layout_code`. `SavReader` handles byte-swapping transparently. |
| **No packed structs** | Fields read individually via `io_utils` helpers — safe, handles endian swapping. |
| **Naming** | All metadata fields use `variable_` prefix. IndexMap for O(1) lookup with insertion-order preservation. |
| **Value sorting** | `Value` implements `Ord` — numeric values sort by actual number, not string representation. |
| **Columnar builders** | `ColumnarBatchBuilder` pushes decoded values directly into Arrow `Float64Builder`/`StringBuilder`, skipping the old `Vec<Vec<CellValue>>` intermediate. Reuses a `string_buf: Vec<u8>` across rows. |
| **PyCapsule (no PyArrow)** | `PyArrowData.__arrow_c_stream__()` exports `FFI_ArrowArrayStream` via `PyCapsule`. Polars `pl.from_arrow()` consumes it natively. `pyarrow` is not a dependency. |
| **Lazy IO** | `scan_sav()` uses Polars `register_io_source` with `_SavBatchReader` (wraps `SavScanner`). Supports column projection, row limit, and predicate pushdown. |
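The segment arithmetic behind that very-long-string rule (and bug #1 below); a minimal sketch, with the function wrapper purely illustrative:

```rust
/// Number of named segment variables a very long string spans:
/// each segment carries at most 252 bytes of payload.
fn n_segments(declared_width: usize) -> usize {
    declared_width.div_ceil(252)
}

fn main() {
    assert_eq!(n_segments(252), 1); // fits in one segment, no ghosts
    assert_eq!(n_segments(253), 2); // e.g. SURVE0 plus ghost segment SURVE1
    assert_eq!(n_segments(600), 3); // one visible variable, two ghosts
    println!("ok");
}
```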

---

## Bugs Found & Lessons Learned

1. **Column count wrong (847 vs 677):** Very long string segment variables (e.g. SURVE0, SURVE1) weren't marked as ghosts. Fix: walk forward from each VLS variable and mark the next `n_segments - 1` non-ghost named records as ghosts.

2. **Row count wrong (87 vs 1500):** Bytecode decompressor was stateless — control block state lost between row calls. Fix: made `BytecodeDecompressor` stateful, preserving `pos`, `control_bytes`, `control_idx`, `eof` across `decompress_row()` calls.

3. **SYSMIS is not NaN:** SYSMIS is `-DBL_MAX` (most negative finite double), not a NaN. Test assertions needed to check `is_finite()` and `== -f64::MAX`.
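A tiny check capturing lesson 3, using the bit pattern from the decisions table (the helper name is illustrative):

```rust
/// SYSMIS is -DBL_MAX: the most negative finite f64, not a NaN.
const SYSMIS_BITS: u64 = 0xFFEF_FFFF_FFFF_FFFF;

fn is_sysmis(v: f64) -> bool {
    v.to_bits() == SYSMIS_BITS // bit-exact compare; equals -f64::MAX
}

fn main() {
    assert!(is_sysmis(-f64::MAX));
    assert!((-f64::MAX).is_finite());
    assert!(!is_sysmis(f64::NAN)); // NaN is a different bit pattern entirely
    println!("SYSMIS checks pass");
}
```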

---

## Testing

```bash
cargo test                # 37 unit tests + 2 doc-tests
cargo run -- file.sav     # CLI test with any .sav file
```

### Verified test files
| File | Columns | Rows | Compression |
|------|---------|------|-------------|
| n=3.sav | 393 | 3 | Bytecode |
| n=10.sav | 393 | 10 | Bytecode |
| 251001.sav | 677 | 22,070 | Bytecode |
| rpm_2025_data_final_raw_id.sav | 186 | 1,000 | Bytecode |
| rcg_2025_Bahamas_Freeport_clean.sav | 124 | 211 | Uncompressed |
| DR_Landscape SPSS (AUG 30).sav | 75 | 1,500 | Bytecode |
| 231104.sav | 216 | 2,146 | Bytecode |

---

## Dependencies

### Core Rust library (root crate)

| Crate | Version | Purpose |
|-------|---------|---------|
| `arrow` | 57 | Arrow RecordBatch output |
| `flate2` | 1 | Zlib decompression for .zsav |
| `encoding_rs` | 0.8 | Character encoding conversion |
| `thiserror` | 2 | Error type derivation |
| `rayon` | 1 | Parallel row/column processing |
| `indexmap` | 2 | Insertion-ordered maps for metadata |

### Python bindings (feature-gated in same crate)

| Crate | Version | Purpose |
|-------|---------|---------|
| `pyo3` | 0.26 | Python extension module |
| `arrow` | 57 (ffi feature) | Arrow PyCapsule / FFI_ArrowArrayStream export |

### Python dependencies

| Package | Version | Purpose |
|---------|---------|---------|
| `polars` | >=1.3 | DataFrame output (consumes PyCapsules natively) |

---

## Next Steps

### Milestone 3: SAV/ZSAV Writer
- Write `RecordBatch` + `SpssMetadata` back to `.sav`/`.zsav` files
- Reverse the data flow: Arrow -> rows -> bytecode/zlib compression -> binary
- Update Python bindings with `write_sav()` function

### Future: Additional Format Support
- ambers is designed to grow beyond SPSS — the name and architecture support adding readers/writers for other statistical formats (Stata .dta, SAS .sas7bdat, etc.)

### Future Improvements
- Arrow temporal types for DATE, TIME, DATETIME formats (currently Float64)
- `columns` / `row_limit` params on Python `read_sav()`
- Real `.zsav` file testing

---

## SPSS Ecosystem References

When working on SPSS-related Rust or Python code (ambers, ultrasav, rimpy, or any .sav file tooling), reference these upstream repos as needed:

### Core Libraries

- **readstat** (C library) — https://github.com/WizardMac/ReadStat
  The foundational C library for reading/writing SPSS, Stata, and SAS files. Defines the canonical SPSS binary format parsing. pyreadstat wraps this via Cython.
- **pyreadstat** (Python/Cython) — https://github.com/Roche/pyreadstat
  Python wrapper around readstat. Our primary comparison target for metadata correctness. When debugging metadata discrepancies, check pyreadstat's Cython layer and readstat's C source.
- **polars_readstat** (Python API) — https://github.com/jrothbaum/polars_readstat
  Python API for reading SPSS/Stata/SAS files into Polars DataFrames. Built on the Rust core below.
- **polars_readstat_rs** (Rust core) — https://github.com/jrothbaum/polars_readstat_rs
  Rust-based SPSS/Stata/SAS reader. Useful reference for Rust SPSS parsing patterns and Polars integration.

### Excel Ecosystem (Rust/Python pattern reference)

- **fastexcel** (Python API) — https://github.com/ToucanToco/fastexcel
  Fast Excel file reader for Python, built on calamine. Good reference for the Rust-core-with-Python-wrapper architecture pattern (similar to what ambers does for SPSS).
- **calamine** (Rust core) — https://github.com/tafia/calamine
  Pure Rust library for reading Excel files (xlsx, xls, ods). The engine behind fastexcel. Reference for idiomatic Rust binary file I/O patterns.

### Polars

- **polars** (Rust/Python) — https://github.com/pola-rs/polars
  The target DataFrame library. The goal is to support both lazy and eager reads into Polars DataFrames, and ideally have ambers' Rust core adopted by Polars for a `pl.read_sav` function (similar to how Polars uses calamine for Excel via fastexcel).
- **register_io_source** — https://docs.pola.rs/api/python/stable/reference/io_plugins/index.html
  Polars plugin API for creating LazyFrames from custom IO sources. Used by `scan_sav()` to expose the Rust batch reader as a lazy source with projection/limit/predicate pushdown. Preferred over `AnonymousScan` for Python-wrapped Rust libraries.

### Arrow PyCapsule Interface

- **Arrow PyCapsule Interface spec** — https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html
  The specification ambers uses to transfer Arrow data from Rust to Python without PyArrow. Key concepts:
  - Producers export data by implementing `__arrow_c_stream__` (for record batches) or `__arrow_c_array__` (for single arrays) dunder methods
  - These methods return a `PyCapsule` wrapping an `FFI_ArrowArrayStream` or `FFI_ArrowArray` C struct
  - Consumers (like Polars `pl.from_arrow()`) call the dunder method and consume the capsule
  - The capsule name MUST be `"arrow_array_stream"` (for streams) or `"arrow_array"` (for arrays)
  - The `__arrow_c_stream__` method signature is `(self, requested_schema=None) -> PyCapsule` — the `requested_schema` parameter must have a default of `None`
  - In Rust: `arrow::ffi_stream::FFI_ArrowArrayStream::new(reader)` creates the FFI struct from a boxed `RecordBatchReader`, then `PyCapsule::new(py, ffi_stream, Some(CString::new("arrow_array_stream").unwrap()))` wraps it
  - This replaces the old `arrow/pyarrow` feature + `to_pyarrow(py)` path which required PyArrow as a runtime dependency
- **Arrow C Data Interface** — https://arrow.apache.org/docs/format/CDataInterface.html
  The underlying C ABI that PyCapsule wraps. Defines `ArrowSchema`, `ArrowArray`, and `ArrowArrayStream` structs for zero-copy cross-language data exchange.
- **Arrow C Stream Interface** — https://arrow.apache.org/docs/format/CStreamInterface.html
  Extension of the C Data Interface for streaming record batches. `FFI_ArrowArrayStream` implements this, providing `get_schema()`, `get_next()`, and `release()` callbacks.