ambers
Pure Rust SPSS .sav/.zsav reader — Arrow-native, zero C dependencies.
Features
- Read `.sav` (bytecode) and `.zsav` (zlib) files
- Arrow `RecordBatch` output: zero-copy to Polars, DataFusion, DuckDB
- Rich metadata: variable labels, value labels, missing values, MR sets, measure levels
- Lazy reader via `scan_sav()`: returns a Polars LazyFrame with projection and row limit pushdown
- No PyArrow dependency: uses the Arrow PyCapsule Interface for zero-copy transfer
- Fast: in the benchmarks below, 1.1–3.6x faster than polars_readstat and 6–16x faster than pyreadstat
- Python + Rust dual API from a single crate
Installation
Python:
Rust:
Quick Start
Python
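```python
import ambers

# Eager read: data + metadata ("survey.sav" is a placeholder path)
df, meta = ambers.read_sav("survey.sav")

# Lazy read: returns a Polars LazyFrame
lf, meta = ambers.scan_sav("survey.sav")
df = lf.select(["Q1"]).head(1000).collect()

# Explore metadata
meta.summary()
meta.describe("Q1")

# Read metadata only (fast, skips data)
meta = ambers.read_sav_metadata("survey.sav")
```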
Rust
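```rust
use ambers::{read_sav, read_sav_metadata};

// Read data + metadata ("survey.sav" is a placeholder path)
let (batch, meta) = read_sav("survey.sav")?;
println!("{} rows, {} cols", batch.num_rows(), batch.num_columns());

// Read metadata only
let meta = read_sav_metadata("survey.sav")?;
println!("{:?}", meta); // assumes the metadata type implements Debug
```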
Metadata API (Python)
| Method | Description |
|---|---|
| `meta.summary()` | Formatted overview: file info, type distribution, annotations |
| `meta.describe("Q1")` | Deep-dive into a single variable (or list of variables) |
| `meta.diff(other)` | Compare two metadata objects, returns `MetaDiff` |
| `meta.label("Q1")` | Variable label |
| `meta.value("Q1")` | Value labels dict |
| `meta.format("Q1")` | SPSS format string (e.g. `"F8.2"`, `"A50"`) |
| `meta.measure("Q1")` | Measurement level (`"nominal"`, `"ordinal"`, `"scale"`) |
| `meta.schema` | Full metadata as a nested Python dict |
All variable-name methods raise KeyError for unknown variables.
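Because unknown variable names raise `KeyError`, code that probes for optional variables may want a small wrapper. The sketch below is illustrative only: `label_or_default` and `_StubMeta` are hypothetical names, and the stub merely mimics the documented `label()` behaviour so the example runs without a `.sav` file.

```python
def label_or_default(meta, name, default=None):
    """Return meta.label(name), or `default` if the variable is unknown."""
    try:
        return meta.label(name)
    except KeyError:
        return default


class _StubMeta:
    """Stand-in for a metadata object; only mimics label() and its KeyError."""

    def __init__(self, labels):
        self._labels = labels

    def label(self, name):
        return self._labels[name]  # raises KeyError for unknown names


# Made-up label, used only to exercise the helper.
stub = _StubMeta({"Q1": "Overall satisfaction"})
print(label_or_default(stub, "Q1"))
print(label_or_default(stub, "Q99", default="(unknown)"))
```

The same pattern should apply to the other variable-name methods, since they are documented to raise `KeyError` in the same way.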
Streaming Reader (Rust)
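```rust
// Path, column names, row limit, and loop body are illustrative placeholders
let mut scanner = scan_sav("survey.sav")?;
scanner.select(&["Q1", "Q2"])?;
scanner.limit(1000);
while let Some(batch) = scanner.next_batch()? {
    println!("batch of {} rows", batch.num_rows());
}
```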
Performance
Eager Read
All results return a Polars DataFrame. Best of 3–5 runs (with warmup) on Windows 11, Python 3.13, 24-core machine.
| File | Size | Rows | Cols | ambers | polars_readstat | pyreadstat | vs prs | vs pyreadstat |
|---|---|---|---|---|---|---|---|---|
| test_1 (bytecode) | 0.2 MB | 1,500 | 75 | < 0.01s | < 0.01s | 0.011s | — | — |
| test_2 (bytecode) | 147 MB | 22,070 | 677 | 0.286s | 0.897s | 3.524s | 3.1x | 12x |
| test_3 (uncompressed) | 1.1 GB | 79,066 | 915 | 0.322s | 1.150s | 4.918s | 3.6x | 15x |
| test_4 (uncompressed) | 0.6 MB | 201 | 158 | 0.002s | 0.003s | 0.012s | 1.5x | 6x |
| test_5 (uncompressed) | 0.6 MB | 203 | 136 | 0.002s | 0.003s | 0.016s | 1.5x | 8x |
| test_6 (uncompressed) | 5.4 GB | 395,330 | 916 | 1.600s | 1.752s | 25.214s | 1.1x | 16x |
- Faster than polars_readstat on every file large enough to time reliably: 1.1–3.6x faster
- 6–16x faster than pyreadstat on the same files (test_1 finishes too quickly to compare)
- No PyArrow dependency — uses Arrow PyCapsule Interface for zero-copy transfer
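The "best of 3–5 runs (with warmup)" methodology can be approximated with a small timing harness. This is a minimal sketch, not the project's actual benchmark script; `best_of` and the stand-in workload are hypothetical, and the workload would be replaced by the reader call being measured.

```python
import time


def best_of(fn, runs=5, warmup=1):
    """Return the fastest wall-clock time in seconds over `runs` calls to fn."""
    for _ in range(warmup):
        fn()  # warm OS and allocator caches; timing is discarded
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; swap in the read call under test.
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 5: {elapsed:.4f}s")
```

Taking the minimum rather than the mean reduces the influence of background noise, which matters most for the sub-10ms files.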
Lazy Read with Pushdown
scan_sav() returns a Polars LazyFrame. Unlike eager reads, it only reads the data you ask for:
| File (size) | Full collect | Select 5 cols | Head 1000 rows | Select 5 + head 1000 |
|---|---|---|---|---|
| test_2 (147 MB, 22K × 677) | 0.903s | 0.363s (2.5x) | 0.181s (5.0x) | 0.157s (5.7x) |
| test_3 (1.1 GB, 79K × 915) | 0.700s | 0.554s (1.3x) | 0.020s (35x) | 0.012s (58x) |
| test_6 (5.4 GB, 395K × 916) | 3.062s | 2.343s (1.3x) | 0.022s (139x) | 0.013s (236x) |
On the 5.4 GB file, selecting 5 columns and 1000 rows completes in 13ms — 236x faster than reading the full dataset.
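The speedup figures in the table are the full-collect time divided by the pushdown time. The sketch below recomputes them from the rounded timings printed above; the dictionary layout and the `speedup` helper are illustrative, and results can differ from the table in the last digit because the published timings are themselves rounded.

```python
# Timings in seconds, copied from the lazy-read table.
timings = {
    "test_2": {"full": 0.903, "select_5": 0.363, "head_1000": 0.181, "both": 0.157},
    "test_3": {"full": 0.700, "select_5": 0.554, "head_1000": 0.020, "both": 0.012},
    "test_6": {"full": 3.062, "select_5": 2.343, "head_1000": 0.022, "both": 0.013},
}


def speedup(full, pushed):
    """How many times faster the pushdown query is than a full collect."""
    return full / pushed


for name, t in timings.items():
    ratios = {
        query: round(speedup(t["full"], secs), 1)
        for query, secs in t.items()
        if query != "full"
    }
    print(name, ratios)
```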