# SpssMetadata Reference
Complete reference for the `SpssMetadata` class — ambers' metadata API for reading, constructing, and writing SPSS .sav/.zsav files.
---
## Quick Start
```python
import ambers as am
# Read metadata from a file
df, meta = am.read_sav("survey.sav")
meta.summary() # formatted overview
meta.describe("Q1") # single variable deep-dive
# Construct metadata from scratch
meta = am.SpssMetadata(
    file_label="Customer Survey 2026",
    variable_labels={"Q1": "Satisfaction", "Q2": "Loyalty"},
    variable_value_labels={"Q1": {1: "Low", 5: "High"}},
    variable_measures={"Q1": "ordinal", "Q2": "nominal"},
)
am.write_sav(df, "output.sav", meta=meta)
# Override existing metadata (immutable — returns new instance)
meta2 = meta.update(file_label="Updated Survey", variable_labels={"Q3": "NPS"})
# Chainable single-field setters
meta3 = (
    am.SpssMetadata()
    .with_variable_labels({"Q1": "Satisfaction"})
    .with_variable_measures({"Q1": "ordinal"})
    .with_file_label("My Survey")
)
```
---
## Core Design Principles
ambers separates **data** (Polars DataFrame) from **metadata** (`am.SpssMetadata`) as two independent objects — mirroring how SPSS itself keeps data and metadata as distinct layers within a `.sav` file.
### 1. Explicit Control
- No hidden magic or automatic transfers between data and metadata
- You know exactly what metadata each variable has
- Clear, predictable behavior — what you set is what gets written
### 2. Clean Separation
- **Data** lives in a Polars DataFrame — filter, join, rename, and transform using Polars' full API
- **Metadata** lives in `SpssMetadata` — labels, formats, missing values, and other SPSS properties
- Neither needs to know about changes to the other
### 3. Flexibility
- Read an SPSS file and get both data + metadata, or metadata only
- Build metadata from scratch for data from any source (CSV, Excel, databases)
- Mix and match — apply metadata from one file to data from another
- Write with no metadata at all — sensible defaults are inferred from the DataFrame
### 4. Easy to Reason About
When working with ambers, the mental model is simple:
- **Transforming data?** Use Polars (`df.filter()`, `df.rename()`, `df.select()`)
- **Updating labels, formats, or SPSS properties?** Use `SpssMetadata` (`.update()`, `.with_*()`)
- **Reading or writing files?** Use `read_sav()` / `write_sav()`
- **Need both together?** Pass `meta=` to `write_sav()` — ambers matches metadata to DataFrame columns by name
---
## Construction
### `SpssMetadata(**kwargs)`
Creates a new metadata object. All parameters are optional keyword arguments.
```python
meta = am.SpssMetadata() # empty (defaults filled at write time)
meta = am.SpssMetadata(file_label="My Survey") # with specific fields
```
### `meta.update(**kwargs) -> SpssMetadata`
Returns a **new** `SpssMetadata` with the given fields merged/replaced. The original is unchanged (immutable).
### Update Logic
Updates work as an **overlay** — your changes are applied on top of existing metadata without destroying anything you didn't explicitly touch.
**Dict fields** (variable_labels, variable_measures, variable_formats, etc.) are keyed by **variable name** (the column name in your DataFrame). They merge at the variable-name level:
- **Existing key** — value is overwritten with your new value
- **New key** — added to the dict
- **Unlisted keys** — preserved as-is from the original
- **`{key: None}`** — removes that key
```python
# Original metadata (from read_sav or constructor)
meta = am.SpssMetadata(
    variable_labels={"Q1": "Satisfaction", "Q2": "Loyalty", "Q3": "NPS"},
)
# Update: overwrite Q1, add Q4, leave Q2 and Q3 untouched
meta2 = meta.update(variable_labels={"Q1": "Overall Satisfaction", "Q4": "Age"})
# Result: {"Q1": "Overall Satisfaction", "Q2": "Loyalty", "Q3": "NPS", "Q4": "Age"}
# Remove Q3's label
meta3 = meta.update(variable_labels={"Q3": None})
# Result: {"Q1": "Satisfaction", "Q2": "Loyalty"}
```
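The overlay rules can be modeled with plain dicts. A minimal sketch of the merge semantics — illustrative only, not ambers' actual implementation:

```python
def overlay(old: dict, new: dict) -> dict:
    """Merge `new` onto `old`: overwrite matching keys, add new keys,
    preserve untouched keys, and drop keys whose new value is None."""
    merged = {**old}
    for key, value in new.items():
        if value is None:
            merged.pop(key, None)  # {key: None} removes the entry
        else:
            merged[key] = value
    return merged

labels = {"Q1": "Satisfaction", "Q2": "Loyalty", "Q3": "NPS"}
print(overlay(labels, {"Q1": "Overall Satisfaction", "Q4": "Age"}))
# {'Q1': 'Overall Satisfaction', 'Q2': 'Loyalty', 'Q3': 'NPS', 'Q4': 'Age'}
print(overlay(labels, {"Q3": None}))
# {'Q1': 'Satisfaction', 'Q2': 'Loyalty'}
```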
**Scalar fields** (`file_label`, `weight_variable`) and **`notes`** are replaced entirely:
```python
meta2 = meta.update(file_label="Updated Survey") # old file_label is gone
meta3 = meta.update(notes=["Wave 2 data"]) # old notes list is replaced
```
### Column Renames
Metadata is keyed by variable name. If you rename a column in your DataFrame, metadata does **not** automatically carry over — there is no tracking or mapping between old and new names.
```python
df, meta = am.read_sav("survey.sav")
# Rename Q1 → satisfaction in the DataFrame
df = df.rename({"Q1": "satisfaction"})
# Metadata still has "Q1" — it does NOT follow the rename
meta.label("Q1") # "Overall Satisfaction" (still there)
meta.label("satisfaction") # KeyError — no metadata for this name
# For the renamed satisfaction column to carry its variable label into the
# output file, you must explicitly provide metadata under the new name
meta = meta.update(variable_labels={"satisfaction": "Overall Satisfaction"})
```
At write time, metadata for columns not present in the DataFrame is silently ignored — the leftover `"Q1"` entry won't cause an error; it simply won't be written.
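If you rename many columns at once, the per-variable dicts can be re-keyed by hand. A hypothetical helper (`rename_keys` is not part of ambers):

```python
def rename_keys(per_var: dict, mapping: dict) -> dict:
    """Re-key a per-variable metadata dict (e.g. variable_labels) to
    follow a DataFrame column rename mapping; unmapped names pass through."""
    return {mapping.get(name, name): value for name, value in per_var.items()}

labels = {"Q1": "Overall Satisfaction", "Q2": "Loyalty"}
print(rename_keys(labels, {"Q1": "satisfaction"}))
# {'satisfaction': 'Overall Satisfaction', 'Q2': 'Loyalty'}
```

Passing the re-keyed dict to `meta.update(variable_labels=...)` merges the new entries on top; the stale `"Q1"` entry remains in the metadata but is ignored at write time.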
### `meta.with_*() -> SpssMetadata`
Chainable single-field convenience methods — syntactic sugar for `update()`. Each `with_*()` call is equivalent to calling `update()` with the corresponding keyword argument; they exist for IDE discoverability and fluent chaining.
| `update()` kwarg | Chainable method | Value type |
| --- | --- | --- |
| `file_label=` | `with_file_label(str)` | `str` |
| `notes=` | `with_notes(str \| list[str])` | `str \| list[str]` |
| `weight_variable=` | `with_weight_variable(str \| None)` | `str \| None` |
| `variable_labels=` | `with_variable_labels(dict)` | `dict[str, str \| None]` |
| `variable_value_labels=` | `with_variable_value_labels(dict)` | `dict[str, dict \| None]` |
| `variable_formats=` | `with_variable_formats(dict)` | `dict[str, str \| None]` |
| `variable_measures=` | `with_variable_measures(dict)` | `dict[str, str \| None]` |
| `variable_display_widths=` | `with_variable_display_widths(dict)` | `dict[str, int \| None]` |
| `variable_alignments=` | `with_variable_alignments(dict)` | `dict[str, str \| None]` |
| `variable_missing_values=` | `with_variable_missing_values(dict)` | `dict[str, dict \| None]` |
| `variable_roles=` | `with_variable_roles(dict)` | `dict[str, str \| None]` |
| `variable_attributes=` | `with_variable_attributes(dict)` | `dict[str, dict \| None]` |
| `mr_sets=` | `with_mr_sets(dict)` | `dict[str, dict \| None]` |
---
## Field Reference
### Readable and Writable Fields
These fields can be set via constructor, `update()`, or `with_*()` methods, and read as properties.
| Field | Type | Example | Constraints |
| --- | --- | --- | --- |
| `file_label` | `str` | `"Customer Survey 2026"` | Any string (truncated to 64 bytes in SAV) |
| `notes` | `list[str]` or `str` | `["Wave 1 data"]` | Any string(s) |
| `weight_variable` | `str \| None` | `"weight"` | Must match a DataFrame column at write time |
| `variable_labels` | `dict[str, str]` | `{"Q1": "Satisfaction"}` | Any string values |
| `variable_value_labels` | `dict[str, dict]` | `{"Q1": {1: "Low", 5: "High"}}` | Keys: int/float/str; Values: str |
| `variable_formats` | `dict[str, str]` | `{"Q1": "F8.2", "name": "A50"}` | SPSS format strings |
| `variable_measures` | `dict[str, str]` | `{"Q1": "ordinal"}` | `"nominal"` / `"ordinal"` / `"scale"` / `"unknown"` |
| `variable_alignments` | `dict[str, str]` | `{"Q1": "right"}` | `"left"` / `"right"` / `"center"` |
| `variable_display_widths` | `dict[str, int]` | `{"Q1": 8}` | Positive integer |
| `variable_roles` | `dict[str, str]` | `{"Q1": "input"}` | `"input"` / `"target"` / `"both"` / `"none"` / `"partition"` / `"split"` |
| `variable_attributes` | `dict[str, dict]` | `{"Q1": {"Source": ["Survey"]}}` | `dict[str, list[str]]` per variable |
| `variable_missing_values` | `dict[str, dict]` | See [Missing Values](#missing-values) | Dict structure validated |
| `mr_sets` | `dict[str, dict]` | See [MR Sets](#mr-sets) | Dict structure validated |
### Read-Only Fields
These fields are populated when reading from a file. They cannot be set directly.
| Field | Type | Example | Notes |
| --- | --- | --- | --- |
| `file_encoding` | `str` | `"UTF-8"` | Character encoding (always UTF-8 for writes) |
| `compression` | `str` | `"bytecode"` | `"none"` / `"bytecode"` / `"zlib"` |
| `creation_time` | `str` | `"2026-02-21 12:38:47"` | Auto-set at write time |
| `number_rows` | `int \| None` | `22070` | From file header |
| `number_columns` | `int` | `677` | From file header |
| `file_format` | `str` | `"sav"` | `"sav"` or `"zsav"` |
| `variable_names` | `list[str]` | `["Q1", "Q2"]` | Ordered column names |
| `variable_storage_widths` | `dict[str, int]` | `{"Q1": 8}` | Computed from format |
| `arrow_data_types` | `dict[str, str]` | `{"Q1": "f64"}` | From Arrow schema |
---
## Convenience Methods
```python
meta.format("Q1") # str | None — SPSS format string (e.g. "F8.2")
meta.measure("Q1") # str | None — measurement level
meta.role("Q1") # str | None — variable role
meta.attribute("Q1") # dict | None — all custom attributes
meta.attribute("Q1", "$@Role") # list[str] | KeyError — specific attribute
meta.summary() # formatted overview of all metadata
meta.describe("Q1") # detailed single-variable view
meta.describe(["Q1", "Q2"]) # multiple variables
meta.diff(other_meta) # compare two metadata objects
```
---
## Write-Time Defaults
When writing with `am.write_sav(df, path, meta=meta)`, any metadata fields not explicitly set are automatically filled from the DataFrame schema:
| Field | Default | Inferred from |
| --- | --- | --- |
| `variable_formats` | `F8.2` (numeric), `A255` (string), `DATE11`, `DATETIME23.2`, `TIME11.2` | Arrow type |
| `variable_measures` | `scale` (numeric/temporal), `nominal` (string/bool) | Arrow type |
| `variable_alignments` | `right` (numeric), `left` (string) | Arrow type |
| `variable_display_widths` | Format width (8 for `F8.2`, 255 for `A255`, etc.) | Format |
| `variable_storage_widths` | 8 (numeric), format width (string) | Format |
| `variable_roles` | `input` | SPSS default |
| `file_label` | `""` (empty) | — |
| `creation_time` | Current UTC time | — |
| `notes` | `[]` (none) | — |
User-set fields are never overwritten. Extra metadata for columns not in the DataFrame is silently ignored.
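The defaults table amounts to a dtype-driven lookup. A sketch of the inference, using illustrative Arrow-style dtype names rather than ambers' internal representation:

```python
def default_format(dtype: str) -> str:
    """Illustrative write-time format defaults, keyed by Arrow-style dtype."""
    if dtype in ("i32", "i64", "f32", "f64"):
        return "F8.2"
    if dtype == "str":
        return "A255"
    if dtype == "date32":
        return "DATE11"
    if dtype.startswith("timestamp"):
        return "DATETIME23.2"
    if dtype.startswith("duration"):
        return "TIME11.2"
    raise ValueError(f"no default for dtype {dtype!r}")

def default_measure(dtype: str) -> str:
    """scale for numeric/temporal columns, nominal for string/bool."""
    return "nominal" if dtype in ("str", "bool") else "scale"

def default_alignment(dtype: str) -> str:
    """right for numeric columns, left for strings."""
    return "left" if dtype == "str" else "right"
```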
---
## Missing Values
SPSS supports three types of missing value specifications for variables.
### Discrete Missing Values
Up to 3 unique values that represent missing data.
```python
# Numeric discrete (most common)
meta = meta.update(variable_missing_values={
    "Q1": {"type": "discrete", "values": [98, 99]},
    "Q2": {"type": "discrete", "values": [97, 98, 99]},  # max 3 values
})
# String discrete (max 8 characters each)
meta = meta.update(variable_missing_values={
    "city": {"type": "discrete", "values": ["N/A", "DK", "RF"]},
})
```
### Range Missing Values (Numeric Only)
A continuous range where all values between `low` and `high` (inclusive) are treated as missing.
```python
meta = meta.update(variable_missing_values={
    "score": {"type": "range", "low": 900, "high": 999},
})
```
### Range + One Discrete Value (Numeric Only)
A range plus one additional discrete value outside the range.
```python
meta = meta.update(variable_missing_values={
    "income": {"type": "range", "low": 999990, "high": 999999, "discrete": 0},
})
```
### Remove Missing Values
```python
meta = meta.update(variable_missing_values={"Q1": None})
```
### Validation Rules
**General:**
- Cannot mix numeric and string values in the same spec — `ValueError` if attempted
- At write time, missing value type is validated against the actual column type (e.g., string missing values on a numeric column raises an error)
**Numeric variables:**
- Discrete: maximum 3 values, must be unique (no duplicates)
- Range: `low` must be less than `high`
- Range + discrete: the discrete value must NOT fall between `low` and `high`
- All values must be numeric (int or float)
**String variables:**
- Discrete only: maximum 3 values, each maximum 8 characters
- **Ranges are NOT available** for string variables — `ValueError` if attempted
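These rules can be restated as a small standalone validator — a sketch of the checks, not ambers' actual code:

```python
def validate_missing(spec: dict, is_string: bool) -> None:
    """Raise ValueError if a missing-value spec violates the rules above."""
    if spec["type"] == "discrete":
        values = spec["values"]
        if len(values) > 3:
            raise ValueError("at most 3 discrete missing values")
        if len(set(values)) != len(values):
            raise ValueError("discrete missing values must be unique")
        kinds = {isinstance(v, str) for v in values}
        if len(kinds) > 1:
            raise ValueError("cannot mix numeric and string values")
        if is_string != (kinds == {True}):
            raise ValueError("missing-value kind does not match column type")
        if is_string and any(len(v) > 8 for v in values):
            raise ValueError("string missing values: max 8 characters each")
    elif spec["type"] == "range":
        if is_string:
            raise ValueError("ranges are not available for string variables")
        if not spec["low"] < spec["high"]:
            raise ValueError("low must be less than high")
        extra = spec.get("discrete")
        if extra is not None and spec["low"] <= extra <= spec["high"]:
            raise ValueError("discrete value must fall outside the range")

# Valid numeric specs pass silently
validate_missing({"type": "discrete", "values": [98, 99]}, is_string=False)
validate_missing({"type": "range", "low": 900, "high": 999}, is_string=False)
```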
---
## Value Labels
Map numeric or string values to human-readable labels.
```python
# Numeric value labels (most common)
meta = meta.update(variable_value_labels={
    "gender": {1: "Male", 2: "Female", 3: "Non-binary"},
    "satisfaction": {1: "Very dissatisfied", 2: "Dissatisfied",
                     3: "Neutral", 4: "Satisfied", 5: "Very satisfied"},
})
# String value labels
meta = meta.update(variable_value_labels={
    "country": {"US": "United States", "UK": "United Kingdom"},
})
# Float keys (SPSS stores all numeric values as float64)
meta = meta.update(variable_value_labels={
    "weight_class": {1.5: "Light", 2.5: "Medium", 3.5: "Heavy"},
})
# Remove value labels for a variable
meta = meta.update(variable_value_labels={"Q1": None})
```
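Value labels control display in SPSS; the stored data remains as codes. If you want labels materialized on the Python side, map them yourself — a sketch with plain lists, not an ambers API:

```python
labels = {1: "Very dissatisfied", 2: "Dissatisfied", 3: "Neutral",
          4: "Satisfied", 5: "Very satisfied"}

def decode(values: list, labels: dict) -> list:
    """Replace coded values with their labels; unknown codes pass through."""
    return [labels.get(v, v) for v in values]

print(decode([1, 3, 5, 99], labels))
# ['Very dissatisfied', 'Neutral', 'Very satisfied', 99]
```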
---
## MR Sets (Multiple Response Sets)
Define groups of variables that represent multiple-response survey questions.
### Multiple Dichotomy
Binary yes/no variables where a specific `counted_value` indicates "selected".
```python
meta = meta.update(mr_sets={
    "brands": {
        "label": "Brands Mentioned",
        "type": "dichotomy",
        "counted_value": "1",
        "variables": ["brand_a", "brand_b", "brand_c"],
    }
})
```
### Multiple Category
Categorical response variables grouped together.
```python
meta = meta.update(mr_sets={
    "hobbies": {
        "label": "Hobbies Selected",
        "type": "category",
        "variables": ["hobby1", "hobby2", "hobby3"],
    }
})
```
### Remove an MR Set
```python
meta = meta.update(mr_sets={"brands": None})
```
### Validation Rules
- **Name**: any valid name (the `$` prefix is automatically added by the SAV format — do not include it yourself or it will be doubled)
- **Type**: must be `"dichotomy"` or `"category"`
- **Variables**: must have at least 2 variables; all must be the same type (all numeric or all string)
- **Dichotomy sets**: `counted_value` is required (cannot be None)
- **Category sets**: `counted_value` is ignored (forced to None internally)
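The rules can be sketched as a standalone check. This is illustrative only — in particular, the same-type check (all numeric or all string variables) needs the DataFrame schema and is omitted here:

```python
def validate_mr_set(name: str, spec: dict) -> dict:
    """Check an MR-set definition against the rules above and
    return a normalized copy (sketch, not ambers' code)."""
    if name.startswith("$"):
        raise ValueError("omit the $ prefix; the SAV writer adds it")
    if spec["type"] not in ("dichotomy", "category"):
        raise ValueError("type must be 'dichotomy' or 'category'")
    if len(spec["variables"]) < 2:
        raise ValueError("an MR set needs at least 2 variables")
    normalized = dict(spec)
    if spec["type"] == "dichotomy":
        if spec.get("counted_value") is None:
            raise ValueError("dichotomy sets require a counted_value")
    else:
        normalized["counted_value"] = None  # ignored for category sets
    return normalized
```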
---
## Variable Attributes
Attach custom key-value metadata to individual variables. Each attribute holds a list of string values.
```python
meta = meta.update(variable_attributes={
    "Q1": {
        "Source": ["Survey", "Online", "Wave 3"],  # multiple values
        "SurveySection": ["Satisfaction"],
    },
    "Q2": {
        "Source": ["Admin"],
    },
})
```
### Remove Attributes
```python
# Remove all attributes for a variable
meta = meta.update(variable_attributes={"Q1": None})
```
### Viewing in SPSS
Custom variable attributes are not shown in SPSS's Variable View by default. To view them:
1. **Variable View**: Go to **View > Customize Variable View**, click **OK** (custom attributes are toggled on by default, but the dialog must be confirmed first)
2. **Syntax**: Run `DISPLAY ATTRIBUTES` to list all custom attributes in the output viewer
---
## SPSS Format Strings
Format strings control how values are displayed in SPSS.
### Numeric Formats
| Format | Description | Example |
| --- | --- | --- |
| `F8.2` | Standard numeric (width.decimals) | `1234.56` |
| `F8.0` | Integer (no decimals) | `1234` |
| `F1.0` | Single digit | `1` |
| `F12.4` | Wide with 4 decimals | `1234.5678` |
### String Formats
| Format | Description | Max length |
| --- | --- | --- |
| `A8` | Short string | 8 chars |
| `A50` | Medium string | 50 chars |
| `A255` | Standard string | 255 chars |
| `A32767` | Very long string | 32767 chars |
### Temporal Formats
| Format | Arrow type | Description |
| --- | --- | --- |
| `DATE11` | `Date32` | Date (e.g. `21-FEB-26`) |
| `DATETIME23.2` | `Timestamp(us)` | Date + time |
| `TIME11.2` | `Duration(us)` | Time duration |
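Every format string decomposes into a type, a width, and optional decimals. A small parser sketch (`parse_format` is illustrative, not part of ambers):

```python
import re

def parse_format(fmt: str) -> tuple:
    """Split an SPSS format string like 'F8.2' or 'A50' into
    (type, width, decimals); decimals default to 0."""
    m = re.fullmatch(r"([A-Z]+)(\d+)(?:\.(\d+))?", fmt)
    if m is None:
        raise ValueError(f"not a valid format string: {fmt!r}")
    ftype, width, decimals = m.groups()
    return ftype, int(width), int(decimals or 0)

print(parse_format("F8.2"))          # ('F', 8, 2)
print(parse_format("A255"))          # ('A', 255, 0)
print(parse_format("DATETIME23.2"))  # ('DATETIME', 23, 2)
```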
---
## Complete Example: From-Scratch Workflow
```python
import ambers as am
import polars as pl
# Build DataFrame from external source
df = pl.read_excel("survey_data.xlsx")
# Build complete metadata
meta = am.SpssMetadata(
    file_label="Customer Satisfaction Survey Q1 2026",
    notes=["Collected via online panel", "N=1000 respondents"],
    weight_variable="design_weight",
    variable_labels={
        "resp_id": "Respondent ID",
        "age": "Age in years",
        "gender": "Gender",
        "q1": "Overall satisfaction",
        "q2": "Likelihood to recommend",
        "design_weight": "Survey design weight",
    },
    variable_value_labels={
        "gender": {1: "Male", 2: "Female", 3: "Non-binary"},
        "q1": {1: "Very dissatisfied", 5: "Very satisfied"},
        "q2": {0: "Not at all likely", 10: "Extremely likely"},
    },
    variable_measures={
        "resp_id": "nominal",
        "age": "scale",
        "gender": "nominal",
        "q1": "ordinal",
        "q2": "ordinal",
        "design_weight": "scale",
    },
    variable_formats={
        "resp_id": "F8.0",
        "age": "F3.0",
        "gender": "F1.0",
        "q1": "F1.0",
        "q2": "F2.0",
        "design_weight": "F8.4",
    },
    variable_missing_values={
        "age": {"type": "discrete", "values": [98, 99]},
        "q1": {"type": "discrete", "values": [8, 9]},
        "q2": {"type": "discrete", "values": [98, 99]},
    },
)
am.write_sav(df, "survey_q1_2026.sav", meta=meta)
```
## Complete Example: Roundtrip with Override
```python
import ambers as am
# Read existing file
df, meta = am.read_sav("input.sav")
# Override specific fields (everything else preserved)
meta = meta.update(
    file_label="Updated Survey — Wave 2",
    variable_labels={"Q1": "Updated Question Label"},
    variable_missing_values={"Q1": {"type": "discrete", "values": [8, 9]}},
)
am.write_sav(df, "output.sav", meta=meta)
```
## Complete Example: Chainable Builder
```python
import ambers as am
meta = (
    am.SpssMetadata()
    .with_file_label("NPS Survey")
    .with_variable_labels({"score": "NPS Score", "segment": "Customer Segment"})
    .with_variable_value_labels({"segment": {1: "Enterprise", 2: "SMB", 3: "Consumer"}})
    .with_variable_measures({"score": "scale", "segment": "nominal"})
    .with_variable_formats({"score": "F2.0", "segment": "F1.0"})
    .with_variable_missing_values({"score": {"type": "discrete", "values": [99]}})
    .with_notes(["Q1 2026 collection"])
)
am.write_sav(df, "nps.sav", meta=meta)
```