# ATLAS — Aggregated Tensor Large Array Store
[](https://github.com/maris-development/atlas/actions/workflows/ci.yaml)
[](https://crates.io/crates/atlas-rust)
[](https://docs.rs/atlas-rust)
[](https://pypi.org/project/atlas-python/)
[](https://maris-development.github.io/atlas/)
[](LICENSE)
A directory-based store for thousands of named datasets, each holding N-dimensional typed arrays. Built on top of the [`array-format`](https://github.com/robinskil/array-format) (`.af`) binary format, with configurable compression (Zstd, LZ4, or none), chunked I/O, and an [`object_store`](https://crates.io/crates/object_store) backend that works on local disk, S3, GCS, Azure Blob, and in-memory.
**Think of it as a "zip" for N-dimensional datasets.** Where a `.zip` bundles many files into one archive, ATLAS gathers many NetCDF / Zarr-style datasets — anything that's a set of named N-dimensional arrays — into a single high-performance collection. But it's more than a bundle: instead of storing each dataset whole, ATLAS lays the data out **variable-first**, so every dataset's `temperature` lives in one file. That makes it cheap to scan or slice a single variable across thousands of datasets at once, while still reading back any individual dataset as a normal NetCDF/xarray-like object.
> **Looking for Python?** The Python bindings live in [`atlas-python/`](atlas-python/) — `pip install atlas-python`, then `import atlas`. They add a NumPy-native API and first-class [xarray](https://docs.xarray.dev) integration. See the [atlas-python README](atlas-python/README.md) for usage.
>
> The rest of this document covers **how the format works and the Rust crate**.
---
## What it does
`atlas` is designed for workloads where you have a large collection of similarly-shaped datasets — such as one dataset per time step, sensor station, or simulation run — and you want to query a single variable (e.g. `temperature`) across all of them efficiently.
Each dataset is a named group of N-dimensional arrays with typed per-dataset attributes. Datasets that share an array name (e.g. every `jan_2024`, `feb_2024`, … all have `temperature`) are stored together in the same physical file, keyed by dataset name inside the file.
```text
my_store/
├── atlas.json ← dataset registry and per-dataset attributes (JSON)
├── temperature/
│ └── data.af ← one ArrayFile holding temperature for every dataset
├── pressure/
│ └── data.af
└── time/
└── data.af
```
---
## Durability model
`atlas.json` is read **once** when the store is opened or created. Every subsequent mutation — `create_dataset`, `define_array`, `set_attribute`, `delete_array`, `delete_dataset` — only touches the in-memory `StoreMeta`; writes to `ArrayFile`s buffer inside the per-array in-memory layer. **Nothing reaches disk until `Atlas::flush()` (or `Atlas::close()`).** Dropping an `Atlas` without flushing abandons every pending in-memory write.
A single `Atlas::flush()` walks every cached `ArrayFile` (writing deltas + stats) and then serialises the in-memory `StoreMeta` to `atlas.json`. This gives one durability boundary for the whole store: N datasets ⇒ one delta file per touched array name (not one per dataset) and one `atlas.json` rewrite (not N).
`DatasetView` is a borrowed handle into the atlas's shared meta — it has no `flush()` of its own.
---
## File format
### `atlas.json`
The registry is a plain JSON file written on `Atlas::flush()` / `Atlas::close()`. It stores:
- **Store version** — for future format upgrades.
- **Dataset names** — the complete list of datasets in the store.
- **Per-dataset attributes** — typed key-value pairs serialized as plain JSON values: bool, 64-bit integer, 64-bit float, UTF-8 string, or an RFC 3339 nanosecond-precision timestamp string (e.g. `"2023-11-15T07:33:20.123456789Z"`). Atlas's `Attr` enum has five variants (`Bool`/`Int64`/`Float64`/`String`/`TimestampNanoseconds`); there are no narrow integer/float types on disk.
- **Array schemas** — per array: dtype, shape, chunk shape, named dimensions, and the codec used when the array was first written.
Because `atlas.json` is human-readable and self-describing, you can inspect or audit the store contents with any JSON tool without needing the library.
### `<array_name>/data.af`
Each array variable gets its own subdirectory with a single `data.af` binary file. The `.af` format (from the `array-format` crate) is a columnar, chunk-oriented binary format:
- **Multiple datasets in one file** — every dataset that owns this variable is stored as a named entry inside the same file.
- **Chunked layout** — arrays are split into chunks of a user-specified shape, so partial reads and writes touch only the relevant blocks.
- **Configurable compression** — each block is compressed with the codec set when the store was created (default: Zstd; also LZ4 and uncompressed). The codec is persisted in `atlas.json` and restored automatically on `open` — no need to pass it again. Block target size is 8 MiB.
- **Per-array fill value** — `define_array` accepts an optional `FillValue` (one of `Bool`/`Int`/`UInt`/`Float`/`String`). Unwritten cells read back as the fill value, and any *written* cell equal to it is counted as a null in the persisted stats (see below). Fill values are stored in the per-array footer; `atlas.json` is not extended.
- **Persisted statistics** — on `Atlas::flush()`, min, max, null count, and row count are computed per array per dataset and stored alongside the data. Cells equal to the array's fill value are tallied in `null_count` and excluded from `min`/`max` (NaN fills match NaN cells by bit pattern). Statistics survive store reopening.
- **In-memory caches** — a 256 MiB decoded block cache and a 64 MiB raw I/O cache sit in front of the object store for repeated reads.
---
## Quick start
```rust
use atlas::{Atlas, Attr, FillValue, StoreConfig};
use ndarray::Array2;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
// Create a new store — codec is persisted to atlas.json
let mut s = Atlas::create_path("/tmp/my_store", StoreConfig::default()).await?;
{
// Create a dataset and write arrays
let mut ds = s.create_dataset("jan_2024").await?;
ds.define_array::<f32>(
"temperature",
vec!["lat".into(), "lon".into()],
vec![8, 16],
Some(vec![4, 8]), // chunk shape
Some(FillValue::Float(f64::NAN)), // returned for unwritten cells; counted as nulls in stats
).await?;
let data = Array2::<f32>::from_elem([8, 16], 20.0).into_dyn();
ds.write_array("temperature", vec![0, 0], data.view()).await?;
ds.set_attribute("month", Attr::UInt32(1));
ds.set_attribute("station", Attr::String("KNMI".into()));
}
// One flush persists atlas.json + every cached array file.
s.flush().await?;
// Reopen — codec is read from atlas.json, no StoreConfig needed
let s2 = Atlas::open_path("/tmp/my_store").await?;
let ds2 = s2.open_dataset("jan_2024").await?;
// Full read
let temp = ds2.read_array::<f32>("temperature", vec![], vec![]).await?.unwrap();
// Partial read — one chunk region
let chunk = ds2.read_array::<f32>("temperature", vec![0, 0], vec![4, 8]).await?.unwrap();
// Query persisted statistics
let stats = ds2.array_stats("temperature").await.unwrap();
println!("rows={} min={:?} max={:?}", stats.row_count, stats.min, stats.max);
Ok(())
}
```
---
## Key concepts
| Concept | Description |
| --- | --- |
| **Store** | The root directory, managed by `Atlas`. |
| **Dataset** | A named group of arrays + typed attributes, accessed via `DatasetView`. |
| **Array** | An N-dimensional typed array with named dimensions and an optional chunk shape. |
| **Attribute** | A typed scalar attached to a dataset (metadata, not array data). |
| **Array file** | One `data.af` file per variable name, shared across all datasets that define that variable. |
| **Flush** | `Atlas::flush()` — the single durability boundary. Persists every cached array file and rewrites `atlas.json` from the in-memory `StoreMeta`. Must be called explicitly (or via `Atlas::close()` / Python's `with atlas:`). |
| **Compact** | `Atlas::compact()` rewrites every cached `.af` file to reclaim space after deletes. |
| **StoreConfig** | Configuration passed to `Atlas::create`. Currently holds the compression `Codec`. |
| **Codec** | Compression codec for new array blocks: `Codec::Zstd` (default), `Codec::Lz4`, or `Codec::Uncompressed`. Persisted in `atlas.json`; `open` reads it automatically. |
---
## Supported dtypes
`bool`, `i8`, `i16`, `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `f32`, `f64`, `TimestampNs`, `String`, `Binary`, `List<T>`, `FixedSizeList<T, N>`
0-D scalar arrays (`shape = []`) are supported for any element type — useful for single-valued metadata variables like a trajectory identifier.
---
## Comparison with NetCDF and Zarr
### Similarities
- All three formats are chunked, N-dimensional array stores.
- All support named dimensions, per-variable metadata, and compression.
- All are designed for scientific/numerical data.
### Layout: the key difference
The critical architectural distinction is how datasets and variables are organized on disk.
**NetCDF / Zarr** use a *dataset-first* layout:
```text
zarr_store/
├── jan_2024/
│ ├── temperature/ ← chunks for temperature
│ └── pressure/ ← chunks for pressure
└── feb_2024/
├── temperature/
└── pressure/
```
Reading `temperature` for 1 000 time steps means opening 1 000 separate directories/files.
**ATLAS** uses a *variable-first* layout:
```text
atlas_store/
├── temperature/
│ └── data.af ← temperature for ALL datasets in one file
└── pressure/
└── data.af
```
Reading `temperature` for 1 000 datasets means opening exactly **one file**. This is the primary design goal of this format.
### Feature comparison
| Feature | NetCDF-4 | Zarr v3 | ATLAS |
| --- | --- | --- | --- |
| Layout | Dataset-first | Dataset-first | Variable-first |
| Compression | Deflate / Zstd / … | Any codec plugin | Zstd / LZ4 / None |
| Chunking | Yes | Yes | Yes |
| Cloud object store | No (needs FUSE/etc) | Yes (native) | Yes (via `object_store`) |
| Multiple datasets in one file | No | No | Yes (all datasets per variable) |
| Metadata format | Binary (HDF5) | JSON | JSON |
| Cross-dataset column scan | Slow (N file opens) | Slow (N directory opens) | Fast (1 file open) |
| Partial reads | Yes | Yes | Yes |
| Statistics (min/max/nulls) | No | No | Yes (persisted on flush) |
| Self-describing metadata | Yes | Yes | Yes (`atlas.json`) |
| Language support | C/Python/Julia/… | Python/Java/… | Rust |
| Mutable after write | Limited | Yes | Yes (chunked overwrites + compact) |
### When to choose ATLAS
- You have many homogeneous datasets (same variable schema, different instances — time steps, stations, runs).
- Your primary query is "give me variable X across all datasets" — a column scan across the dataset dimension.
- You want a simple on-disk layout with no special runtime dependencies and human-readable metadata.
- You are working in Rust with an async runtime and an `object_store`-compatible backend.
### When to choose Zarr or NetCDF
- You need wide ecosystem support (Python, Julia, C libraries, GIS tools).
- Your primary query pattern is "give me all variables for one dataset" (dataset-first access).
- You need hierarchical group nesting beyond two levels.
- You need a codec ecosystem with many options (blosc, gzip, numcodecs, …).
---
## Performance characteristics
### Cross-dataset scans (where this format excels)
Because all datasets share a single `.af` file per variable, scanning `temperature` across N datasets costs:
- **1 file open** regardless of N.
- Sequential reads within one file, which are friendly to OS read-ahead and object-store range requests.
- Decompression only for the blocks actually read.
In a dataset-first format, the same scan requires N file or directory opens, which is bounded by metadata latency, not throughput — especially painful on object stores where each `HEAD`/`GET` has ~10 ms overhead.
### Chunked I/O
Chunk shapes are set per-array at definition time. A chunk shape equal to the full array shape (the default when `chunk_shape` is omitted) gives a single compressed block per dataset entry, minimising overhead for small arrays. Smaller chunks allow fine-grained partial reads and updates at the cost of more blocks.
### In-memory caches
Two caches sit in front of the object store:
| Cache | Default size | What it holds |
| --- | --- | --- |
| Decoded block cache | 256 MiB | Decompressed array chunks, ready for use |
| I/O cache | 64 MiB | Raw compressed bytes from the object store |
The decoded cache means repeated reads of the same chunk cost only a hash-map lookup. Both caches are shared across all `DatasetView`s that open the same `Atlas`.
### Persisted statistics
Min, max, null count, and row count are computed and persisted on every `Atlas::flush()`. Downstream systems can read these statistics from the opened `DatasetView` without touching array data at all — useful for query planning, dashboards, or data-quality checks.
### Compression
The codec is chosen once at store creation via `StoreConfig` and written into `atlas.json`. `Atlas::open` reads it from there, so callers never need to repeat the codec choice. Each array also records its own codec in `atlas.json`, so a store can theoretically hold arrays written with different codecs if the schema is migrated.
| Codec | Trade-off |
| --- | --- |
| `Codec::Zstd` (default) | Best compression ratio; moderate CPU cost |
| `Codec::Lz4` | Faster compression/decompression; larger files |
| `Codec::Uncompressed` | Fastest write path; no size reduction |
Choose LZ4 when decompression throughput matters more than storage size (e.g. large in-memory analytics). Choose Uncompressed for already-compressed data or when profiling shows decompression is the bottleneck.
### Write path
Writes are buffered in-memory across the whole atlas. Calling `Atlas::flush()` compresses and writes every modified block across every cached array file, then rewrites `atlas.json` atomically (a single `PUT`). The write path scales with the number of modified chunks, not the number of datasets, and N consecutive `add_xarray_dataset` / `create_dataset` calls amortise to a single flush.
### Compaction
After deleting arrays or datasets, the underlying `.af` files may retain dead space. Calling `Atlas::compact()` rewrites every cached file in place, reclaiming storage.
---
## Benchmarks
A reproducible comparison against NetCDF (`netCDF4`) and Zarr v3 lives in [`pyatlas/benchmarks/`](pyatlas/benchmarks/). The harness writes the **same** deterministic data through each backend, then measures three things:
1. **Write** — total wall time to populate the collection.
2. **Read slice** — read a fixed slice of every variable from every dataset, summed.
3. **Storage** — total bytes on disk for the backend's directory.
Each backend uses its **canonical "many datasets" layout** rather than a forced apples-to-apples one — `atlas` uses one store with N datasets, `netcdf` uses N separate `.nc` files (the standard CMIP layout, read via `xr.open_mfdataset(parallel=True)`), `zarr` uses N separate `.zarr` stores (also read via `open_mfdataset`). The point is to compare "what's the fastest each library can do for this workload using its idiomatic pattern," not to handicap any of them.
### Workloads
The harness ships two named workloads (`--case`):
| Case | Variables | Per-dataset shape | Typical use |
|---|---|---|---|
| `sensors` (default) | `temperature`, `pressure`, `humidity` (f32/f64/f32) | `(24,)` on `(time,)` | Hourly weather-station fleet — tiny per-dataset I/O, per-dataset overhead dominates |
| `gridded` | same three vars | `(100, 100, 48)` on `(lon, lat, time)` | Geophysical-style grid — actual decompression dominates |
Both workloads run N=1000 datasets by default. Read slice is the first 25% of each dimension (`--slice-fraction 0.25`).
### Results
Numbers are wall-clock seconds on a single laptop (Apple Silicon, AC power, single run — treat as relative not absolute):
#### `--case gridded --datasets 1000` (1.8 GB raw — decompression dominates)
`(100, 100, 48)` per variable × 3 variables × 1000 datasets, chunk
shape `(50, 50, 24)` (8 chunks per variable). All three backends push
the `(0:25, 0:25, 0:12)` slice down to chunk-level reads.
| Backend | Layout / pattern | Read slice (s) | Write (s) | Storage (MiB) |
|---|---|---:|---:|---:|
| **atlas-bulk** | 1 store, `read_array_across_stacked` with slice push-down | **2.12** | 59 | 6387 |
| **atlas + `--use-dask`** | 1 store, dask-threaded per-dataset `view.read_arrays(...)` | **3.21** | 60 | 6387 |
| zarr | 1000 separate stores, `xr.open_mfdataset(parallel=True).isel(...)` | 5.99 | 38 | 6392 |
| atlas (default) | 1 store, serial per-dataset `open_as_xarray_dataset(...).isel(...).load()` | 10.23 | 51 | 6387 |
| netcdf | 1000 `.nc` files, `xr.open_mfdataset(parallel=True).isel(...)` | 13.91 | 122 | 5596 |
#### `--case profile --datasets 1000` (~67 MB raw — per-dataset overhead dominates)
2 variables on `(depth=50, time=168)` — oceanographic-style time × depth cast.
| Backend | Read slice (s) | Write (s) | Storage (MiB) |
|---|---:|---:|---:|
| **atlas-bulk** | **0.08** | 0.77 | 55.5 |
| **atlas (default)** | **0.32** | 0.91 | 55.5 |
| atlas + `--use-dask` | 2.03 | 2.98 | 55.6 |
| netcdf | 4.27 | 2.98 | 62.3 |
| zarr | 4.07 | 11.43 | 61.6 |
This is the workload atlas was structurally designed for. The shared `atlas.msgpack.zst` + one `.af` file per variable means a 1000-dataset open is essentially free, and the read path collapses to "decompress 1000 small blocks from 2 sequential files." Both zarr's 1000-store mfdataset and netcdf's 1000-file mfdataset pay metadata overhead 1000× over.
Note `--use-dask` is *slower* than default atlas here: when per-dataset I/O is tiny, dask scheduler overhead dominates. The rule of thumb flips by workload — dask helps when per-dataset decompression is the bottleneck (gridded), hurts when per-dataset overhead is the bottleneck (profile).
#### `--case sensors --datasets 1000` (tiny per-dataset shapes — atlas's design wins)
| Backend | Read slice (s) | Write (s) | Storage (KiB) |
|---|---:|---:|---:|
| atlas | ~0.02 | ~0.1 | ~80 |
| zarr | ~0.6 | ~2.0 | ~370 |
| netcdf | ~0.25 | ~0.16 | ~730 |
For tiny-shape datasets the comparison is dominated by per-dataset overhead, which is exactly atlas's structural advantage (one store, one metadata file, one physical file per array name).
### What the numbers say
- **Reads on gridded** (decompression-dominated, with realistic chunking + slice push-down): `atlas-bulk` reads in **2.12s vs zarr's 6.00s — 2.8× faster**. `atlas + --use-dask` (using the new `view.read_arrays(...)` fast path) hits **3.21s — 1.9× faster than zarr**. The win comes from atlas's structural advantage (one open of one file per variable, in-memory metadata) combined with APIs that bypass `open_as_xarray_dataset`'s xr.Dataset + per-chunk dask graph overhead. zarr and netcdf both push the slice down through `open_mfdataset(...).isel(...)` via dask graph optimization — they're not penalised by the chunking; atlas just wins by amortising metadata and skipping per-dataset Python overhead.
- **Reads on profile** (overhead-dominated): atlas wins by an order of magnitude. `atlas-bulk` is **~50× faster than zarr** (0.08s vs 4.07s) because the per-dataset I/O is small enough that everything is overhead — and atlas's structural design is built for exactly that case. Default serial `atlas` is still ~12× faster than zarr.
- **Default `atlas.open_as_xarray_dataset` iteration is currently SLOW on chunked storage** (10.23s vs zarr's 6.00s). It goes through `open_as_xarray_dataset(name)` which returns dask-backed arrays per chunk (8 chunks/var × 3 vars × 1000 datasets = 24,000 dask delayed tasks just to build the graph). Dask graph overhead exceeds the parallelism win. **The fix is to use `view.read_arrays(vars, start, shape)` for per-dataset slice reads** — that's what `atlas + --use-dask` does internally and why it hits 3.21s. For cross-dataset reads of the *same* slice from many datasets, prefer `Atlas.open_as_many_xarray_dataset` / `read_array_across_stacked` (the `atlas-bulk` row).
- **Workload sensitivity**: `--use-dask` helps when per-dataset decompression is the bottleneck and *hurts* when per-dataset overhead is the bottleneck (profile: default `atlas` 0.32s → dask 2.03s). Picking the right API for the workload matters more than picking dask vs serial.
- **Writes on gridded**: zarr is fastest (~31s vs atlas's ~50s) — each `to_zarr(...)` just dumps bytes into a per-dataset subtree, while atlas does more bookkeeping per call (schema registration, per-dataset attrs into the shared `atlas.msgpack.zst`, block addressing in the shared array file). Atlas still beats netcdf decisively.
- **Writes on profile**: atlas wins ~12× (0.91s vs zarr's 11.43s, netcdf's 2.98s). At small per-dataset sizes the metadata write dominates, and atlas's amortised single-flush model wins.
- **Storage**: atlas and zarr are essentially tied (both zstd); netcdf wins on the gridded case because zlib happens to compress this particular workload tighter, but loses on profile/sensors because the per-file overhead dominates.
### Caveats
- **Single laptop, single run** — variance can be 10–20%. The benchmark deliberately doesn't drop OS caches between runs (it measures warm-cache repeat-query performance, which is what real analytic workloads see).
- **Local filesystem only** — atlas's biggest potential read win is over a high-latency object store where the metadata-shape difference dominates; that's a different benchmark.
- **`netCDF4` vs `h5netcdf`** — using the C-library binding (the more common production choice).
- Run it yourself: `pip install -e "pyatlas[bench]"` then `python pyatlas/benchmarks/bench_collection.py --case gridded --datasets 1000`. See [`pyatlas/benchmarks/README.md`](pyatlas/benchmarks/README.md) for all flags.
---
## Thread safety
`Atlas` and `DatasetView` are `Send + Sync` and work with the default multi-thread Tokio runtime.
Each physical array file is guarded by a `tokio::sync::RwLock`:
| Operation | Lock held |
| --- | --- |
| `read_array`, `array_stats` | Shared read lock — multiple callers on the same file proceed in parallel |
| `write_array`, `define_array`, `delete_array`, `Atlas::flush`, `Atlas::compact` | Exclusive write lock — serialised against all other access |
The store-level array-file cache map uses a `parking_lot::RwLock` that is never held across an `await` point. The shared `StoreMeta` uses a separate `parking_lot::Mutex` and is only mutated through short, lock-release-await-free critical sections (so `list_datasets`, `dataset_exists`, etc. remain synchronous).
---
## License
See [LICENSE](LICENSE).