atlas-rust 0.9.0

Directory-based store for thousands of N-dimensional datasets local or remote using object storage.
Documentation

ATLAS — Aggregated Tensor Large Array Store

A directory-based store for thousands of named datasets, each holding N-dimensional typed arrays. Built on top of the array-format (.af) binary format, with configurable compression (Zstd, LZ4, or none), chunked I/O, and an object_store backend that works on local disk, S3, GCS, Azure Blob, and in-memory.


What it does

atlas is designed for workloads where you have a large collection of similarly-shaped datasets — such as one dataset per time step, sensor station, or simulation run — and you want to query a single variable (e.g. temperature) across all of them efficiently.

Each dataset is a named group of N-dimensional arrays with typed per-dataset attributes. Datasets that share an array name (e.g. every jan_2024, feb_2024, … all have temperature) are stored together in the same physical file, keyed by dataset name inside the file.

my_store/
├── atlas.json               ← dataset registry and per-dataset attributes (JSON)
├── temperature/
│   └── data.af         ← one ArrayFile holding temperature for every dataset
├── pressure/
│   └── data.af
└── time/
    └── data.af

Durability model

atlas.json is read once when the store is opened or created. Every subsequent mutation — create_dataset, define_array, set_attribute, delete_array, delete_dataset — only touches the in-memory StoreMeta; writes to ArrayFiles buffer inside the per-array in-memory layer. Nothing reaches disk until Atlas::flush() (or Atlas::close()). Dropping an Atlas without flushing abandons every pending in-memory write.

A single Atlas::flush() walks every cached ArrayFile (writing deltas + stats) and then serialises the in-memory StoreMeta to atlas.json. This gives one durability boundary for the whole store: N datasets ⇒ one delta file per touched array name (not one per dataset) and one atlas.json rewrite (not N).

DatasetView is a borrowed handle into the atlas's shared meta — it has no flush() of its own.


File format

atlas.json

The registry is a plain JSON file written on Atlas::flush() / Atlas::close(). It stores:

  • Store version — for future format upgrades.
  • Dataset names — the complete list of datasets in the store.
  • Per-dataset attributes — typed key-value pairs serialized as plain JSON values: bool, 64-bit integer, 64-bit float, UTF-8 string, or an RFC 3339 nanosecond-precision timestamp string (e.g. "2023-11-15T07:33:20.123456789Z"). Atlas's Attr enum has five variants (Bool/Int64/Float64/String/TimestampNanoseconds); there are no narrow integer/float types on disk.
  • Array schemas — per array: dtype, shape, chunk shape, named dimensions, and the codec used when the array was first written.

Because atlas.json is human-readable and self-describing, you can inspect or audit the store contents with any JSON tool without needing the library.

<array_name>/data.af

Each array variable gets its own subdirectory with a single data.af binary file. The .af format (from the array-format crate) is a columnar, chunk-oriented binary format:

  • Multiple datasets in one file — every dataset that owns this variable is stored as a named entry inside the same file.
  • Chunked layout — arrays are split into chunks of a user-specified shape, so partial reads and writes touch only the relevant blocks.
  • Configurable compression — each block is compressed with the codec set when the store was created (default: Zstd; also LZ4 and uncompressed). The codec is persisted in atlas.json and restored automatically on open — no need to pass it again. Block target size is 8 MiB.
  • Per-array fill valuedefine_array accepts an optional FillValue (one of Bool/Int/UInt/Float/String). Unwritten cells read back as the fill value, and any written cell equal to it is counted as a null in the persisted stats (see below). Fill values are stored in the per-array footer; atlas.json is not extended.
  • Persisted statistics — on Atlas::flush(), min, max, null count, and row count are computed per array per dataset and stored alongside the data. Cells equal to the array's fill value are tallied in null_count and excluded from min/max (NaN fills match NaN cells by bit pattern). Statistics survive store reopening.
  • In-memory caches — a 256 MiB decoded block cache and a 64 MiB raw I/O cache sit in front of the object store for repeated reads.

Quick start

use atlas::{Atlas, Attr, FillValue, StoreConfig};
use ndarray::Array2;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a new store — codec is persisted to atlas.json
    let mut s = Atlas::create_path("/tmp/my_store", StoreConfig::default()).await?;

    {
        // Create a dataset and write arrays
        let mut ds = s.create_dataset("jan_2024").await?;
        ds.define_array::<f32>(
            "temperature",
            vec!["lat".into(), "lon".into()],
            vec![8, 16],
            Some(vec![4, 8]),         // chunk shape
            Some(FillValue::Float(f64::NAN)),  // returned for unwritten cells; counted as nulls in stats
        ).await?;

        let data = Array2::<f32>::from_elem([8, 16], 20.0).into_dyn();
        ds.write_array("temperature", vec![0, 0], data.view()).await?;

        ds.set_attribute("month", Attr::UInt32(1));
        ds.set_attribute("station", Attr::String("KNMI".into()));
    }

    // One flush persists atlas.json + every cached array file.
    s.flush().await?;

    // Reopen — codec is read from atlas.json, no StoreConfig needed
    let s2 = Atlas::open_path("/tmp/my_store").await?;
    let ds2 = s2.open_dataset("jan_2024").await?;

    // Full read
    let temp = ds2.read_array::<f32>("temperature", vec![], vec![]).await?.unwrap();

    // Partial read — one chunk region
    let chunk = ds2.read_array::<f32>("temperature", vec![0, 0], vec![4, 8]).await?.unwrap();

    // Query persisted statistics
    let stats = ds2.array_stats("temperature").await.unwrap();
    println!("rows={} min={:?} max={:?}", stats.row_count, stats.min, stats.max);

    Ok(())
}

Key concepts

Concept Description
Store The root directory, managed by Atlas.
Dataset A named group of arrays + typed attributes, accessed via DatasetView.
Array An N-dimensional typed array with named dimensions and an optional chunk shape.
Attribute A typed scalar attached to a dataset (metadata, not array data).
Array file One data.af file per variable name, shared across all datasets that define that variable.
Flush Atlas::flush() — the single durability boundary. Persists every cached array file and rewrites atlas.json from the in-memory StoreMeta. Must be called explicitly (or via Atlas::close() / Python's with atlas:).
Compact Atlas::compact() rewrites every cached .af file to reclaim space after deletes.
StoreConfig Configuration passed to Atlas::create. Currently holds the compression Codec.
Codec Compression codec for new array blocks: Codec::Zstd (default), Codec::Lz4, or Codec::Uncompressed. Persisted in atlas.json; open reads it automatically.

Supported dtypes

bool, i8, i16, i32, i64, u8, u16, u32, u64, f32, f64, TimestampNs, String, Binary, List<T>, FixedSizeList<T, N>

0-D scalar arrays (shape = []) are supported for any element type — useful for single-valued metadata variables like a trajectory identifier.


Comparison with NetCDF and Zarr

Similarities

  • All three formats are chunked, N-dimensional array stores.
  • All support named dimensions, per-variable metadata, and compression.
  • All are designed for scientific/numerical data.

Layout: the key difference

The critical architectural distinction is how datasets and variables are organized on disk.

NetCDF / Zarr use a dataset-first layout:

zarr_store/
├── jan_2024/
│   ├── temperature/   ← chunks for temperature
│   └── pressure/      ← chunks for pressure
└── feb_2024/
    ├── temperature/
    └── pressure/

Reading temperature for 1 000 time steps means opening 1 000 separate directories/files.

ATLAS uses a variable-first layout:

atlas_store/
├── temperature/
│   └── data.af        ← temperature for ALL datasets in one file
└── pressure/
    └── data.af

Reading temperature for 1 000 datasets means opening exactly one file. This is the primary design goal of this format.

Feature comparison

Feature NetCDF-4 Zarr v3 ATLAS
Layout Dataset-first Dataset-first Variable-first
Compression Deflate / Zstd / … Any codec plugin Zstd / LZ4 / None
Chunking Yes Yes Yes
Cloud object store No (needs FUSE/etc) Yes (native) Yes (via object_store)
Multiple datasets in one file No No Yes (all datasets per variable)
Metadata format Binary (HDF5) JSON JSON
Cross-dataset column scan Slow (N file opens) Slow (N directory opens) Fast (1 file open)
Partial reads Yes Yes Yes
Statistics (min/max/nulls) No No Yes (persisted on flush)
Self-describing metadata Yes Yes Yes (atlas.json)
Language support C/Python/Julia/… Python/Java/… Rust
Mutable after write Limited Yes Yes (chunked overwrites + compact)

When to choose ATLAS

  • You have many homogeneous datasets (same variable schema, different instances — time steps, stations, runs).
  • Your primary query is "give me variable X across all datasets" — a column scan across the dataset dimension.
  • You want a simple on-disk layout with no special runtime dependencies and human-readable metadata.
  • You are working in Rust with an async runtime and an object_store-compatible backend.

When to choose Zarr or NetCDF

  • You need wide ecosystem support (Python, Julia, C libraries, GIS tools).
  • Your primary query pattern is "give me all variables for one dataset" (dataset-first access).
  • You need hierarchical group nesting beyond two levels.
  • You need a codec ecosystem with many options (blosc, gzip, numcodecs, …).

Performance characteristics

Cross-dataset scans (where this format excels)

Because all datasets share a single .af file per variable, scanning temperature across N datasets costs:

  • 1 file open regardless of N.
  • Sequential reads within one file, which are friendly to OS read-ahead and object-store range requests.
  • Decompression only for the blocks actually read.

In a dataset-first format, the same scan requires N file or directory opens, which is bounded by metadata latency, not throughput — especially painful on object stores where each HEAD/GET has ~10 ms overhead.

Chunked I/O

Chunk shapes are set per-array at definition time. A chunk shape equal to the full array shape (the default when chunk_shape is omitted) gives a single compressed block per dataset entry, minimising overhead for small arrays. Smaller chunks allow fine-grained partial reads and updates at the cost of more blocks.

In-memory caches

Two caches sit in front of the object store:

Cache Default size What it holds
Decoded block cache 256 MiB Decompressed array chunks, ready for use
I/O cache 64 MiB Raw compressed bytes from the object store

The decoded cache means repeated reads of the same chunk cost only a hash-map lookup. Both caches are shared across all DatasetViews that open the same Atlas.

Persisted statistics

Min, max, null count, and row count are computed and persisted on every Atlas::flush(). Downstream systems can read these statistics from the opened DatasetView without touching array data at all — useful for query planning, dashboards, or data-quality checks.

Compression

The codec is chosen once at store creation via StoreConfig and written into atlas.json. Atlas::open reads it from there, so callers never need to repeat the codec choice. Each array also records its own codec in atlas.json, so a store can theoretically hold arrays written with different codecs if the schema is migrated.

Codec Trade-off
Codec::Zstd (default) Best compression ratio; moderate CPU cost
Codec::Lz4 Faster compression/decompression; larger files
Codec::Uncompressed Fastest write path; no size reduction

Choose LZ4 when decompression throughput matters more than storage size (e.g. large in-memory analytics). Choose Uncompressed for already-compressed data or when profiling shows decompression is the bottleneck.

Write path

Writes are buffered in-memory across the whole atlas. Calling Atlas::flush() compresses and writes every modified block across every cached array file, then rewrites atlas.json atomically (a single PUT). The write path scales with the number of modified chunks, not the number of datasets, and N consecutive add_xr_dataset / create_dataset calls amortise to a single flush.

Compaction

After deleting arrays or datasets, the underlying .af files may retain dead space. Calling Atlas::compact() rewrites every cached file in place, reclaiming storage.


Benchmarks

A reproducible comparison against NetCDF (netCDF4) and Zarr v3 lives in pyatlas/benchmarks/. The harness writes the same deterministic data through each backend, then measures three things:

  1. Write — total wall time to populate the collection.
  2. Read slice — read a fixed slice of every variable from every dataset, summed.
  3. Storage — total bytes on disk for the backend's directory.

Each backend uses its canonical "many datasets" layout rather than a forced apples-to-apples one — atlas uses one store with N datasets, netcdf uses N separate .nc files (the standard CMIP layout, read via xr.open_mfdataset(parallel=True)), zarr uses N separate .zarr stores (also read via open_mfdataset). The point is to compare "what's the fastest each library can do for this workload using its idiomatic pattern," not to handicap any of them.

Workloads

The harness ships two named workloads (--case):

Case Variables Per-dataset shape Typical use
sensors (default) temperature, pressure, humidity (f32/f64/f32) (24,) on (time,) Hourly weather-station fleet — tiny per-dataset I/O, per-dataset overhead dominates
gridded same three vars (100, 100, 48) on (lon, lat, time) Geophysical-style grid — actual decompression dominates

Both workloads run N=1000 datasets by default. Read slice is the first 25% of each dimension (--slice-fraction 0.25).

Results

Numbers are wall-clock seconds on a single laptop (Apple Silicon, AC power, single run — treat as relative not absolute):

--case gridded --datasets 1000 (1.8 GB raw — decompression dominates)

(100, 100, 48) per variable × 3 variables × 1000 datasets, chunk shape (50, 50, 24) (8 chunks per variable). All three backends push the (0:25, 0:25, 0:12) slice down to chunk-level reads.

Backend Layout / pattern Read slice (s) Write (s) Storage (MiB)
atlas-bulk 1 store, read_array_across_stacked with slice push-down 2.12 59 6387
atlas + --use-dask 1 store, dask-threaded per-dataset view.read_arrays(...) 3.21 60 6387
zarr 1000 separate stores, xr.open_mfdataset(parallel=True).isel(...) 5.99 38 6392
atlas (default) 1 store, serial per-dataset to_xarray(...).isel(...).load() 10.23 51 6387
netcdf 1000 .nc files, xr.open_mfdataset(parallel=True).isel(...) 13.91 122 5596

--case profile --datasets 1000 (~67 MB raw — per-dataset overhead dominates)

2 variables on (depth=50, time=168) — oceanographic-style time × depth cast.

Backend Read slice (s) Write (s) Storage (MiB)
atlas-bulk 0.08 0.77 55.5
atlas (default) 0.32 0.91 55.5
atlas + --use-dask 2.03 2.98 55.6
netcdf 4.27 2.98 62.3
zarr 4.07 11.43 61.6

This is the workload atlas was structurally designed for. The shared atlas.msgpack.zst + one .af file per variable means a 1000-dataset open is essentially free, and the read path collapses to "decompress 1000 small blocks from 2 sequential files." Both zarr's 1000-store mfdataset and netcdf's 1000-file mfdataset pay metadata overhead 1000× over.

Note --use-dask is slower than default atlas here: when per-dataset I/O is tiny, dask scheduler overhead dominates. The rule of thumb flips by workload — dask helps when per-dataset decompression is the bottleneck (gridded), hurts when per-dataset overhead is the bottleneck (profile).

--case sensors --datasets 1000 (tiny per-dataset shapes — atlas's design wins)

Backend Read slice (s) Write (s) Storage (KiB)
atlas ~0.02 ~0.1 ~80
zarr ~0.6 ~2.0 ~370
netcdf ~0.25 ~0.16 ~730

For tiny-shape datasets the comparison is dominated by per-dataset overhead, which is exactly atlas's structural advantage (one store, one metadata file, one physical file per array name).

What the numbers say

  • Reads on gridded (decompression-dominated, with realistic chunking + slice push-down): atlas-bulk reads in 2.12s vs zarr's 6.00s — 2.8× faster. atlas + --use-dask (using the new view.read_arrays(...) fast path) hits 3.21s — 1.9× faster than zarr. The win comes from atlas's structural advantage (one open of one file per variable, in-memory metadata) combined with APIs that bypass to_xarray's xr.Dataset + per-chunk dask graph overhead. zarr and netcdf both push the slice down through open_mfdataset(...).isel(...) via dask graph optimization — they're not penalised by the chunking; atlas just wins by amortising metadata and skipping per-dataset Python overhead.
  • Reads on profile (overhead-dominated): atlas wins by an order of magnitude. atlas-bulk is ~50× faster than zarr (0.08s vs 4.07s) because the per-dataset I/O is small enough that everything is overhead — and atlas's structural design is built for exactly that case. Default serial atlas is still ~12× faster than zarr.
  • Default atlas.to_xarray iteration is currently SLOW on chunked storage (10.23s vs zarr's 6.00s). It goes through to_xarray(name) which returns dask-backed arrays per chunk (8 chunks/var × 3 vars × 1000 datasets = 24,000 dask delayed tasks just to build the graph). Dask graph overhead exceeds the parallelism win. The fix is to use view.read_arrays(vars, start, shape) for per-dataset slice reads — that's what atlas + --use-dask does internally and why it hits 3.21s. For cross-dataset reads of the same slice from many datasets, prefer Atlas.to_xarray_many / read_array_across_stacked (the atlas-bulk row).
  • Workload sensitivity: --use-dask helps when per-dataset decompression is the bottleneck and hurts when per-dataset overhead is the bottleneck (profile: default atlas 0.32s → dask 2.03s). Picking the right API for the workload matters more than picking dask vs serial.
  • Writes on gridded: zarr is fastest (~31s vs atlas's ~50s) — each to_zarr(...) just dumps bytes into a per-dataset subtree, while atlas does more bookkeeping per call (schema registration, per-dataset attrs into the shared atlas.msgpack.zst, block addressing in the shared array file). Atlas still beats netcdf decisively.
  • Writes on profile: atlas wins ~12× (0.91s vs zarr's 11.43s, netcdf's 2.98s). At small per-dataset sizes the metadata write dominates, and atlas's amortised single-flush model wins.
  • Storage: atlas and zarr are essentially tied (both zstd); netcdf wins on the gridded case because zlib happens to compress this particular workload tighter, but loses on profile/sensors because the per-file overhead dominates.

Caveats

  • Single laptop, single run — variance can be 10–20%. The benchmark deliberately doesn't drop OS caches between runs (it measures warm-cache repeat-query performance, which is what real analytic workloads see).
  • Local filesystem only — atlas's biggest potential read win is over a high-latency object store where the metadata-shape difference dominates; that's a different benchmark.
  • netCDF4 vs h5netcdf — using the C-library binding (the more common production choice).
  • Run it yourself: pip install -e "pyatlas[bench]" then python pyatlas/benchmarks/bench_collection.py --case gridded --datasets 1000. See pyatlas/benchmarks/README.md for all flags.

Thread safety

Atlas and DatasetView are Send + Sync and work with the default multi-thread Tokio runtime.

Each physical array file is guarded by a tokio::sync::RwLock:

Operation Lock held
read_array, array_stats Shared read lock — multiple callers on the same file proceed in parallel
write_array, define_array, delete_array, Atlas::flush, Atlas::compact Exclusive write lock — serialised against all other access

The store-level array-file cache map uses a parking_lot::RwLock that is never held across an await point. The shared StoreMeta uses a separate parking_lot::Mutex and is only mutated through short, lock-release-await-free critical sections (so list_datasets, dataset_exists, etc. remain synchronous).


License

See LICENSE.