# array-format
[](https://crates.io/crates/array-format)
[](https://docs.rs/array-format)
[](https://github.com/robinskil/array-format/actions/workflows/ci.yml)
[](LICENSE)
[](https://blog.rust-lang.org/)
`array-format` stores many n-dimensional arrays in a single file. It uses a **delta/overlay architecture**: each flush produces a self-describing sidecar file that stacks on top of the base, recording only the chunks that changed. Reads fall through to older layers for unchanged chunks. Layers can be merged into a single file with `compact`.
## Why this format exists
- Store many arrays in one object or a small set of related files
- Append new arrays and update individual chunks without rewriting the whole file
- Block-level compression (LZ4, Zstd, or none) recorded per block so readers need no configuration
- Chunked or single-chunk layouts with coordinate-addressed reads
- Logical deletes with periodic compaction to reclaim space
- Works with any `object_store`-compatible backend (local filesystem, S3, GCS, Azure)
## Quick start
```rust
use std::sync::Arc;
use array_format::{ArrayFile, FileConfig, Lz4Codec};
use ndarray::Array;
use object_store::{ObjectStore, local::LocalFileSystem};
// Back the file with any object_store backend
let store = Arc::new(LocalFileSystem::new_with_prefix("/data")?) as Arc<dyn ObjectStore>;
// `path` is the base file (must end in `.af`), not a directory. Sidecars and
// stats are written next to it in the same prefix: signal.1.af, signal.stats, …
let path = object_store::path::Path::from("signal.af");
// Create a new file
let mut file = ArrayFile::create(store, path, FileConfig::new(Lz4Codec)).await?;
// Define and write a 1-D f32 array
file.define_array::<f32>("signal", vec!["t".into()], vec![1024], None, None)?;
let data = Array::from_vec(vec![0.0f32; 1024]).into_dyn();
file.write_array("signal", vec![0], data.view()).await?;
// Commit to a sidecar file ({stem}.1.af)
file.flush().await?;
// Read back — any session after open
let result = file.read_array::<f32>("signal", vec![], vec![]).await?;
```
Variable-length types like `String` and `Vec<u8>` use the same methods:
```rust
file.define_array::<String>("labels", vec!["i".into()], vec![100], None, None)?;
let labels = ndarray::arr1(&["alpha".to_string(), "beta".to_string()]).into_dyn();
file.write_array("labels", vec![0], labels.view()).await?;
let out = file.read_array::<String>("labels", vec![], vec![]).await?;
```
## Delta / overlay architecture
Opening a file discovers the base file and all sidecars:
```text
mydata.af ← base (overlay_index 0, written by ArrayFile::create)
mydata.1.af ← first flush
mydata.2.af ← second flush
...
```
Each sidecar is fully self-describing: it contains its own block table and array metadata. On read, layers are walked newest-first; the first layer that has a chunk wins. `compact()` merges all layers into a new base and deletes the sidecars.
```text
Before compact:
mydata.af mydata.1.af mydata.2.af
After compact:
mydata.af (single merged file)
```
## File API
```rust
// On-disk — store: Arc<dyn ObjectStore>, path: object_store::path::Path
ArrayFile::create(store, path, config).await?
ArrayFile::open(store, path, config).await? // base + any sidecars
// In-memory (for testing) — backed by object_store's in-memory backend
ArrayFile::create_memory(config).await? // then flush()/compact() as normal
// Schema
file.define_array::<T>(name, dim_names, shape, chunk_shape, fill_value)?
file.delete(name)? // logical delete (hidden until compact)
file.list_arrays() // Vec<MergedArrayMeta>, deleted excluded
file.num_layers()
// ndarray read / write — works for all ArrayElement types
file.write_array(name, start, ndarray_view).await?
file.read_array::<T>(name, start, shape).await? // vec![], vec![] for full array
// Attributes
file.set_attribute(name, key, AttributeValue::String("m/s".into()))?
file.get_attribute(name, key)?
// Statistics (min/max/null_count/row_count, refreshed on flush & compact)
file.array_stats(name) // Option<&ArrayStats>
// Flush and compact
file.flush().await?
file.compact().await?
```
### `FileConfig`
```rust
FileConfig::new(Lz4Codec) // defaults: 8 MiB blocks, 256 MiB block cache, 64 MiB I/O cache, no shared cache
FileConfig {
codec: ZstdCodec { level: 3 },
block_target_size: 8 * 1024 * 1024,
cache_capacity: 256 * 1024 * 1024,
io_cache_capacity: 64 * 1024 * 1024,
cache: None, // see "Sharing a cache across files"
}
```
### Sharing a cache across files
By default each `ArrayFile` builds its own `DeltaCache` sized by `cache_capacity` and
`io_cache_capacity`. When you open many files, that adds up. Set `config.cache` to a
pre-built `Arc<DeltaCache>` to put every file under one shared byte budget — entries
are keyed by `(file_path, block_id)`, so files do not interfere with each other.
```rust
use std::sync::Arc;
use array_format::{ArrayFile, DeltaCache, FileConfig, Lz4Codec};
let shared = Arc::new(DeltaCache::new(
256 * 1024 * 1024, // decompressed block budget
64 * 1024 * 1024, // raw I/O slab budget (0 to disable)
));
let mut cfg = FileConfig::new(Lz4Codec);
cfg.cache = Some(Arc::clone(&shared));
let file_a = ArrayFile::open(store.clone(), path_a, cfg).await?;
// reuse `shared` for file_b, file_c, ... — all bounded by the same budget
```
When `config.cache` is `Some`, the `cache_capacity` / `io_cache_capacity` fields are
ignored for that file.
## Supported data types
All readable and writable types implement the `ArrayElement` trait:
```rust
pub trait ArrayElement: Clone + Send + Sync + 'static {
const DTYPE: DType;
fn encode_chunk(values: &[Self]) -> Vec<u8>;
fn decode_chunk(bytes: &[u8]) -> Vec<Self>;
fn fill_element(fill: Option<&FillValue>) -> Self;
}
```
### Fixed-width numeric types
Values stored contiguously, little-endian, no per-element headers. Safe zero-copy encode/decode via memcpy.
| `u8`, `u16`, `u32`, `u64` | `UInt8` … `UInt64` |
| `i8`, `i16`, `i32`, `i64` | `Int8` … `Int64` |
| `f32`, `f64` | `Float32`, `Float64` |
### Variable-length types
Stored as an offset buffer: `N+1` u32 LE offsets followed by the concatenated payload.
| `String` | `DType::String` |
| `Vec<u8>` | `DType::Binary` |
Example (`String`, values `["cat", "", "elephant"]`):
```text
offsets (u32 LE): [0, 3, 3, 11]
payload bytes : 63 61 74 65 6C 65 70 68 61 6E 74
c a t e l e p h a n t
```
## Chunked layout
Defining an array with a `chunk_shape` smaller than the full shape tiles the array into a coordinate grid. Each chunk is stored independently and can be updated without touching others.
```rust
file.define_array::<f32>(
"grid",
vec!["x".into(), "y".into()],
vec![4000, 3000], // full shape
Some(vec![1000, 1000]), // chunk shape → 4×3 grid of chunks
Some(FillValue::Float(0.0)),
)?;
```
```text
+--------+--------+--------+
| (1,0) | (1,1) | (1,2) |
+--------+--------+--------+
| (2,0) | (2,1) | (2,2) |
+--------+--------+--------+
| (3,0) | (3,1) | (3,2) |
+--------+--------+--------+
```
`write_array` performs read-modify-write automatically for partial chunk writes. Chunks that haven't been written are filled with the array's fill value on read.
When `chunk_shape` is `None`, the entire array is stored as a single chunk.
## On-disk layout (per delta file)
```text
+-------------------------------+ 0
| [block 1: compressed bytes] |
| ... |
+-------------------------------+
| Footer (rkyv-serialized) |
+-------------------------------+
| footer_size (u64 LE) |
| magic b"ARRF" (4 bytes) |
+-------------------------------+ EOF
```
**Footer contents:**
- `version` — format version (currently `4`)
- `overlay_index` — which layer this file represents (`0` = base)
- `base_file_hint` — stem of the base file
- `blocks` — `Vec<BlockMeta>`: id, file offset, compressed/uncompressed sizes, codec
- `arrays` — `Vec<ArrayMeta>`: name, dtype, shape, chunk_shape, chunk coordinates → `ChunkAddress`, fill_value, deleted flag, attributes
**`ChunkAddress`:**
```text
(block_id: u32, offset: u32, size: u32)
```
Find the block by id, decompress, slice `[offset..offset+size]`.
The footer is serialized with `rkyv`. Reading it is a two-pass operation: first read the 12-byte trailer to get `footer_size`, then read the footer payload.
## Storage
Files are read and written through any
[`object_store`](https://docs.rs/object_store) backend — local filesystem, S3,
GCS, Azure, or its in-memory backend. Pass the `Arc<dyn ObjectStore>` and a base
`path` (ending in `.af`) to `ArrayFile::create` / `ArrayFile::open`; array-format
manages the base file, sidecars, and stats file within that store. There is no
separate storage trait to implement — in-memory use goes through
[`create_memory`](#in-memory-usage), which uses `object_store`'s in-memory backend.
## Compression codecs
| `NoCompression` | Pass-through |
| `Lz4Codec` | Fast, via `lz4_flex` |
| `ZstdCodec { level }` | Level 1–22, default 3 |
The codec is set once in `FileConfig`. Each block records its own codec in the block table, so files can be opened without specifying the codec that was used to write them.
## Deletes and compaction
`file.delete(name)` writes a tombstone to the pending layer. Deleted arrays are excluded from `list_arrays()` and all reads immediately, but their bytes remain on disk until `compact()`.
```rust
file.delete("old_array")?;
file.flush().await?;
// Later: merge all layers into a new base, delete sidecars, reclaim space
file.compact().await?;
assert_eq!(file.num_layers(), 1);
```
## Statistics
Every array carries aggregate statistics, recomputed automatically on `flush()` and `compact()` and persisted to a `{stem}.stats` sidecar (same rkyv + trailer format as the footer, magic `b"ARST"`). On `open()` they are loaded if present; a missing or unreadable stats file is not an error — `array_stats` simply returns `None` until the next flush.
```rust
file.flush().await?; // computes & writes {stem}.stats
if let Some(s) = file.array_stats("signal") {
println!("{:?} .. {:?}", s.min, s.max);
println!("{} of {} are fill/unwritten", s.null_count, s.row_count);
}
```
`ArrayStats` covers all chunks of one array:
| `name` | Array name |
| `min` / `max` | Global min/max across all chunks; `None` for dtypes without ordering |
| `null_count` | Elements equal to the fill value, including positions never written |
| `row_count` | Total element count across all chunks (the product of the array shape) |
`min`/`max` are typed via `StatValue`, which mirrors the dtype families:
```rust
pub enum StatValue {
Int(i64),
UInt(u64),
Float(f64),
Bytes(Vec<u8>), // String / Binary, compared lexicographically
TimestampNs(i64),
}
```
Stats are computed incrementally: a flush only recomputes arrays whose chunks were dirtied in that flush and merges them with the previously stored stats, so unchanged arrays are not re-scanned.
## In-memory usage
`create_memory` backs the file with `object_store`'s in-memory backend, so it
behaves exactly like an on-disk file — same `flush`/`compact` — but keeps
everything in process. Handy for tests and ephemeral pipelines.
```rust
use array_format::{ArrayFile, FileConfig, NoCompression};
let mut file = ArrayFile::create_memory(FileConfig::new(NoCompression)).await?;
file.define_array::<i32>("data", vec!["x".into()], vec![10], None, None)?;
// ... write ...
file.flush().await?;
```
## License
Licensed under the [Apache License, Version 2.0](LICENSE).