datawal 0.1.0-alpha.1

Core record store for datawal: append-only framed records (CRC32C), valid-prefix recovery, bytes-based KV projection with tombstones, manual compaction, and JSONL export. v0.1-pre.

datawal

datawal is a local record store: a framed append-only RecordLog plus an optional last-write-wins DataWal KV projection.

MSRV: Rust 1.75.0

What datawal is

  • RecordLog — the canonical append-only list. Every write becomes a framed, CRC-checked record on disk. Recovery is defined as the longest valid prefix: a truncated tail is reported but not fatal; a mid-stream CRC error in a closed segment is a hard error.
  • DataWal — a KV projection derived from the log. Keys are bytes; values are bytes. Last-write-wins. Delete leaves a tombstone. Reopen rebuilds the keydir from scratch by replaying the log.
  • Bytes-first. The Rust core does not parse JSON, MessagePack, or any semantic encoding. It stores and returns opaque byte slices.
  • Clean export. export_jsonl writes the live key/value state to a JSONL file (base64-encoded keys and values) via an atomic write.
  • FS plumbing in a sibling crate. Atomic POSIX primitives (write_atomic, write_once, write_append_fsync, rename_atomic, fsync_dir) live in safeatomic-rs.
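
The longest-valid-prefix rule described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not datawal's implementation: the frame layout (4-byte little-endian length, 4-byte little-endian CRC-32C, payload) and the names encode_frame, recover, and Tail are hypothetical, and the real byte layout is the one documented in docs/canon.md. The sketch also does not model segments, so it reports a CRC mismatch as a separate outcome rather than deciding whether it is fatal.

```rust
// Hypothetical frame layout, for illustration only (NOT datawal's wire format):
// [payload_len: u32 LE][crc32c(payload): u32 LE][payload bytes]

/// Bitwise CRC-32C (Castagnoli). 0x82F63B78 is the bit-reflected form of 0x1EDC6F41.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Frame one payload as length + checksum + bytes.
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

fn read_u32_le(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// How the scan ended.
#[derive(Debug, PartialEq)]
enum Tail {
    /// Every byte belonged to a complete, checksum-valid frame.
    Clean,
    /// The buffer ends inside a frame; bytes before `valid_len` are valid.
    Truncated { valid_len: usize },
    /// A complete frame at `offset` failed its checksum.
    Corrupt { offset: usize },
}

/// Return the payloads of the longest valid prefix, plus how the scan ended.
fn recover(buf: &[u8]) -> (Vec<Vec<u8>>, Tail) {
    let mut records = Vec::new();
    let mut pos = 0;
    while pos < buf.len() {
        if buf.len() - pos < 8 {
            return (records, Tail::Truncated { valid_len: pos });
        }
        let len = read_u32_le(&buf[pos..]) as usize;
        let crc = read_u32_le(&buf[pos + 4..]);
        let start = pos + 8;
        if buf.len() - start < len {
            return (records, Tail::Truncated { valid_len: pos });
        }
        let payload = &buf[start..start + len];
        if crc32c(payload) != crc {
            return (records, Tail::Corrupt { offset: pos });
        }
        records.push(payload.to_vec());
        pos = start + len;
    }
    (records, Tail::Clean)
}
```

Feeding recover two whole frames followed by the first few bytes of a third returns both payloads and Tail::Truncated, with valid_len pointing at the end of the second frame. A caller could truncate the file to that length before appending again.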

When to use

  • You are manually appending JSONL and a crash truncating the file mid-record would be a problem.
  • You need a tiny local key/value store with last-write-wins semantics and no external process or network.
  • You need audit logs, checkpoint logs, or event logs for experiments, agents, crawlers, CLIs, or local daemons.
  • You want a file-based log format that is documented down to the byte level, with frozen wire-format fixtures and TLA+ invariants for the recovery protocol.
  • You want to be able to open the log, scan it, and understand exactly what is on disk — no opaque internal formats.

When not to use

  • SQL, joins, secondary indexes, or range queries.
  • A cache with TTL or eviction.
  • A FIFO queue.
  • Multi-writer or concurrent writers.
  • Distributed or network-attached storage.
  • Large object / blob / content-addressed storage.
  • DataFrame analytics (use Polars, DuckDB, etc.).
  • A production database (use SQLite, LMDB, RocksDB, etc.).

Current status

datawal is currently v0.1.0-alpha: functional and model-checked at the protocol level, but not production-ready.

It is tagged locally (git tag v0.1.0-alpha), has no remote push, and has not been published to crates.io. It is shelf-ready: correct enough to be shelved and resumed later without rediscovering the protocol.

What is in:

  • 58 tests green (cargo test --workspace).
  • 3 TLA+ models model-checked with TLC 2.19.
  • Wire-format corpus: 6 binary fixture directories, 11 corpus tests.
  • 4 runnable examples.
  • Real CRC-32C (Castagnoli, 0x1EDC6F41) per record, pinned by a known-vector test.
  • fs2 fd-based advisory lock: held by a file descriptor, not by the existence of the sentinel file. Released on Drop / process exit. A stale .lock from a crashed previous process is not a problem.
  • Durability boundary is explicit: append produces a framed, recoverable record but does not guarantee durability across a crash. Call RecordLog::fsync() to make appended records durable (sync_all on the active segment plus fsync_dir on the containing directory).
  • compact_to(out_dir) only — no in-place compact().
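
To make the durability boundary concrete, here is a minimal std-only sketch of the same two-step idea: append without syncing, then sync the file and its directory at a chosen point. The function names and the segment file name passed in by the caller are hypothetical. datawal itself uses the primitives from safeatomic-rs, which this sketch does not call.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Append bytes to a segment file. The data may still sit in OS caches
/// when this returns, so it is not guaranteed to survive a crash.
fn append_unsynced(dir: &Path, segment: &str, frame: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join(segment))?;
    file.write_all(frame)
}

/// On Unix, fsync the directory so the segment's directory entry is durable too.
#[cfg(unix)]
fn sync_dir(dir: &Path) -> io::Result<()> {
    std::fs::File::open(dir)?.sync_all()
}

/// Directory fsync is skipped on other platforms in this sketch.
#[cfg(not(unix))]
fn sync_dir(_dir: &Path) -> io::Result<()> {
    Ok(())
}

/// Durability boundary: flush the segment's data and metadata, then the directory.
fn sync_segment(dir: &Path, segment: &str) -> io::Result<()> {
    OpenOptions::new()
        .append(true)
        .open(dir.join(segment))?
        .sync_all()?;
    sync_dir(dir)
}
```

Keeping the sync as a separate call lets a caller batch several appends behind one fsync, at the cost of losing the unsynced tail if the process or machine crashes in between.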

What is not in:

  • Python / PyO3 bindings.
  • Content-addressed storage / blob / dedup / CAS.
  • Compression.
  • Server or multi-user access.
  • Multi-writer.
  • Query / secondary indexes.
  • In-place compaction.
  • Reader API / concurrent reads.

Quick start

use datawal::{RecordLog, DataWal};
use std::path::Path;

// --- RecordLog ---
let path = Path::new("/tmp/my-log");
let mut log = RecordLog::open(path)?;
log.append(b"one")?;
log.append(b"two")?;
log.fsync()?;                          // durability boundary

let records = log.scan()?;
assert_eq!(records[0].payload, b"one");
assert_eq!(records[1].payload, b"two");

// --- DataWal ---
let path = Path::new("/tmp/my-kv");
let mut db = DataWal::open(path)?;
db.put(b"a", b"1")?;
db.put(b"a", b"2")?;                  // last-write-wins
assert_eq!(db.get(b"a")?, Some(b"2".to_vec()));

db.delete(b"b")?;
assert_eq!(db.get(b"b")?, None);

db.compact_to(Path::new("/tmp/my-kv-compacted"))?;
db.export_jsonl(Path::new("/tmp/my-kv.jsonl"))?;
# Ok::<(), anyhow::Error>(())
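
For readers who want a feel for what an export like export_jsonl involves, the following std-only sketch writes a key/value map as JSONL with base64-encoded keys and values, using a temporary file plus rename so readers never see a half-written export. It is an illustration under assumptions: the field names "key" and "value", the ".tmp" suffix, and the function names are hypothetical, and the actual output schema is whatever datawal documents.

```rust
use std::collections::BTreeMap;
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 (RFC 4648) with '=' padding.
fn base64(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into 24 bits, zero-filling missing bytes.
        let b1 = *chunk.get(1).unwrap_or(&0);
        let b2 = *chunk.get(2).unwrap_or(&0);
        let n = (u32::from(chunk[0]) << 16) | (u32::from(b1) << 8) | u32::from(b2);
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        out.push(if chunk.len() > 1 { ALPHABET[((n >> 6) & 63) as usize] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[(n & 63) as usize] as char } else { '=' });
    }
    out
}

/// Write one JSON object per live key, then atomically move the file into place.
fn export_jsonl_sketch(state: &BTreeMap<Vec<u8>, Vec<u8>>, dest: &Path) -> io::Result<()> {
    // Hypothetical temp name: same path with the extension replaced by "tmp".
    let tmp = dest.with_extension("tmp");
    {
        let mut file = File::create(&tmp)?;
        for (key, value) in state {
            // Base64 output needs no JSON escaping.
            writeln!(file, "{{\"key\":\"{}\",\"value\":\"{}\"}}", base64(key), base64(value))?;
        }
        file.sync_all()?; // contents reach disk before the rename publishes them
    }
    fs::rename(&tmp, dest)
}
```

Base64 keeps arbitrary bytes representable in JSON, which fits the bytes-first design: the exporter never has to guess whether a key or value is valid UTF-8.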

Evidence stack

The protocol has been validated at multiple levels:

Layer                 Evidence
Specification         docs/canon.md — 14 binding clauses; byte layout
Code                  crates/datawal-core/src/ — ~1900 LOC Rust
Unit + integration    58 tests across tests/*.rs and embedded #[test]s
Wire-format corpus    6 binary fixture dirs, 11 corpus tests
Formal models         3 TLA+ models, model-checked with TLC 2.19
Runnable examples     4 examples under crates/datawal-core/examples/

Formal models wording: model-checked under documented assumptions. Not "formally verified". Models do not check the Rust implementation. See formal/README.md for invariants and how to run TLC.

Layout

datawal/
├── Cargo.toml             # workspace
├── crates/
│   └── datawal-core/
│       ├── src/
│       │   ├── lib.rs
│       │   ├── format.rs           # wire format, encode/decode, CRC, limits
│       │   ├── segment.rs          # segment naming and listing
│       │   ├── lock.rs             # fs2 fd-based advisory lock
│       │   ├── record_log.rs       # RecordLog
│       │   └── datawal.rs          # DataWal KV
│       ├── examples/
│       │   ├── record_log_demo.rs
│       │   ├── datawal_kv_demo.rs
│       │   ├── tail_recovery_demo.rs
│       │   └── gen_corpus.rs       # regenerate tests/corpus/* (run-on-demand)
│       └── tests/
│           ├── record_log.rs       # 14 cases
│           ├── datawal.rs          # 9 cases
│           ├── integration.rs      # 3 cases
│           ├── corpus_fixtures.rs  # 11 cases over the frozen corpus
│           └── corpus/             # binary fixtures, one subdir per fixture
├── formal/                         # TLA+ models (checked with TLC)
│   ├── RecordLog.tla
│   ├── KeydirProjection.tla
│   ├── Compaction.tla
│   ├── *.cfg
│   └── reports/                    # most recent TLC output per model
├── docs/                           # canon, technical decisions, roadmap, related work
└── dev/                            # gitignored; internal notes only

safeatomic-rs lives at ../safeatomic-rs/ and is not part of this workspace.

Running

cargo fmt --all
cargo check --workspace
cargo test --workspace
cargo run -p datawal --example record_log_demo
cargo run -p datawal --example datawal_kv_demo
cargo run -p datawal --example tail_recovery_demo
cargo doc --workspace --no-deps

Formal models

Three small TLA+ models live under formal/ and are checked with TLC 2.19+:

  • RecordLog.tla — append / fsync / crash; durable is a monotonic prefix.
  • KeydirProjection.tla — last-write-wins keydir from a put/del log.
  • Compaction.tla — compact_to preserves the live projection.

The models are model-checked under documented assumptions. They are not "formally verified" and do not check the Rust implementation. See formal/README.md.
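
The two projection properties named above (last-write-wins replay, and compaction preserving the live projection) can be stated as a small executable sketch. This is an illustration in plain Rust, not a translation of the TLA+ models and not datawal's code: the Op enum and the project and compact functions are hypothetical names.

```rust
use std::collections::BTreeMap;

/// One logical log entry: a put, or a delete (tombstone).
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Put(Vec<u8>, Vec<u8>),
    Del(Vec<u8>),
}

/// Replay the log in order. Later entries for a key override earlier ones,
/// and a delete removes the key from the live state.
fn project(log: &[Op]) -> BTreeMap<Vec<u8>, Vec<u8>> {
    let mut live = BTreeMap::new();
    for op in log {
        match op {
            Op::Put(key, value) => {
                live.insert(key.clone(), value.clone());
            }
            Op::Del(key) => {
                live.remove(key);
            }
        }
    }
    live
}

/// Produce a new log holding exactly one put per live key.
/// Overwritten values and tombstoned keys are dropped.
fn compact(log: &[Op]) -> Vec<Op> {
    project(log)
        .into_iter()
        .map(|(key, value)| Op::Put(key, value))
        .collect()
}
```

The property to check is that project(&compact(&log)) equals project(&log) for any log. A property-based test over random put/delete sequences would exercise the same idea on the Rust side, while the TLA+ models check it at the protocol level.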

Wire-format corpus

crates/datawal-core/tests/corpus/ contains binary fixtures that freeze the v0.1 on-disk format. Regenerate only when the format changes intentionally:

cargo run -p datawal --example gen_corpus

See crates/datawal-core/tests/corpus/README.md.

Related projects

  • safeatomic-rs — Rust filesystem primitives used by datawal for atomic writes and directory fsyncs.
  • safeatomic — Python package for whole-file persistence with explicit guarantees and runtime diagnostics.

safeatomic is for replacing whole files safely. datawal is for appending recoverable records and deriving local state from them.

See also

  • docs/canon.md — binding decisions and the byte-layout of a record.
  • docs/technical-decisions.md — TD-NNN entries documenting choices.
  • docs/roadmap.md — v0.1.0-alpha scope; what is frozen; next tracks.
  • formal/README.md — the TLA+ models and how to run TLC.

License

Dual-licensed under either of:

  • Apache License, Version 2.0
  • MIT license

at your option.

SPDX-License-Identifier: MIT OR Apache-2.0

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.