neleus-db 0.2.0

Local-first Merkle-DAG database for AI agents with cryptographic proofs and immutable versioning
Documentation
# DESIGN

## 1. Core model

Canonical truth is immutable content-addressed data.

- Blobs (`BlobStore`): raw byte payloads under `blobs/`.
- Objects (`ObjectStore`): typed DAG-CBOR objects under `objects/`.
  - manifests
  - state manifests and segments
  - commits

All identities are hashes of canonical bytes plus domain separators.

## 2. Canonical encoding and hashing

- Hash algorithm: BLAKE3.
- Canonical serialization: DAG-CBOR (`serde_ipld_dagcbor`).
- Domain separators prevent cross-type collisions.

Hash formulas:

- `H_blob = blake3("blob:" || blob_bytes)`
- `H_manifest = blake3("manifest:" || dag_cbor(manifest))`
- `H_state_node = blake3("state_node:" || dag_cbor(state_obj))`
- `H_commit = blake3("commit:" || dag_cbor(commit))`
- `H_state_leaf = blake3("state_leaf:" || leaf_encoding)`
- `H_merkle_node = blake3("merkle_node:" || left_hash || right_hash)`

`src/canonical.rs` includes golden-byte tests that lock deterministic canonical bytes for selected structures.

## 3. State model (segmented persistent map)

State root points to:

```text
StateManifest {
  schema_version,
  segments: [newest ... oldest],
  segments_merkle_root
}
```

Each immutable segment:

```text
StateSegment {
  schema_version,
  entries: sorted unique (key -> value_hash | tombstone),
  merkle_root
}
```

Operations:

- `set(root, key, value)` appends one update segment.
- `del(root, key)` appends one tombstone segment.
- `get(root, key)` scans segments newest to oldest.
- `compact(root)` merges visible keys into one segment and prunes tombstones.

## 4. State proof model

`StateProof` carries a compact manifest commitment view, not full manifest payload:

- `manifest_schema_version`
- `manifest_segment_count`
- `manifest_segments_root`
- `scans[]` for the scanned prefix
- `outcome`

Each scan contains:

- `segment_hash`
- manifest inclusion proof for `segment_hash`
- segment merkle metadata (`segment_merkle_root`, `segment_leaf_count`)
- key proof (inclusion or non-inclusion)
- verdict

Verification flow:

1. Load manifest from `state_root`.
2. Validate commitment fields against loaded manifest.
3. Verify segment inclusion proofs in manifest segment merkle root.
4. Verify key proofs in each segment merkle root.
5. Enforce scan-prefix termination semantics.

Proof limitation (MVP): non-membership remains scan-prefix based, so proof size scales with scanned segments.

## 5. Commit graph and auth hooks

Commit schema:

```text
Commit {
  schema_version,
  parents,
  timestamp,
  author,
  message,
  state_root,
  manifests,
  signature?: CommitSignature,
  payload_hash?: Hash
}
```

`payload_hash` is `hash_typed("commit_payload:", dag_cbor(unsigned_commit))` where `unsigned_commit` is the commit with both `signature` and `payload_hash` cleared. It is recorded on signed commits so verifiers can re-derive what was actually signed and check the cryptographic signature against it.

Extension hooks:

- `trait CommitSigner` — produces a signature over a payload hash supplied by `create_signed_commit`.
- `trait CommitVerifier` — receives `(commit_hash, &Commit, payload_hash)`. The caller (`CommitStore::verify_commit_with`) re-derives the payload hash and checks it matches the value stored on the commit before invoking the verifier, so implementations only need to perform the cryptographic signature check against the supplied hash. Verifiers must reject unsigned commits and commits whose body has been tampered.

`create_signed_commit` uses these hooks without forcing a specific crypto backend.

## 6. Crash safety and WAL recovery

- File writes use temp-file + rename.
- Mutable ops write structured WAL entries first.
- `Database::open` acquires recovery lock, migrates config, replays/rolls back pending WAL.
- Ref WAL entries are replayed.
- Interrupted immutable state mutation WAL entries are rolled back (discarded), because content-addressed objects are immutable.

## 7. Multi-process safety

- Ref writes acquire `refs/.refs.lock`.
- Recovery/migration acquires `meta/recovery.lock`.

This prevents concurrent mutation/recovery races.

## 8. Schema versioning and migration

- DB config has `schema_version` with migration path from legacy v1 format.
- Manifests include `schema_version` and migration helpers.
- State and commits include schema fields and migration defaults.

## 9. Verify-on-read integrity mode

`meta/config.json` includes `verify_on_read`.

When enabled:

- `BlobStore::get` re-hashes bytes and rejects mismatches.
- Typed `ObjectStore` reads re-hash with expected tag and reject mismatches.

## 10. Encryption at rest

Encryption is optional and configured through `EncryptionConfig`.

- Supported AEAD algorithms: `aes-256-gcm`, `chacha20-poly1305`
- KDF: `pbkdf2` (PBKDF2-HMAC-SHA256)
- Randomness source: OS CSPRNG via `getrandom`
- Payload metadata stores algorithm, KDF, salt, nonce, and iterations

Decryption failures are treated as authentication failures (wrong key or tampered ciphertext).

## 11. Derived search indexes (implemented)

Search indexes are derived artifacts stored under:

```text
index/<commit_hash>/search_index.json
```

They are rebuilt from commit manifests and are not canonical source of truth.

### 11.1 Build pipeline

`SearchIndexStore::build_for_head(commit, ...)`:

1. Load commit manifests.
2. Ingest `DocManifest` chunks as semantic documents.
3. Ingest `ChunkManifest` text + embeddings (if present).
4. Build semantic tables:
   - tokenized postings
   - per-document lengths
   - term document frequencies
   - average doc length
5. Persist JSON index and return an index version hash.

### 11.2 Semantic search

- Query mode: BM25-style scoring over chunk text.
- Tokenization: lowercase alphanumeric split.
- Retrieval: top-k by score.

### 11.3 Vector search

- Embeddings are read from chunk embedding blobs.
- Supported embedding blob formats: DAG-CBOR `Vec<f32>`/`Vec<f64>` and JSON float arrays.
- Similarity: cosine on matching-dimension vectors.
- Retrieval: top-k by score.

### 11.4 API surface

`src/index/mod.rs` provides:

- `SearchIndexStore` build/read/search methods
- `SearchHit` result type
- `trait IndexBuilder` compatibility hook

This keeps canonical storage minimal while providing native local semantic and vector search.