roughdb 0.9.0

Embedded persistent key-value store — a LevelDB port to Rust with a compatible on-disk format
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

RoughDB is an embedded key-value store written in Rust, porting LevelDB to Rust. The LevelDB C++ source lives at
`../leveldb/` (primary reference). RocksDB source lives at `../rocksdb/` and should only be consulted for improvements
to existing LevelDB features — not for porting RocksDB-specific features.

## Commands

```bash
cargo test                        # run all tests
cargo test memtable               # run tests matching a path/name
cargo test -- --nocapture         # show println! output
cargo bench                       # run Criterion benchmarks (benches/db.rs)
cargo clippy                      # lint
cargo fmt                         # format (2-space indent, see rustfmt.toml)
```

## Architecture

RoughDB is an LSM-tree key-value store. The logical write path is:

```
Db::put/delete/write
  → WriteBatch (encode ops)
  → WAL (durability: written before ack)
  → Memtable (in-memory skip list, most recent writes)
      ↓ (when full, seal as imm and flush)
  → SSTable L0 (sorted, immutable file)
      ↓ (background compaction)
  → SSTable L1 … L6
```

The read path checks `mem` → `imm` → each level of SSTables (newest to oldest), stopping at the first hit.

### Implemented

**Memtable** (`src/memtable/`)
- Skip list with RocksDB's `InlineSkipList` memory layout: every node is a single arena allocation containing the
  level-0 link, higher-level links prefixed before it (negative indexing), a `u32` payload length, and the
  varint-encoded entry inline. This keeps the hot level-0 link and payload in the same cache line.
- `Splice` caches `prev`/`next` from the last insert for O(1) amortised sequential inserts.
- Lock-free reads (Acquire/Release atomics); writes serialised by the DB-level write mutex in `Db` (no lock inside
  `Memtable` itself — `UnsafeCell<SkipList>` + `unsafe impl Sync`).

**Entry encoding** (`src/memtable/entry.rs`)
- Internal key format: `[klen: varint][key][seq: varint][vtype: u8][vlen: varint][value]`
- `vtype`: `0` = Deletion, `1` = Value.
- Ordering: key ASC, seq DESC — so a lookup key with `seq = u64::MAX` always seeks to the newest version of a user key.
- Allocation-free helpers (`write_value_to`, `write_lookup_to`, etc.) encode directly into arena memory.

**Arena** (`src/memtable/arena/bump.rs`)
- Thin wrapper around `bumpalo::Bump`; `allocate_aligned(size, align)` panics on OOM (matching LevelDB: memtable OOM
  is unrecoverable).
- Unlike LevelDB/RocksDB, which manage explicit 4 KB slabs and bypass them for large allocations (> slab/4), we
  delegate slab management entirely to bumpalo. The large-allocation bypass is bumpalo's internal policy rather than
  ours, but the outcome is equivalent.
- **`memory_usage()`**: `Arena` tracks exact bytes used via an explicit `AtomicUsize` counter (bumpalo's
  `allocated_bytes()` returns chunk *capacity* not bytes used). `Memtable::approximate_memory_usage()` delegates to
  this; `Db::write` compares against `options.write_buffer_size` to trigger L0 flush.  The arena grows unboundedly
  (matching LevelDB — no capacity limit); the flush threshold is the sole governing constraint.

---

## Roadmap

Features are listed in dependency order. Each phase must be complete before the next begins.

### Phase 1 — Core types

*Prerequisite for everything.*

- [x] **`Error`** (`src/error.rs`): `Result`-based error type with variants `NotFound`, `Corruption`, `InvalidArgument`,
  `NotSupported`, `IoError`. `Db::get/put/delete` now return `Result<_, Error>`. See `include/leveldb/status.h`.
- [x] **`WriteBatch`**: Serialises a sequence of Put/Delete operations into a byte buffer. Batch format: `[seq:
  u64][count: u32][records…]` where each record is `[vtype: u8][klen: varint][key][vlen: varint][value]`. Exposes `put`,
  `delete`, `iterate(Handler)`, `approximate_size`. `Clone`-able. `Db::put`/`delete` are thin wrappers over `Db::write`.
  See `include/leveldb/write_batch.h` and `db/write_batch.cc`.
- [x] **`Options` / `ReadOptions` / `WriteOptions`**: Configuration structs. `Options` carries the write-buffer size,
  block size, compression type, filter policy, block cache handle. `WriteOptions` adds a sync flag. `ReadOptions` adds a
  snapshot handle and checksum-verify flag. See `include/leveldb/options.h`.

### Phase 2 — Write-Ahead Log

*Enables crash durability and the correct sequence-number model. Must precede full `Db::Open`.*

**Step 1 — Prerequisites**

- [x] **`crc32c` crate**: Add `crc32c = "..."` to `Cargo.toml`. Used for record header checksums.
- [x] **Sequence number ownership**: Move `last_sequence` from `Memtable` to `Db`. Carried as part of `WriteState`
  (alongside the optional `LogWriter`) inside `write_lock: Mutex<WriteState>` — the type system enforces that both
  are inaccessible without holding the write lock, matching LevelDB's plain `uint64_t` under `mutex_`.
  `Memtable::add`/`delete` gain an explicit `seq: u64` parameter (internal `next_seq` counter removed). `Db::write`
  reads `start_seq = state.last_sequence + 1`, clones and stamps the batch, writes to WAL, iterates passing
  incrementing sequence numbers per record, then advances `state.last_sequence += batch.count()` after all inserts.

**Step 2 — Log primitives**

- [x] **Format** (`src/logfile/format.rs`): 32 KB blocks. Record header: `[crc32c: u32 LE][len: u16 LE][type: u8]` (7
  bytes). Fragment types: `Zero=0` (pad), `Full=1`, `First=2`, `Middle=3`, `Last=4`. Constants: `BLOCK_SIZE = 32768`,
  `HEADER_SIZE = 7`. See `db/log_format.h`.
- [x] **`logfile::Writer`** (`src/logfile/writer.rs`): Wraps a `Box<dyn WritableFile>` (pluggable via `FileSystem` trait).
  `add_record(&[u8])` fragments the payload across block boundaries; pads trailing fewer-than-7-byte remnants
  with zeros; pre-computes per-type CRCs on construction. Constructor: `new(dest: Box<dyn WritableFile>, dest_length: u64)`
  where `dest_length` allows resuming an existing file. See `db/log_writer.h/cc`.
- [x] **`logfile::Reader`** (`src/logfile/reader.rs`): Wraps `File`. `read_record() -> Option<Vec<u8>>` reassembles
  multi-fragment records into a scratch buffer. Optional CRC verification (`checksum: bool`). Reports corruption via a
  `Reporter` trait (one method: `corruption(bytes: u64, reason: &str)`). Resynchronisation: when `initial_offset > 0`,
  skips `Middle`/`Last` fragments until a `Full` or `First` is found. See `db/log_reader.h/cc`.

**Step 3 — Integration**

- [x] **`Db::open(path)`** (`src/lib.rs` or `src/db/mod.rs`): Creates the directory if absent; opens or creates
  `<path>/000001.log`; constructs `Db` with a live `logfile::Writer`. Returns `Result<Db, Error>`. `Db::default()` retained
  for in-memory/test use (no WAL). Full MANIFEST-driven log-number tracking deferred to Phase 9.
- [x] **`Db::write` wired to WAL**: Change signature to `fn write(&self, opts: &WriteOptions, batch: &WriteBatch) ->
  Result<(), Error>`. Under the write lock: read `start_seq`, call `batch.set_sequence(start_seq)`, call
  `log_writer.add_record(batch.contents())`, sync if `opts.sync`, iterate into memtable, advance `last_sequence`.
  The stamp must precede the log write so WAL records carry the correct embedded sequence for recovery replay.
  `Db::put`/`delete` updated to forward a default `WriteOptions`.
- [x] **Recovery** (`Db::open` calls this when WAL exists): Construct a `logfile::Reader` from offset 0; for each record
  decode it as a `WriteBatch` and replay via `batch.iterate(&mut inserter)` using the batch's embedded sequence; after
  all records restore `last_sequence` to the highest sequence seen. Incomplete trailing records (torn write on crash)
  are silently ignored.

### Phase 3 — SSTable format + L0 flush + disk reads

*The on-disk immutable file format, wired end-to-end: memtable flushes to L0 SSTables and `Db::get` falls through to
disk on in-memory misses. Mirrors the approach taken in Phase 2, where the WAL was integrated rather than deferred.*

**SSTable primitives**

- [x] **`Block`** (`src/table/block.rs`): Delta-encoded key-value pairs with restart points every N keys (default 16).
  Each entry: `[shared_len: varint][unshared_len: varint][value_len: varint][key_suffix][value]`. Restart point array at
  block tail enables binary search. See `table/block.h/cc`.
- [x] **`BlockIterator`** (`src/table/block.rs`): Decodes delta-encoded entries on the fly; jumps to restart points for
  seeks using `cmp_internal_keys` (user_key ASC, seq DESC). Needed by `Table::get` to decode data blocks — pulled
  forward from Phase 6. See `table/block.cc`.
- [x] **`BlockBuilder`** (`src/table/block_builder.rs`): Accumulates entries into a `Block`, tracking the last key for
  delta encoding. See `table/block_builder.h/cc`.
- [x] **`BlockHandle` / `Footer`** (`src/table/format.rs`): `BlockHandle` is an (offset, size) pair encoded as two
  varints. Footer is 48 bytes: metaindex handle + index handle + 8-byte magic (`0xdb4775248b80fb57`). Also contains
  `make_internal_key`, `parse_internal_key`, `cmp_internal_keys`. See `table/format.h`.
- [x] **`TableBuilder`** (`src/table/builder.rs`): Sequentially appends sorted key-value pairs; writes data blocks,
  index block, empty metaindex block, and footer. NoCompression only (Phase 3). No filter blocks. Index block uses
  `restart_interval=1`. See `include/leveldb/table_builder.h` and `table/table_builder.cc`.
- [x] **`Table`** (`src/table/reader.rs`): Random-access reader; reads footer + index block on `open`; `get(user_key)`
  searches index block, reads data block via `pread` (`FileExt::read_at`), returns value/tombstone/None.
  See `table/table.cc`.

**Memtable iteration** *(pulled forward from Phase 6 — required for flush)*

- [x] **`SkipList::iter()`**: Forward-only iteration over the skip list (backward deferred to Phase 6). Yields entries
  in internal-key order (user key ASC, sequence DESC) via `SkipListIter<'a>` (walks level-0 links with Acquire loads).
- [x] **`MemTableIterator`**: Thin forward iterator over `SkipList::iter()`, decoding each entry via `Entry::from_slice`
  to expose `ikey()` (SSTable internal key: `user_key || tag`) and `value()` to the flusher. `Entry::key()`/`value()`
  lifetimes refined to `&'a [u8]` so values borrow from the arena, not from the short-lived `Entry` view.

**L0 flush + disk read wiring**

- [x] **`Arena::memory_usage()`**: Tracks bytes via an explicit `AtomicUsize` counter (bumpalo's `allocated_bytes()`
  returns chunk *capacity*, not used bytes). `Memtable::approximate_memory_usage()` delegates to this counter so `Db`
  can compare against `Options::write_buffer_size`.
- [x] **`Db::open` takes `Options`**: Pass `write_buffer_size` (and future options) through to the database.
- [x] **L0 flush** (`Db`): When `mem` exceeds `write_buffer_size` after a write, `std::mem::replace` swaps in a fresh
  `Memtable`, the old one is iterated via `MemTableIterator` and flushed to `<path>/<number>.ldb` via `TableBuilder`,
  then the opened `Table` is prepended to `l0_files: Vec<(u64, Table)>`. `scan_next_file_number` prevents file-number
  collisions across sessions. All mutable state lives in `Mutex<DbState>`.
- [x] **`Db::get` disk fallback**: On miss in `mem`, scan `l0_files` newest-first, calling `Table::get` on each.
  `LookupResult::{Value, Deleted, NotInTable}` lets tombstones stop the search without continuing to older files.
  No multi-level routing yet — deferred to Phase 9.

### Phase 4 — Iterators

*Required by `Db::NewIterator` and internally by compaction. Forward `SkipList` iteration, `MemTableIterator`, and
`BlockIterator` were pulled into Phase 3; this phase completes the forward iterator stack. Backward iteration is
deferred to post-parity as it is not needed for compaction or the core read path.*

- [x] **`TwoLevelIterator`** (`src/table/two_level_iterator.rs`): Composes an index iterator and a block-opener
  function, yielding a transparent view over all blocks in an SSTable. `BlockFn` closure captures a cloned `File`
  handle and opens data blocks on demand. `BlockIter` (renamed from `BlockIterator<'a>`) now owns an `Arc<Vec<u8>>`
  so it can be stored without a lifetime parameter. `InternalIterator` trait in `src/iter.rs` provides the shared
  interface used by `TwoLevelIterator`, `MergingIterator`, and `DbIterator`. See `table/two_level_iterator.h/cc`.
- [x] **`MergingIterator`** (`src/db/merge_iter.rs`): N-way merge of sorted iterators using a heap or linear scan
  (LevelDB uses linear for small N). See `table/merger.h/cc`.
- [x] **`DbIterator`** (`src/db/db_iter.rs`): Wraps `MergingIterator`; applies snapshot sequence-number filtering,
  merges value versions, skips tombstones for the user. Forward-only for now. See `db/db_iter.h/cc`.

### Phase 5 — Version management

*Tracks the set of live SSTable files across levels, enables MANIFEST-based recovery, and gates compaction. Open
`Table` handles are stored as `Arc<Table>` directly in `FileMetaData` — no LRU table cache needed for correctness;
that is a post-parity optimisation.*

- [x] **`FileMetaData`** (`src/db/version_edit.rs`): File number, size, smallest/largest internal key bytes,
  `Option<Arc<Table>>` for the open reader (`Some` for all files in a live `Version`; `None` only during MANIFEST
  replay). `FileMetaData::with_table` is the canonical constructor. See `db/version_edit.h`.
- [x] **`VersionEdit`** (`src/db/version_edit.rs`): Atomic delta — files added/removed per level, log number, next
  file number, last sequence. LevelDB-compatible tag-based binary encoding; `table` field is in-memory only and not
  encoded. `encode`/`decode` with forward-compatible unknown-tag handling. See `db/version_edit.h/cc`.
- [x] **`Version`** (`src/db/version.rs`): Snapshot of the LSM structure — `[Vec<Arc<FileMetaData>>; 7]`.
  L0 newest-first; L1–L6 sorted by smallest key. `get(user_key, seq)` scans L0 then L1–L6 linearly (binary
  search deferred to Phase 6). Reference-counted via `Arc<Version>`. See `db/version_set.h/cc`.
- [x] **`VersionSet`** (`src/db/version_set.rs`): Maintains current `Arc<Version>`, MANIFEST log, and
  file-number/sequence state. `create` writes `MANIFEST-000002` + `CURRENT`; `recover` replays MANIFEST via a
  `Builder` that opens Tables for all live files; `log_and_apply` appends to MANIFEST and installs a new `Version`.
  `DbState` replaces `l0_files`/`next_file_number` with `version_set: Option<VersionSet>`; `Db::open` uses
  MANIFEST-driven recovery with WAL replay filtered to `seq > manifest_last_sequence`. See `db/version_set.h/cc`.

### Phase 6 — Full database

*Completes the user-facing API to match LevelDB's `include/leveldb/db.h` and wires all prior phases into a
crash-safe database with multi-level compaction. MANIFEST-driven recovery and level routing are done (Phase 5);
what remains is compaction, the full Iterator/Snapshot API, and operational hygiene (WAL rotation, file GC).*

**Operational correctness**

- [x] **WAL rotation after flush**: `begin_flush` allocates a new WAL file number and creates the new log file
  under the write lock; `finish_flush` calls `vs.set_log_number(new)` before `log_and_apply` so the MANIFEST
  atomically records the new SST and the new log number; then swaps `state.log` and deletes the old WAL.
  Prevents unbounded WAL growth. See `db/db_impl.cc: DBImpl::MakeRoomForWrite`.
- [x] **`DeleteObsoleteFiles`**: After every flush, `delete_obsolete_files` takes the lock briefly to snapshot the
  live `.ldb` set (`VersionSet::add_live_files`), `log_number`, and `manifest_number`, then releases the lock and
  scans the directory. Keep rules: `.log``log_number`, `MANIFEST-*``manifest_number`, `.ldb` in live set,
  `CURRENT`/`LOCK` always kept. `parse_db_filename` mirrors LevelDB's `ParseFileName`. Called in `Db::write` after
  `finish_flush` drops its lock. See `db/db_impl.cc: DBImpl::RemoveObsoleteFiles`.

**Compaction**

- [x] **L0→L1 compaction**: Triggered when L0 file count ≥ 4 (`L0_COMPACTION_TRIGGER`). `pick_compaction` selects
  all L0 files plus any L1 files whose user-key range overlaps the union L0 range. `do_compaction` (no lock) builds
  a `MergingIterator` over all inputs (L0 newest-first so ties resolve to newer versions), applies shadow-key
  pruning (only the newest version of each user key is kept) and tombstone elision (deletion markers are dropped
  when no data exists at L2+). Writes output SSTables to L1 (new file per `max_file_size`). `install_compaction`
  (under lock) records a single `VersionEdit` deleting all inputs and adding all outputs, then calls
  `log_and_apply`. `log_and_apply` now sorts L1–L6 files by smallest user key after each edit. `maybe_compact`
  orchestrates the three phases; called by the background thread after each flush. Compaction is fully
  asynchronous — a dedicated background thread runs flush and compaction while writers proceed. `Db` is split into
  `Arc<DbInner>` (shared with the background thread) + `JoinHandle` (joined on drop). Write backpressure:
  `make_room_for_write` sleeps 1 ms at L0 ≥ 8 (`L0_SLOWDOWN_WRITES_TRIGGER`), blocks at L0 ≥ 12
  (`L0_STOP_WRITES_TRIGGER`), and waits for an in-progress flush if `imm` is occupied. Level-score scheduling,
  compact-pointer round-robin, seek-based compaction, trivial-move optimisation, and grandparent-overlap limiting
  are all implemented. See `db/db_impl.cc: DBImpl::BackgroundCompaction`, `DoCompactionWork`, `MakeRoomForWrite`.
- [x] **`Db::compact_range(begin, end)`**: Manual compaction of a user-key range (`Option<&[u8]>` for open bounds).
  Finds the deepest level with files overlapping `[begin, end]`, then calls `compact_level_range` for each level
  from 0 to `max_level - 1`.  Each call uses the three-phase lock protocol; newly compacted files are visible to
  subsequent levels' compactions via a fresh version snapshot.  `pick_range_compaction` selects files at `level`
  that overlap `[begin, end]` plus next-level files that overlap the union range; reuses `do_compaction` and
  `install_compaction`.  No-op on in-memory databases.  See `db/db_impl.cc: DBImpl::CompactRange`.

**Iterator and snapshot API** *(all required by `include/leveldb/db.h`)*

- [x] **Backward iteration**: `SkipListIter::seek_to_last()` / `prev()` (scan level-0 from head; no back-pointers),
  `BlockIter::seek_to_last()` (last restart + scan forward) / `prev()` (binary-search restart points, scan to
  predecessor), `TwoLevelIterator::prev()` + `skip_empty_data_blocks_backward()`, `MergingIterator::prev()` /
  `find_largest()` (symmetric to `find_smallest`), `DbIterator::prev()` / `seek_to_last()` / `find_prev_user_entry()`
  (LevelDB-style direction field; Reverse direction stores entry in `saved_key`/`saved_value`; direction switch in
  `next()` handled). `InternalIterator` trait extended with `seek_to_last` + `prev`. Public `DbIter` exposes both.
  26 new unit tests across all layers; 161 tests total. See `table/iterator.h` and `db/db_iter.cc`.
- [x] **`Db::new_iterator(read_opts)`**: Returns `Result<DbIter, Error>`. Pins `Arc<Memtable>` for `mem` and `imm`
  via `ArcMemTableIter` (owned wrapper using `unsafe` transmute of `'a` lifetime, backed by the Arc). Creates
  `TwoLevelIterator` per SSTable from the current `Version`. Builds `MergingIterator``DbIterator`. Exposes:
  `valid`, `seek_to_first`, `seek_to_last`, `seek`, `next`, `prev`, `key`, `value`, `status`. `DbState.mem` and
  `.imm` changed to `Arc<Memtable>` to enable Arc cloning for the iterator. See `db/db_impl.cc`.
- [x] **`Db::GetSnapshot` / `ReleaseSnapshot`**: `get_snapshot()` locks, reads `last_sequence`, inserts into
  `DbState::snapshots: BTreeSet<u64>`, returns `Snapshot(seq)`. `release_snapshot(snap)` removes from the set.
  Compaction (`do_compaction`, `maybe_compact`, `compact_level_range`) uses `oldest_snapshot = snapshots.min()
  .unwrap_or(last_sequence)` as the visibility cutoff so entries observable through any live snapshot are not
  pruned. See `db/snapshot.h`.
- [x] **`Db::get` snapshot support**: `Db::get` now takes `&ReadOptions` as first argument. Snapshot seq comes
  from `opts.snapshot.map(|s| s.0).unwrap_or(last_sequence)`. `Memtable::get` changed to take `seq: u64`;
  uses `write_seek_key_to(key, seq)` instead of `write_lookup_to(key)` so reads are snapshot-aware.
  `lookup_size`/`write_lookup_to` gated with `#[cfg(test)]`. 4 new snapshot tests; 165 total.
  See `db/db_impl.cc: DBImpl::Get`.

**Remaining user-facing API surface** *(in `include/leveldb/db.h`)*

- [x] **`Db::GetProperty(property)`**: `get_property(&self, property: &str) -> Option<String>`. Supported
  properties: `"leveldb.num-files-at-level<N>"` (file count at level N), `"leveldb.stats"` (per-level file
  count and size; time/read/write columns emit `0` since compaction stats are not tracked),
  `"leveldb.sstables"` (one line per SSTable with file number, size, and escaped user-key range),
  `"leveldb.approximate-memory-usage"` (memtable bytes). Returns `None` for unknown properties or invalid level
  numbers. Helpers `num_files`, `level_bytes`, `debug_string` added to `Version`.
  See `db/db_impl.cc: DBImpl::GetProperty`.
- [x] **`Db::GetApproximateSizes(ranges)`**: Given a slice of `(start_key, end_key)` ranges, return approximate
  on-disk byte count for each range by walking the index blocks of overlapping SSTables. Used by tools to estimate
  data distribution. `Version::approximate_offset_of` accumulates `file_size` for files entirely before the key
  and delegates to `Table::approximate_offset_of` (index-block seek, returns block start offset or `metaindex_offset`
  as end-of-data proxy) for straddling files. Returns `0` for in-memory databases.
  See `db/db_impl.cc: DBImpl::GetApproximateSizes`.
- [x] **`Db::destroy(path)`**: Associated function (no open instance needed). Acquires `LOCK` non-blocking —
  returns `IoError` immediately if another process holds it. Enumerates the directory via `parse_db_filename`,
  deletes every recognised file except `LOCK`, then drops the flock and deletes `LOCK`. Calls `remove_dir`
  (not `remove_dir_all`) — silently ignores failure so unrecognised files are left in place.
  Returns `Ok(())` if `path` does not exist. See `db/db_impl.cc: DestroyDB`.

**`LOCK` file** *(prerequisite for the above free functions and safe multi-process use)*

- [x] **`LOCK` file**: On `Db::open`, acquire an exclusive `flock` on `<path>/LOCK` to prevent two processes from
  opening the same database simultaneously. Release on `Db::drop`. `acquire_lock` uses `libc::flock(LOCK_EX |
  LOCK_NB)` — returns `IoError` immediately rather than blocking if another process holds the lock. The `File`
  handle is stored as `lock_file: Option<File>` in `Db` (suppressed `dead_code` warning; held for Drop side-effect).
  See `util/env_posix.cc: LockFile`.

**Compression**

- [x] **Bloom filters**: Kirsch-Mitzenmacher double-hashing; `k = round(0.69 * bits_per_key)` hash functions; one filter
  per ~2 KB of data blocks stored in the SSTable filter block. The filter block is written uncompressed and read at
  `Table::open`; `Table::get` checks it after the index-block lookup and before reading or decompressing any data
  block, so a definite-negative skips all I/O and decompression entirely. A `FilterPolicy` trait allows custom
  filters. `Options::filter_policy` wires the chosen policy into flush and compaction.
  See `util/bloom.cc` and `include/leveldb/filter_policy.h`.
- [x] **Block compression**: Wire `Options::compression` and `Options::zstd_compression_level` through `TableBuilder`
  and `Table`.  `write_raw_block` in `src/table/format.rs` compresses data blocks via `snap` (Snappy) or `zstd`
  crates; falls back to NoCompression when compressed output is not at least 12.5% smaller (LevelDB's
  `kCompressionRatio` heuristic).  `read_block` decompresses based on the trailer byte (`0x00` = none, `0x01` =
  Snappy, `0x02` = Zstd).  Filter, metaindex, and index blocks are always written uncompressed (matching LevelDB).
  `TableBuilder::new` accepts `compression: CompressionType` and `zstd_level: i32`; both are passed from flush and
  compaction via `Options`.  Default remains `Snappy` (matching LevelDB).

**Options wiring**

- [x] **`ReadOptions::verify_checksums` + `Options::paranoid_checks`**: Both are fully wired.
  `verify_checksums = opts.verify_checksums || self.options.paranoid_checks` is computed in `Db::get_with_options`
  and `Db::new_iterator` and forwarded to `Version::get(verify_checksums)` and `Table::new_iterator(verify_checksums)`
  (which threads it into the `BlockFn` closure).  `Options::paranoid_checks` is also passed to `VersionSet::recover`
  and `recover_wal` as the `checksum` argument to `logfile::Reader::new`, so MANIFEST and WAL records are verified during
  recovery.  See `db/db_impl.cc: DBImpl::Get`, `table/table.cc: Table::InternalGet`.

---

### Post-parity improvements

*Optimisations and secondary features. None are prerequisites for the phases above. Items marked ✅ are done.*

- **Table cache**: LRU `TableCache` (`src/db/table_cache.rs`) — capacity
  `max_open_files - NUM_NON_TABLE_CACHE_FILES (10)`, internally synchronized via `Arc<Mutex<Inner>>`. `get_or_open`
  opens on miss and evicts LRU when at capacity; `insert` for newly-built tables; `evict` for compacted-away files.
  `FileMetaData` holds only durable metadata; tables opened lazily on first access. `Db` holds
  `table_cache: Option<TableCache>` (None for in-memory). See `db/table_cache.h/cc`.
-**`Options::max_open_files`**: Fully wired via `TableCache`. Capacity =
  `max_open_files.saturating_sub(NUM_NON_TABLE_CACHE_FILES).max(1)`. See `db/db_impl.cc: DBImpl::Open`.
-**Block cache**: Simple LRU `BlockCache` (`src/cache/mod.rs`) keyed by `(cache_id, block_offset)`. Per-table
  `cache_id` from `AtomicU64` prevents cross-file collisions. `Block` is `Clone` (wraps `Arc<Vec<u8>>`), so cache
  hits return a cheap clone. `Table` holds `cache_id: u64` + `Option<Arc<BlockCache>>`; `read_data_block` checks
  cache on hit, reads from disk on miss, inserts when `fill_cache = true`. `new_iterator` block-opener closure
  similarly checks/inserts the cache. Default: 8 MiB (`Options::block_cache`). See `util/cache.cc`.
-**`ReadOptions::fill_cache`**: Fully wired. `fill_cache = false` skips inserting blocks into the block cache
  (used by `do_compaction` to avoid polluting the cache with bulk-scan data). Threaded through `Db::get_with_options`,
  `Db::new_iterator`, `Version::get`, `Table::get`, and `Table::new_iterator`.
-**Batch-grouped writes**: Multiple concurrent `Db::write` callers are grouped by a leader that writes a single
  combined WAL record and memtable batch, amortising `fsync` cost over many writers. `DbState` holds
  `writers: VecDeque<WriterSlot>`, `next_writer_id: u64`, and `completed: HashMap<u64, Result<(), Error>>`;
  `Db` holds `write_condvar: Condvar` (paired with `Mutex<DbState>`). Group bounds match LevelDB's
  `BuildBatchGroup`: max 1 MiB; tightened to `first_size + 128 KiB` when the first batch is ≤ 128 KiB; a
  `sync=true` writer won't join a non-sync group. Leader pops followers, stores results in `completed`, calls
  `notify_all`; followers retrieve their result by `id` and return. Leader handles flush/compaction after.
  See `db/db_impl.cc: DBImpl::Write`.
-**Custom comparator**: `Comparator` trait (`src/comparator.rs`) with `compare`, `name`,
  `find_shortest_separator`, `find_short_successor`. `BytewiseComparator` is the default. `Options::comparator:
  Arc<dyn Comparator>` threaded through all ~40 comparison sites: skip list entry ordering, `cmp_internal_keys`,
  `BlockIter::seek`, `MergingIterator`, `DbIterator`, `Table::get`, `TableBuilder` (index separators shortened via
  `find_shortest_separator`/`find_short_successor`), `VersionSet` file sorting, and all `lib.rs` compaction/overlap
  helpers. `VersionEdit` encodes/decodes the comparator name; `VersionSet::recover` rejects mismatches.
  See `include/leveldb/comparator.h`.
-**`FileSystem` abstraction**: `FileSystem` trait (`src/env.rs`) with `SequentialFile`, `RandomAccessFile`,
  `WritableFile`, and `FileLock` sub-traits. `PosixFileSystem` is the default (POSIX `std::fs` + `libc::flock`).
  `Options::file_system: Arc<dyn FileSystem>` threaded through all I/O: `LogWriter`, `LogReader`, `Table`,
  `TableBuilder`, `TableCache`, `VersionSet`, and all `lib.rs` functions (flush, compaction, repair, destroy, GC).
  `libc` usage isolated to `PosixFileSystem`. Enables in-memory, encrypted, or cloud storage backends.
  See `include/rocksdb/file_system.h`.
-**`Db::repair(path, options)`**: Scans the database directory for surviving `.ldb` and `.log` files, converts WALs
  to SSTables via `LogReader``Memtable``TableBuilder`, extracts metadata from all SSTables, writes a fresh
  `MANIFEST-000001` placing all files at L0, and writes `CURRENT`.  Corrupt records/files are skipped and archived to
  `lost/`.  Port of `db/repair.cc: RepairDB`.
-**`Options::reuse_logs`**: When `false` (default), replayed WAL data is flushed to an SSTable during `Db::open`
  so the next open doesn't need to replay it.  When `true`, the replayed memtable is kept as the active `mem` and
  the existing WAL is re-opened for appending — faster opens at the cost of replaying the WAL on the next open.
  See `db/db_impl.cc: DBImpl::RecoverLogFile`.
- **`Db::flush_wal(sync: bool)` / `Db::sync_wal()` + manual flush mode**: RocksDB's WAL writer supports two modes.
  In auto-flush mode (default, `manual_flush=false`), `emit_physical_record` flushes the buffer to the OS after every
  `AddRecord` — this is what our `LogWriter` currently does.  In manual-flush mode (`manual_flush=true`), records are
  only appended to the internal buffer; the caller must explicitly call `DB::FlushWAL` to push data to the OS.  Manual
  flush mode enables higher write throughput by batching multiple WAL records into a single OS write.
  `FlushWAL(sync=false)` calls `dest_->Flush()` (push to page cache); `FlushWAL(sync=true)` also `fsync`s.
  `SyncWAL()` is shorthand for `FlushWAL(true)`.  Implementation: add `manual_flush: bool` to `LogWriter`; when
  `true`, skip the `self.dest.flush()` in `emit_physical_record`.  Add `Db::flush_wal(sync)` and `Db::sync_wal()`.
  See `db/log_writer.cc: Writer::AddRecord` (lines 180–183) and `DB::FlushWAL` in `include/rocksdb/db.h`.
- **WAL file recycling** (RocksDB): Reuse old log files from a pool rather than deleting and recreating them, avoiding
  inode churn on slow filesystems. Uses a 12-byte header (standard 7 bytes + `log_number: u32`) so the reader can
  distinguish recycled files. See `db/log_writer.cc` in RocksDB.
- **Ribbon filters** (RocksDB): Drop-in replacement for Bloom filters offering the same false-positive rate in ~30%
  fewer bits. See `util/ribbon_impl.h` in RocksDB.
- **`Db::repair` directory locking** (RocksDB): Acquire the `LOCK` file during repair to prevent concurrent access.
  LevelDB does not lock during repair; RocksDB does (`env_->LockFile` in `db/repair.cc`). Our current implementation
  follows LevelDB and does not lock.
-**`CompactionFilter` / `CompactionFilterFactory`** (RocksDB): `CompactionFilter` trait
  (`src/compaction_filter.rs`) with `filter(level, key, value, value_type) -> CompactionDecision` returning `Keep`,
  `Remove`, or `ChangeValue(new_val)`.  `CompactionFilterFactory` creates a fresh filter per compaction run
  (allows per-compaction mutable state like timestamps for TTL).  `Options::compaction_filter_factory:
  Option<Arc<dyn CompactionFilterFactory>>``None` disables filtering (default).  The filter is called in
  `do_compaction` for the newest visible version of each key, after shadow-key pruning, before writing to the output
  SSTable.  Entries visible to a live snapshot are never filtered.
  See `include/rocksdb/compaction_filter.h`.

---

## Key conventions

- `rustfmt.toml` sets `tab_spaces = 2`.
- `unsafe` blocks must carry a `// SAFETY:` comment; `unsafe fn` must carry a `# Safety` doc section.
- `find_splice_for_level` and `recompute_splice_levels` are free functions (not `SkipList` methods) to avoid a
  simultaneous `&self` / `&mut self.splice` borrow conflict — a pattern to follow whenever a method needs both `&self`
  and `&mut self.field`.
- Methods or functions only used in tests are gated with `#[cfg(test)]`.
- Prefer encoding directly into arena/pre-allocated memory (see `Entry` helpers) over intermediate `Vec` allocations on
  hot paths.
- Run `cargo fmt` on changesets
- Have text and other files hardwrap at 120 character lines