roughdb 0.9.0 - Docs.rs

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project

RoughDB is an embedded key-value store written in Rust, porting LevelDB to Rust. The LevelDB C++ source lives at
`../leveldb/` (primary reference). RocksDB source lives at `../rocksdb/` and should only be consulted for improvements
to existing LevelDB features — not for porting RocksDB-specific features.

## Commands

```bash
cargo test                        # run all tests
cargo test memtable               # run tests matching a path/name
cargo test -- --nocapture         # show println! output
cargo bench                       # run Criterion benchmarks (benches/db.rs)
cargo clippy                      # lint
cargo fmt                         # format (2-space indent, see rustfmt.toml)
```

## Architecture

RoughDB is an LSM-tree key-value store. The logical write path is:

```
Db::put/delete/write
  → WriteBatch (encode ops)
  → WAL (durability: written before ack)
  → Memtable (in-memory skip list, most recent writes)
      ↓ (when full, seal as imm and flush)
  → SSTable L0 (sorted, immutable file)
      ↓ (background compaction)
  → SSTable L1 … L6
```

The read path checks `mem` → `imm` → each level of SSTables (newest to oldest), stopping at the first hit.

### Implemented

**Memtable** (`src/memtable/`)
- Skip list with RocksDB's `InlineSkipList` memory layout: every node is a single arena allocation containing the
  level-0 link, higher-level links prefixed before it (negative indexing), a `u32` payload length, and the
  varint-encoded entry inline. This keeps the hot level-0 link and payload in the same cache line.
- `Splice` caches `prev`/`next` from the last insert for O(1) amortised sequential inserts.
- Lock-free reads (Acquire/Release atomics); writes serialised by the DB-level write mutex in `Db` (no lock inside
  `Memtable` itself — `UnsafeCell<SkipList>` + `unsafe impl Sync`).

**Entry encoding** (`src/memtable/entry.rs`)
- Internal key format: `[klen: varint][key][seq: varint][vtype: u8][vlen: varint][value]`
- `vtype`: `0` = Deletion, `1` = Value.
- Ordering: key ASC, seq DESC — so a lookup key with `seq = u64::MAX` always seeks to the newest version of a user key.
- Allocation-free helpers (`write_value_to`, `write_lookup_to`, etc.) encode directly into arena memory.

**Arena** (`src/memtable/arena/bump.rs`)
- Thin wrapper around `bumpalo::Bump`; `allocate_aligned(size, align)` panics on OOM (matching LevelDB: memtable OOM
  is unrecoverable).
- Unlike LevelDB/RocksDB, which manage explicit 4 KB slabs and bypass them for large allocations (> slab/4), we
  delegate slab management entirely to bumpalo. The large-allocation bypass is bumpalo's internal policy rather than
  ours, but the outcome is equivalent.
- **`memory_usage()`**: `Arena` tracks exact bytes used via an explicit `AtomicUsize` counter (bumpalo's
  `allocated_bytes()` returns chunk *capacity* not bytes used). `Memtable::approximate_memory_usage()` delegates to
  this; `Db::write` compares against `options.write_buffer_size` to trigger L0 flush.  The arena grows unboundedly
  (matching LevelDB — no capacity limit); the flush threshold is the sole governing constraint.

---

## Roadmap

Features are listed in dependency order. Each phase must be complete before the next begins.

### Phase 1 — Core types

*Prerequisite for everything.*

- [x] **`Error`** (`src/error.rs`): `Result`-based error type with variants `NotFound`, `Corruption`, `InvalidArgument`,
  `NotSupported`, `IoError`. `Db::get/put/delete` now return `Result<_, Error>`. See `include/leveldb/status.h`.
- [x] **`WriteBatch`**: Serialises a sequence of Put/Delete operations into a byte buffer. Batch format: `[seq:
  u64][count: u32][records…]` where each record is `[vtype: u8][klen: varint][key][vlen: varint][value]`. Exposes `put`,
  `delete`, `iterate(Handler)`, `approximate_size`. `Clone`-able. `Db::put`/`delete` are thin wrappers over `Db::write`.
  See `include/leveldb/write_batch.h` and `db/write_batch.cc`.
- [x] **`Options` / `ReadOptions` / `WriteOptions`**: Configuration structs. `Options` carries the write-buffer size,
  block size, compression type, filter policy, block cache handle. `WriteOptions` adds a sync flag. `ReadOptions` adds a
  snapshot handle and checksum-verify flag. See `include/leveldb/options.h`.

### Phase 2 — Write-Ahead Log

*Enables crash durability and the correct sequence-number model. Must precede full `Db::Open`.*

**Step 1 — Prerequisites**

- [x] **`crc32c` crate**: Add `crc32c = "..."` to `Cargo.toml`. Used for record header checksums.
- [x] **Sequence number ownership**: Move `last_sequence` from `Memtable` to `Db`. Carried as part of `WriteState`
  (alongside the optional `LogWriter`) inside `write_lock: Mutex<WriteState>` — the type system enforces that both
  are inaccessible without holding the write lock, matching LevelDB's plain `uint64_t` under `mutex_`.
  `Memtable::add`/`delete` gain an explicit `seq: u64` parameter (internal `next_seq` counter removed). `Db::write`
  reads `start_seq = state.last_sequence + 1`, clones and stamps the batch, writes to WAL, iterates passing
  incrementing sequence numbers per record, then advances `state.last_sequence += batch.count()` after all inserts.

**Step 2 — Log primitives**

- [x] **Format** (`src/logfile/format.rs`): 32 KB blocks. Record header: `[crc32c: u32 LE][len: u16 LE][type: u8]` (7
  bytes). Fragment types: `Zero=0` (pad), `Full=1`, `First=2`, `Middle=3`, `Last=4`. Constants: `BLOCK_SIZE = 32768`,
  `HEADER_SIZE = 7`. See `db/log_format.h`.
- [x] **`logfile::Writer`** (`src/logfile/writer.rs`): Wraps a `Box<dyn WritableFile>` (pluggable via `FileSystem` trait).
  `add_record(&[u8])` fragments the payload across block boundaries; pads trailing fewer-than-7-byte remnants
  with zeros; pre-computes per-type CRCs on construction. Constructor: `new(dest: Box<dyn WritableFile>, dest_length: u64)`
  where `dest_length` allows resuming an existing file. See `db/log_writer.h/cc`.
- [x] **`logfile::Reader`** (`src/logfile/reader.rs`): Wraps `File`. `read_record() -> Option<Vec<u8>>` reassembles
  multi-fragment records into a scratch buffer. Optional CRC verification (`checksum: bool`). Reports corruption via a
  `Reporter` trait (one method: `corruption(bytes: u64, reason: &str)`). Resynchronisation: when `initial_offset > 0`,
  skips `Middle`/`Last` fragments until a `Full` or `First` is found. See `db/log_reader.h/cc`.

**Step 3 — Integration**

- [x] **`Db::open(path)`** (`src/lib.rs` or `src/db/mod.rs`): Creates the directory if absent; opens or creates
  `<path>/000001.log`; constructs `Db` with a live `logfile::Writer`. Returns `Result<Db, Error>`. `Db::default()` retained
  for in-memory/test use (no WAL). Full MANIFEST-driven log-number tracking deferred to Phase 9.
- [x] **`Db::write` wired to WAL**: Change signature to `fn write(&self, opts: &WriteOptions, batch: &WriteBatch) ->
  Result<(), Error>`. Under the write lock: read `start_seq`, call `batch.set_sequence(start_seq)`, call
  `log_writer.add_record(batch.contents())`, sync if `opts.sync`, iterate into memtable, advance `last_sequence`.
  The stamp must precede the log write so WAL records carry the correct embedded sequence for recovery replay.
  `Db::put`/`delete` updated to forward a default `WriteOptions`.
- [x] **Recovery** (`Db::open` calls this when WAL exists): Construct a `logfile::Reader` from offset 0; for each record
  decode it as a `WriteBatch` and replay via `batch.iterate(&mut inserter)` using the batch's embedded sequence; after
  all records restore `last_sequence` to the highest sequence seen. Incomplete trailing records (torn write on crash)
  are silently ignored.

### Phase 3 — SSTable format + L0 flush + disk reads

*The on-disk immutable file format, wired end-to-end: memtable flushes to L0 SSTables and `Db::get` falls through to
disk on in-memory misses. Mirrors the approach taken in Phase 2, where the WAL was integrated rather than deferred.*

**SSTable primitives**

- [x] **`Block`** (`src/table/block.rs`): Delta-encoded key-value pairs with restart points every N keys (default 16).
  Each entry: `[shared_len: varint][unshared_len: varint][value_len: varint][key_suffix][value]`. Restart point array at
  block tail enables binary search. See `table/block.h/cc`.
- [x] **`BlockIterator`** (`src/table/block.rs`): Decodes delta-encoded entries on the fly; jumps to restart points for
  seeks using `cmp_internal_keys` (user_key ASC, seq DESC). Needed by `Table::get` to decode data blocks — pulled
  forward from Phase 6. See `table/block.cc`.
- [x] **`BlockBuilder`** (`src/table/block_builder.rs`): Accumulates entries into a `Block`, tracking the last key for
  delta encoding. See `table/block_builder.h/cc`.
- [x] **`BlockHandle` / `Footer`** (`src/table/format.rs`): `BlockHandle` is an (offset, size) pair encoded as two
  varints. Footer is 48 bytes: metaindex handle + index handle + 8-byte magic (`0xdb4775248b80fb57`). Also contains
  `make_internal_key`, `parse_internal_key`, `cmp_internal_keys`. See `table/format.h`.
- [x] **`TableBuilder`** (`src/table/builder.rs`): Sequentially appends sorted key-value pairs; writes data blocks,
  index block, empty metaindex block, and footer. NoCompression only (Phase 3). No filter blocks. Index block uses
  `restart_interval=1`. See `include/leveldb/table_builder.h` and `table/table_builder.cc`.
- [x] **`Table`** (`src/table/reader.rs`): Random-access reader; reads footer + index block on `open`; `get(user_key)`
  searches index block, reads data block via `pread` (`FileExt::read_at`), returns value/tombstone/None.
  See `table/table.cc`.

**Memtable iteration** *(pulled forward from Phase 6 — required for flush)*

- [x] **`SkipList::iter()`**: Forward-only iteration over the skip list (backward deferred to Phase 6). Yields entries
  in internal-key order (user key ASC, sequence DESC) via `SkipListIter<'a>` (walks level-0 links with Acquire loads).
- [x] **`MemTableIterator`**: Thin forward iterator over `SkipList::iter()`, decoding each entry via `Entry::from_slice`
  to expose `ikey()` (SSTable internal key: `user_key || tag`) and `value()` to the flusher. `Entry::key()`/`value()`
  lifetimes refined to `&'a [u8]` so values borrow from the arena, not from the short-lived `Entry` view.

**L0 flush + disk read wiring**

- [x] **`Arena::memory_usage()`**: Tracks bytes via an explicit `AtomicUsize` counter (bumpalo's `allocated_bytes()`
  returns chunk *capacity*, not used bytes). `Memtable::approximate_memory_usage()` delegates to this counter so `Db`
  can compare against `Options::write_buffer_size`.
- [x] **`Db::open` takes `Options`**: Pass `write_buffer_size` (and future options) through to the database.
- [x] **L0 flush** (`Db`): When `mem` exceeds `write_buffer_size` after a write, `std::mem::replace` swaps in a fresh
  `Memtable`, the old one is iterated via `MemTableIterator` and flushed to `<path>/<number>.ldb` via `TableBuilder`,
  then the opened `Table` is prepended to `l0_files: Vec<(u64, Table)>`. `scan_next_file_number` prevents file-number
  collisions across sessions. All mutable state lives in `Mutex<DbState>`.
- [x] **`Db::get` disk fallback**: On miss in `mem`, scan `l0_files` newest-first, calling `Table::get` on each.
  `LookupResult::{Value, Deleted, NotInTable}` lets tombstones stop the search without continuing to older files.
  No multi-level routing yet — deferred to Phase 9.

### Phase 4 — Iterators

*Required by `Db::NewIterator` and internally by compaction. Forward `SkipList` iteration, `MemTableIterator`, and
`BlockIterator` were pulled into Phase 3; this phase completes the forward iterator stack. Backward iteration is
deferred to post-parity as it is not needed for compaction or the core read path.*

- [x] **`TwoLevelIterator`** (`src/table/two_level_iterator.rs`): Composes an index iterator and a block-opener
  function, yielding a transparent view over all blocks in an SSTable. `BlockFn` closure captures a cloned `File`
  handle and opens data blocks on demand. `BlockIter` (renamed from `BlockIterator<'a>`) now owns an `Arc<Vec<u8>>`
  so it can be stored without a lifetime parameter. `InternalIterator` trait in `src/iter.rs` provides the shared
  interface used by `TwoLevelIterator`, `MergingIterator`, and `DbIterator`. See `table/two_level_iterator.h/cc`.
- [x] **`MergingIterator`** (`src/db/merge_iter.rs`): N-way merge of sorted iterators using a heap or linear scan
  (LevelDB uses linear for small N). See `table/merger.h/cc`.
- [x] **`DbIterator`** (`src/db/db_iter.rs`): Wraps `MergingIterator`; applies snapshot sequence-number filtering,
  merges value versions, skips tombstones for the user. Forward-only for now. See `db/db_iter.h/cc`.

### Phase 5 — Version management

*Tracks the set of live SSTable files across levels, enables MANIFEST-based recovery, and gates compaction. Open
`Table` handles are stored as `Arc<Table>` directly in `FileMetaData` — no LRU table cache needed for correctness;
that is a post-parity optimisation.*

- [x] **`FileMetaData`** (`src/db/version_edit.rs`): File number, size, smallest/largest internal key bytes,
  `Option<Arc<Table>>` for the open reader (`Some` for all files in a live `Version`; `None` only during MANIFEST
  replay). `FileMetaData::with_table` is the canonical constructor. See `db/version_edit.h`.
- [x] **`VersionEdit`** (`src/db/version_edit.rs`): Atomic delta — files added/removed per level, log number, next
  file number, last sequence. LevelDB-compatible tag-based binary encoding; `table` field is in-memory only and not
  encoded. `encode`/`decode` with forward-compatible unknown-tag handling. See `db/version_edit.h/cc`.
- [x] **`Version`** (`src/db/version.rs`): Snapshot of the LSM structure — `[Vec<Arc<FileMetaData>>; 7]`.
  L0 newest-first; L1–L6 sorted by smallest key. `get(user_key, seq)` scans L0 then L1–L6 linearly (binary
  search deferred to Phase 6). Reference-counted via `Arc<Version>`. See `db/version_set.h/cc`.
- [x] **`VersionSet`** (`src/db/version_set.rs`): Maintains current `Arc<Version>`, MANIFEST log, and
  file-number/sequence state. `create` writes `MANIFEST-000002` + `CURRENT`; `recover` replays MANIFEST via a
  `Builder` that opens Tables for all live files; `log_and_apply` appends to MANIFEST and installs a new `Version`.
  `DbState` replaces `l0_files`/`next_file_number` with `version_set: Option<VersionSet>`; `Db::open` uses
  MANIFEST-driven recovery with WAL replay filtered to `seq > manifest_last_sequence`. See `db/version_set.h/cc`.

### Phase 6 — Full database

*Completes the user-facing API to match LevelDB's `include/leveldb/db.h` and wires all prior phases into a
crash-safe database with multi-level compaction. MANIFEST-driven recovery and level routing are done (Phase 5);
what remains is compaction, the full Iterator/Snapshot API, and operational hygiene (WAL rotation, file GC).*

**Operational correctness**

- [x] **WAL rotation after flush**: `begin_flush` allocates a new WAL file number and creates the new log file
  under the write lock; `finish_flush` calls `vs.set_log_number(new)` before `log_and_apply` so the MANIFEST
  atomically records the new SST and the new log number; then swaps `state.log` and deletes the old WAL.
  Prevents unbounded WAL growth. See `db/db_impl.cc: DBImpl::MakeRoomForWrite`.
- [x] **`DeleteObsoleteFiles`**: After every flush, `delete_obsolete_files` takes the lock briefly to snapshot the
  live `.ldb` set (`VersionSet::add_live_files`), `log_number`, and `manifest_number`, then releases the lock and
  scans the directory. Keep rules: `.log` ≥ `log_number`, `MANIFEST-*` ≥ `manifest_number`, `.ldb` in live set,
  `CURRENT`/`LOCK` always kept. `parse_db_filename` mirrors LevelDB's `ParseFileName`. Called in `Db::write` after
  `finish_flush` drops its lock. See `db/db_impl.cc: DBImpl::RemoveObsoleteFiles`.

**Compaction**

- [x] **L0→L1 compaction**: Triggered when L0 file count ≥ 4 (`L0_COMPACTION_TRIGGER`). `pick_compaction` selects
  all L0 files plus any L1 files whose user-key range overlaps the union L0 range. `do_compaction` (no lock) builds
  a `MergingIterator` over all inputs (L0 newest-first so ties resolve to newer versions), applies shadow-key
  pruning (only the newest version of each user key is kept) and tombstone elision (deletion markers are dropped
  when no data exists at L2+). Writes output SSTables to L1 (new file per `max_file_size`). `install_compaction`
  (under lock) records a single `VersionEdit` deleting all inputs and adding all outputs, then calls
  `log_and_apply`. `log_and_apply` now sorts L1–L6 files by smallest user key after each edit. `maybe_compact`
  orchestrates the three phases; called by the background thread after each flush. Compaction is fully
  asynchronous — a dedicated background thread runs flush and compaction while writers proceed. `Db` is split into
  `Arc<DbInner>` (shared with the background thread) + `JoinHandle` (joined on drop). Write backpressure:
  `make_room_for_write` sleeps 1 ms at L0 ≥ 8 (`L0_SLOWDOWN_WRITES_TRIGGER`), blocks at L0 ≥ 12
  (`L0_STOP_WRITES_TRIGGER`), and waits for an in-progress flush if `imm` is occupied. Level-score scheduling,
  compact-pointer round-robin, seek-based compaction, trivial-move optimisation, and grandparent-overlap limiting
  are all implemented. See `db/db_impl.cc: DBImpl::BackgroundCompaction`, `DoCompactionWork`, `MakeRoomForWrite`.
- [x] **`Db::compact_range(begin, end)`**: Manual compaction of a user-key range (`Option<&[u8]>` for open bounds).
  Finds the deepest level with files overlapping `[begin, end]`, then calls `compact_level_range` for each level
  from 0 to `max_level - 1`.  Each call uses the three-phase lock protocol; newly compacted files are visible to
  subsequent levels' compactions via a fresh version snapshot.  `pick_range_compaction` selects files at `level`
  that overlap `[begin, end]` plus next-level files that overlap the union range; reuses `do_compaction` and
  `install_compaction`.  No-op on in-memory databases.  See `db/db_impl.cc: DBImpl::CompactRange`.

**Iterator and snapshot API** *(all required by `include/leveldb/db.h`)*

- [x] **Backward iteration**: `SkipListIter::seek_to_last()` / `prev()` (scan level-0 from head; no back-pointers),
  `BlockIter::seek_to_last()` (last restart + scan forward) / `prev()` (binary-search restart points, scan to
  predecessor), `TwoLevelIterator::prev()` + `skip_empty_data_blocks_backward()`, `MergingIterator::prev()` /
  `find_largest()` (symmetric to `find_smallest`), `DbIterator::prev()` / `seek_to_last()` / `find_prev_user_entry()`
  (LevelDB-style direction field; Reverse direction stores entry in `saved_key`/`saved_value`; direction switch in
  `next()` handled). `InternalIterator` trait extended with `seek_to_last` + `prev`. Public `DbIter` exposes both.
  26 new unit tests across all layers; 161 tests total. See `table/iterator.h` and `db/db_iter.cc`.
- [x] **`Db::new_iterator(read_opts)`**: Returns `Result<DbIter, Error>`. Pins `Arc<Memtable>` for `mem` and `imm`
  via `ArcMemTableIter` (owned wrapper using `unsafe` transmute of `'a` lifetime, backed by the Arc). Creates
  `TwoLevelIterator` per SSTable from the current `Version`. Builds `MergingIterator` → `DbIterator`. Exposes:
  `valid`, `seek_to_first`, `seek_to_last`, `seek`, `next`, `prev`, `key`, `value`, `status`. `DbState.mem` and
  `.imm` changed to `Arc<Memtable>` to enable Arc cloning for the iterator. See `db/db_impl.cc`.
- [x] **`Db::GetSnapshot` / `ReleaseSnapshot`**: `get_snapshot()` locks, reads `last_sequence`, inserts into
  `DbState::snapshots: BTreeSet<u64>`, returns `Snapshot(seq)`. `release_snapshot(snap)` removes from the set.
  Compaction (`do_compaction`, `maybe_compact`, `compact_level_range`) uses `oldest_snapshot = snapshots.min()
  .unwrap_or(last_sequence)` as the visibility cutoff so entries observable through any live snapshot are not
  pruned. See `db/snapshot.h`.
- [x] **`Db::get` snapshot support**: `Db::get` now takes `&ReadOptions` as first argument. Snapshot seq comes
  from `opts.snapshot.map(|s| s.0).unwrap_or(last_sequence)`. `Memtable::get` changed to take `seq: u64`;
  uses `write_seek_key_to(key, seq)` instead of `write_lookup_to(key)` so reads are snapshot-aware.
  `lookup_size`/`write_lookup_to` gated with `#[cfg(test)]`. 4 new snapshot tests; 165 total.
  See `db/db_impl.cc: DBImpl::Get`.

**Remaining user-facing API surface** *(in `include/leveldb/db.h`)*

- [x] **`Db::GetProperty(property)`**: `get_property(&self, property: &str) -> Option<String>`. Supported
  properties: `"leveldb.num-files-at-level<N>"` (file count at level N), `"leveldb.stats"` (per-level file
  count and size; time/read/write columns emit `0` since compaction stats are not tracked),
  `"leveldb.sstables"` (one line per SSTable with file number, size, and escaped user-key range),
  `"leveldb.approximate-memory-usage"` (memtable bytes). Returns `None` for unknown properties or invalid level
  numbers. Helpers `num_files`, `level_bytes`, `debug_string` added to `Version`.
  See `db/db_impl.cc: DBImpl::GetProperty`.
- [x] **`Db::GetApproximateSizes(ranges)`**: Given a slice of `(start_key, end_key)` ranges, return approximate
  on-disk byte count for each range by walking the index blocks of overlapping SSTables. Used by tools to estimate
  data distribution. `Version::approximate_offset_of` accumulates `file_size` for files entirely before the key
  and delegates to `Table::approximate_offset_of` (index-block seek, returns block start offset or `metaindex_offset`
  as end-of-data proxy) for straddling files. Returns `0` for in-memory databases.
  See `db/db_impl.cc: DBImpl::GetApproximateSizes`.
- [x] **`Db::destroy(path)`**: Associated function (no open instance needed). Acquires `LOCK` non-blocking —
  returns `IoError` immediately if another process holds it. Enumerates the directory via `parse_db_filename`,
  deletes every recognised file except `LOCK`, then drops the flock and deletes `LOCK`. Calls `remove_dir`
  (not `remove_dir_all`) — silently ignores failure so unrecognised files are left in place.
  Returns `Ok(())` if `path` does not exist. See `db/db_impl.cc: DestroyDB`.

**`LOCK` file** *(prerequisite for the above free functions and safe multi-process use)*

- [x] **`LOCK` file**: On `Db::open`, acquire an exclusive `flock` on `<path>/LOCK` to prevent two processes from
  opening the same database simultaneously. Release on `Db::drop`. `acquire_lock` uses `libc::flock(LOCK_EX |
  LOCK_NB)` — returns `IoError` immediately rather than blocking if another process holds the lock. The `File`
  handle is stored as `lock_file: Option<File>` in `Db` (suppressed `dead_code` warning; held for Drop side-effect).
  See `util/env_posix.cc: LockFile`.

**Compression**

- [x] **Bloom filters**: Kirsch-Mitzenmacher double-hashing; `k = round(0.69 * bits_per_key)` hash functions; one filter
  per ~2 KB of data blocks stored in the SSTable filter block. The filter block is written uncompressed and read at
  `Table::open`; `Table::get` checks it after the index-block lookup and before reading or decompressing any data
  block, so a definite-negative skips all I/O and decompression entirely. A `FilterPolicy` trait allows custom
  filters. `Options::filter_policy` wires the chosen policy into flush and compaction.
  See `util/bloom.cc` and `include/leveldb/filter_policy.h`.
- [x] **Block compression**: Wire `Options::compression` and `Options::zstd_compression_level` through `TableBuilder`
  and `Table`.  `write_raw_block` in `src/table/format.rs` compresses data blocks via `snap` (Snappy) or `zstd`
  crates; falls back to NoCompression when compressed output is not at least 12.5% smaller (LevelDB's
  `kCompressionRatio` heuristic).  `read_block` decompresses based on the trailer byte (`0x00` = none, `0x01` =
  Snappy, `0x02` = Zstd).  Filter, metaindex, and index blocks are always written uncompressed (matching LevelDB).
  `TableBuilder::new` accepts `compression: CompressionType` and `zstd_level: i32`; both are passed from flush and
  compaction via `Options`.  Default remains `Snappy` (matching LevelDB).

**Options wiring**

- [x] **`ReadOptions::verify_checksums` + `Options::paranoid_checks`**: Both are fully wired.
  `verify_checksums = opts.verify_checksums || self.options.paranoid_checks` is computed in `Db::get_with_options`
  and `Db::new_iterator` and forwarded to `Version::get(verify_checksums)` and `Table::new_iterator(verify_checksums)`
  (which threads it into the `BlockFn` closure).  `Options::paranoid_checks` is also passed to `VersionSet::recover`
  and `recover_wal` as the `checksum` argument to `logfile::Reader::new`, so MANIFEST and WAL records are verified during
  recovery.  See `db/db_impl.cc: DBImpl::Get`, `table/table.cc: Table::InternalGet`.

---

### Post-parity improvements

*Optimisations and secondary features. None are prerequisites for the phases above. Items marked ✅ are done.*

- ✅ **Table cache**: LRU `TableCache` (`src/db/table_cache.rs`) — capacity
  `max_open_files - NUM_NON_TABLE_CACHE_FILES (10)`, internally synchronized via `Arc<Mutex<Inner>>`. `get_or_open`
  opens on miss and evicts LRU when at capacity; `insert` for newly-built tables; `evict` for compacted-away files.
  `FileMetaData` holds only durable metadata; tables opened lazily on first access. `Db` holds
  `table_cache: Option<TableCache>` (None for in-memory). See `db/table_cache.h/cc`.
- ✅ **`Options::max_open_files`**: Fully wired via `TableCache`. Capacity =
  `max_open_files.saturating_sub(NUM_NON_TABLE_CACHE_FILES).max(1)`. See `db/db_impl.cc: DBImpl::Open`.
- ✅ **Block cache**: Simple LRU `BlockCache` (`src/cache/mod.rs`) keyed by `(cache_id, block_offset)`. Per-table
  `cache_id` from `AtomicU64` prevents cross-file collisions. `Block` is `Clone` (wraps `Arc<Vec<u8>>`), so cache
  hits return a cheap clone. `Table` holds `cache_id: u64` + `Option<Arc<BlockCache>>`; `read_data_block` checks
  cache on hit, reads from disk on miss, inserts when `fill_cache = true`. `new_iterator` block-opener closure
  similarly checks/inserts the cache. Default: 8 MiB (`Options::block_cache`). See `util/cache.cc`.
- ✅ **`ReadOptions::fill_cache`**: Fully wired. `fill_cache = false` skips inserting blocks into the block cache
  (used by `do_compaction` to avoid polluting the cache with bulk-scan data). Threaded through `Db::get_with_options`,
  `Db::new_iterator`, `Version::get`, `Table::get`, and `Table::new_iterator`.
- ✅ **Batch-grouped writes**: Multiple concurrent `Db::write` callers are grouped by a leader that writes a single
  combined WAL record and memtable batch, amortising `fsync` cost over many writers. `DbState` holds
  `writers: VecDeque<WriterSlot>`, `next_writer_id: u64`, and `completed: HashMap<u64, Result<(), Error>>`;
  `Db` holds `write_condvar: Condvar` (paired with `Mutex<DbState>`). Group bounds match LevelDB's
  `BuildBatchGroup`: max 1 MiB; tightened to `first_size + 128 KiB` when the first batch is ≤ 128 KiB; a
  `sync=true` writer won't join a non-sync group. Leader pops followers, stores results in `completed`, calls
  `notify_all`; followers retrieve their result by `id` and return. Leader handles flush/compaction after.
  See `db/db_impl.cc: DBImpl::Write`.
- ✅ **Custom comparator**: `Comparator` trait (`src/comparator.rs`) with `compare`, `name`,
  `find_shortest_separator`, `find_short_successor`. `BytewiseComparator` is the default. `Options::comparator:
  Arc<dyn Comparator>` threaded through all ~40 comparison sites: skip list entry ordering, `cmp_internal_keys`,
  `BlockIter::seek`, `MergingIterator`, `DbIterator`, `Table::get`, `TableBuilder` (index separators shortened via
  `find_shortest_separator`/`find_short_successor`), `VersionSet` file sorting, and all `lib.rs` compaction/overlap
  helpers. `VersionEdit` encodes/decodes the comparator name; `VersionSet::recover` rejects mismatches.
  See `include/leveldb/comparator.h`.
- ✅ **`FileSystem` abstraction**: `FileSystem` trait (`src/env.rs`) with `SequentialFile`, `RandomAccessFile`,
  `WritableFile`, and `FileLock` sub-traits. `PosixFileSystem` is the default (POSIX `std::fs` + `libc::flock`).
  `Options::file_system: Arc<dyn FileSystem>` threaded through all I/O: `LogWriter`, `LogReader`, `Table`,
  `TableBuilder`, `TableCache`, `VersionSet`, and all `lib.rs` functions (flush, compaction, repair, destroy, GC).
  `libc` usage isolated to `PosixFileSystem`. Enables in-memory, encrypted, or cloud storage backends.
  See `include/rocksdb/file_system.h`.
- ✅ **`Db::repair(path, options)`**: Scans the database directory for surviving `.ldb` and `.log` files, converts WALs
  to SSTables via `LogReader` → `Memtable` → `TableBuilder`, extracts metadata from all SSTables, writes a fresh
  `MANIFEST-000001` placing all files at L0, and writes `CURRENT`.  Corrupt records/files are skipped and archived to
  `lost/`.  Port of `db/repair.cc: RepairDB`.
- ✅ **`Options::reuse_logs`**: When `false` (default), replayed WAL data is flushed to an SSTable during `Db::open`
  so the next open doesn't need to replay it.  When `true`, the replayed memtable is kept as the active `mem` and
  the existing WAL is re-opened for appending — faster opens at the cost of replaying the WAL on the next open.
  See `db/db_impl.cc: DBImpl::RecoverLogFile`.
- **`Db::flush_wal(sync: bool)` / `Db::sync_wal()` + manual flush mode**: RocksDB's WAL writer supports two modes.
  In auto-flush mode (default, `manual_flush=false`), `emit_physical_record` flushes the buffer to the OS after every
  `AddRecord` — this is what our `LogWriter` currently does.  In manual-flush mode (`manual_flush=true`), records are
  only appended to the internal buffer; the caller must explicitly call `DB::FlushWAL` to push data to the OS.  Manual
  flush mode enables higher write throughput by batching multiple WAL records into a single OS write.
  `FlushWAL(sync=false)` calls `dest_->Flush()` (push to page cache); `FlushWAL(sync=true)` also `fsync`s.
  `SyncWAL()` is shorthand for `FlushWAL(true)`.  Implementation: add `manual_flush: bool` to `LogWriter`; when
  `true`, skip the `self.dest.flush()` in `emit_physical_record`.  Add `Db::flush_wal(sync)` and `Db::sync_wal()`.
  See `db/log_writer.cc: Writer::AddRecord` (lines 180–183) and `DB::FlushWAL` in `include/rocksdb/db.h`.
- **WAL file recycling** (RocksDB): Reuse old log files from a pool rather than deleting and recreating them, avoiding
  inode churn on slow filesystems. Uses a 12-byte header (standard 7 bytes + `log_number: u32`) so the reader can
  distinguish recycled files. See `db/log_writer.cc` in RocksDB.
- **Ribbon filters** (RocksDB): Drop-in replacement for Bloom filters offering the same false-positive rate in ~30%
  fewer bits. See `util/ribbon_impl.h` in RocksDB.
- **`Db::repair` directory locking** (RocksDB): Acquire the `LOCK` file during repair to prevent concurrent access.
  LevelDB does not lock during repair; RocksDB does (`env_->LockFile` in `db/repair.cc`). Our current implementation
  follows LevelDB and does not lock.
- ✅ **`CompactionFilter` / `CompactionFilterFactory`** (RocksDB): `CompactionFilter` trait
  (`src/compaction_filter.rs`) with `filter(level, key, value, value_type) -> CompactionDecision` returning `Keep`,
  `Remove`, or `ChangeValue(new_val)`.  `CompactionFilterFactory` creates a fresh filter per compaction run
  (allows per-compaction mutable state like timestamps for TTL).  `Options::compaction_filter_factory:
  Option<Arc<dyn CompactionFilterFactory>>` — `None` disables filtering (default).  The filter is called in
  `do_compaction` for the newest visible version of each key, after shadow-key pruning, before writing to the output
  SSTable.  Entries visible to a live snapshot are never filtered.
  See `include/rocksdb/compaction_filter.h`.

---

## Key conventions

- `rustfmt.toml` sets `tab_spaces = 2`.
- `unsafe` blocks must carry a `// SAFETY:` comment; `unsafe fn` must carry a `# Safety` doc section.
- `find_splice_for_level` and `recompute_splice_levels` are free functions (not `SkipList` methods) to avoid a
  simultaneous `&self` / `&mut self.splice` borrow conflict — a pattern to follow whenever a method needs both `&self`
  and `&mut self.field`.
- Methods or functions only used in tests are gated with `#[cfg(test)]`.
- Prefer encoding directly into arena/pre-allocated memory (see `Entry` helpers) over intermediate `Vec` allocations on
  hot paths.
- Run `cargo fmt` on changesets
- Have text and other files hardwrap at 120 character lines