# merutable
An embeddable single-table engine where the data is both row and columnar and the metadata is Iceberg-compatible. One logical table backed by an LSM-tree. L0 SSTables are Parquet tuned as a rowstore (8 KiB pages, PLAIN encoding — optimized for point lookups). L1+ SSTables are Parquet tuned as a columnstore (128 KiB pages, per-column encoding — optimized for scans).

Deletion vectors are Apache Iceberg v3 deletion-vector-v1 Puffin blobs, and each commit writes a native JSON manifest that is a strict superset of Apache Iceberg v2 TableMetadata — losslessly projectable onto a spec-compliant Iceberg table via merutable::iceberg::translate and exportable on demand with db.export_iceberg(target_dir). DuckDB, Spark, Trino, and pyiceberg read the exported view; the Parquet files themselves are the single source of truth — no ETL, no format conversion. Analytical query execution happens in those external engines, not inside merutable — see docs/EXTERNAL_READS.md for the reader contract.
Named after the Meru Parvatha from Indian mythology.
## Why merutable
- One table, two workloads: Write rows through a KV API, query them with SQL. Same Parquet files, zero ETL.
- Open format, no lock-in: Data is Parquet. Metadata is Iceberg. DuckDB, Spark, Trino, Snowflake read it natively via the exported Iceberg manifest — apply the MVCC dedup projection (docs/EXTERNAL_READS.md) once and any v2-aware reader sees consistent results.
- Embed, don't deploy: Link one crate. No server, no cluster, no JVM.
## Architecture
IcebergCatalog manages the single table's native manifest chain (metadata/v{N}.metadata.json + version-hint.text). The manifest is a superset of Iceberg v2 TableMetadata — every field Iceberg needs (table_uuid, last_updated_ms, parent_snapshot_id, sequence_number, schemas[]) is already stored, so projection is a pure function with no data loss. Catalog integration (Hive, Glue, REST, etc.) is an external layer on top — merutable provides the table artifacts, not the catalog service.
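The manifest chain can be resolved the way any reader would. A minimal stdlib-only sketch (illustrative, not merutable's actual code; file names taken from the layout above): read version-hint.text, then load the matching v{N}.metadata.json.

```python
import json
import os
import tempfile

def current_metadata(table_dir: str) -> dict:
    """Resolve the newest manifest via version-hint.text, as described above."""
    meta = os.path.join(table_dir, "metadata")
    with open(os.path.join(meta, "version-hint.text")) as f:
        n = int(f.read().strip())
    with open(os.path.join(meta, f"v{n}.metadata.json")) as f:
        return json.load(f)

# Demo against a throwaway table directory with two manifest versions.
root = tempfile.mkdtemp()
meta = os.path.join(root, "metadata")
os.makedirs(meta)
for n in (1, 2):
    with open(os.path.join(meta, f"v{n}.metadata.json"), "w") as f:
        json.dump({"sequence_number": n}, f)
with open(os.path.join(meta, "version-hint.text"), "w") as f:
    f.write("2")

print(current_metadata(root)["sequence_number"])  # → 2
```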
**Write path:** Under the WAL lock: sequence allocate → WAL append → memtable insert → advance visible_seq. Writes back off when the L0 file count or the immutable-memtable queue crosses configured thresholds (graduated slowdown → hard stop). The memtable is flushed when its size threshold is crossed. Each flush writes a new Parquet SST (fsynced), then commits a new v{N}.metadata.json and installs a new Version via ArcSwap. MeruDB::open spawns configurable background workers (flush_parallelism, compaction_parallelism) so flushes and compactions run continuously without caller intervention.
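The ordering under the WAL lock is the important invariant: a write becomes visible only after it is durable. A toy sketch of that sequencing (illustrative only, not merutable's internals; dicts and lists stand in for the skip-list memtable and the WAL):

```python
import threading

class WritePath:
    def __init__(self):
        self.lock = threading.Lock()
        self.next_seq = 0
        self.wal = []        # stand-in for the append-only WAL
        self.memtable = {}   # stand-in for the skip-list memtable
        self.visible_seq = -1

    def put(self, key, value):
        with self.lock:                          # WAL lock held for all four steps
            seq = self.next_seq                  # 1. sequence allocate
            self.next_seq += 1
            self.wal.append((seq, key, value))   # 2. WAL append (+ fsync)
            self.memtable[key] = (seq, value)    # 3. memtable insert
            self.visible_seq = seq               # 4. advance visible_seq
        return seq

    def get(self, key):
        seq, value = self.memtable[key]
        # Readers only see entries at or below visible_seq.
        return value if seq <= self.visible_seq else None

wp = WritePath()
wp.put("user:1", "alice")
print(wp.get("user:1"))  # → alice
```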
**Read path:** Memtable (active + immutable queue) → L0 files (bloom → KvSparseIndex page skip → scan) → L1..LN (bloom → KvSparseIndex → binary search).
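Because every layer is searched newest-to-oldest and the first hit wins, a memtable entry shadows L0, and L0 shadows deeper levels. A conceptual sketch (plain dicts stand in for skip lists and SSTables; bloom filters and sparse indexes are elided):

```python
def lookup(key, memtables, l0_files, levels):
    """Search newest-to-oldest, mirroring the read path above."""
    for mt in memtables:      # active memtable, then the immutable queue
        if key in mt:
            return mt[key]
    for sst in l0_files:      # L0: overlapping files, newest first
        if key in sst:
            return sst[key]
    for level in levels:      # L1..LN: sorted, non-overlapping runs
        if key in level:
            return level[key]
    return None

memtables = [{"a": "v3"}]
l0_files = [{"a": "v2", "b": "v1"}]
levels = [{"b": "v0", "c": "v0"}]
print(lookup("a", memtables, l0_files, levels))  # → v3 (memtable shadows L0)
print(lookup("c", memtables, l0_files, levels))  # → v0
```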
**Compaction:** Leveled compaction, always full-rewrite — every selected input is fully read, merged, and removed from the manifest. Compactions on disjoint level sets run in parallel (L0→L1 and L2→L3 concurrently) via per-level reservation; the catalog commit phase is serialized by a separate brief lock. L0 is prioritized above the slowdown trigger so deep compactions can't starve flush drainage. Each job's input is capped by max_compaction_bytes to bound per-worker memory. Output SST is fsynced before manifest commit. Fully compacted files are removed from the manifest and queued for deferred deletion; GC waits on BOTH a time-based grace (for external analytical readers like DuckDB / Spark) AND version-pin refcounts (for internal get/scan holding an old Version) — a long integrity scan on tens of GB will never see a file it needs disappear mid-read. The row cache is cleared on every compaction commit. Snapshot-aware version dropping preserves MVCC versions needed by active readers.
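The dual GC gate can be expressed as a single predicate. A sketch of the eligibility check (illustrative, not merutable's code; field names are assumptions):

```python
import time

def gc_eligible(file_entry: dict, now: float, grace_secs: float) -> bool:
    """A retired file is deletable only when BOTH conditions hold, as
    described above: the time-based grace for external readers has elapsed
    AND no internal Version still pins it (refcount zero)."""
    grace_over = now - file_entry["retired_at"] >= grace_secs
    unpinned = file_entry["pins"] == 0
    return grace_over and unpinned

now = time.time()
f = {"retired_at": now - 600, "pins": 1}
print(gc_eligible(f, now, grace_secs=300))  # → False (grace elapsed, still pinned)
f["pins"] = 0
print(gc_eligible(f, now, grace_secs=300))  # → True
```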
**Deletion Vectors (read-side live, write-side dormant):** merutable's format reads and applies any Puffin deletion-vector-v1 blob it finds on a source file (Iceberg v3 compatibility — an external writer can stamp a DV on data merutable owns, and merutable will honor it on the next compaction merge). merutable itself does not produce DVs today, because every compaction is a full rewrite — there is no residual source file to stamp. The write-side API (SnapshotTransaction::add_dv + Puffin v3 encoder + post-union cardinality validation) is implemented and tested; it waits for a partial-compaction caller (RFC #19). This is intentional simplicity, not a missing feature — see docs/SEMANTICS.md.
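Conceptually, a deletion vector is a bitmap of row positions to drop from one data file. A sketch of the read-side application during a compaction merge (a plain set stands in for the roaring bitmap inside a deletion-vector-v1 Puffin blob):

```python
def apply_deletion_vector(rows, deleted_positions):
    """Drop rows whose ordinal position in the source file appears in the
    deletion vector (set stands in for the Puffin roaring bitmap)."""
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]

rows = ["r0", "r1", "r2", "r3"]
dv = {1, 3}  # positions stamped deleted by an external Iceberg v3 writer
print(apply_deletion_vector(rows, dv))  # → ['r0', 'r2']
```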
**Iceberg translation:** Call db.export_iceberg(target_dir) at any time to emit a spec-compliant Iceberg v2 metadata.json under target_dir/metadata/v{N}.metadata.json plus a version-hint.text. Spec compliance is pinned by a CI test that round-trips the emitted JSON through the iceberg-rs TableMetadata deserializer. Manifest-list / manifest Avro file emission is tracked as follow-on work; for inspection, catalog registration, and schema audit the metadata.json alone is sufficient.
## Crate map
Issue #38 collapsed the workspace into a single published crate. Internal modules under crates/merutable/src/ retain the same responsibility split:
| Module | Responsibility |
|---|---|
| `merutable::types` | InternalKey encoding, TableSchema, FieldValue, SeqNum, OpType, MeruError |
| `merutable::wal` | 32 KiB block-format WAL with CRC32, recovery, rotation |
| `merutable::memtable` | crossbeam skip-list memtable, bumpalo arena, rotation, flow control |
| `merutable::parquet` | Parquet SSTable writer/reader, FastLocalBloom, KvSparseIndex, footer KV metadata |
| `merutable::iceberg` | Native JSON manifest + VersionSet (ArcSwap) + DeletionVector (Puffin v3 deletion-vector-v1) + translate module projecting snapshots onto Apache Iceberg v2 TableMetadata. Not a catalog — catalog integration (Hive, Glue, REST) is external. |
| `merutable::store` | Pluggable object store: local FS, S3, LRU disk cache |
| `merutable::engine` | FlushJob, CompactionJob, MergingIterator, RowCache, read/write paths |
| `merutable::sql` | Change-feed cursor + DataFusion TableProvider (feature `sql`, on by default) |
| `merutable::replica` | Scale-out RO replica with hot-swap rebase (feature `replica`, depends on `sql`) |
| `merutable` (root) | Public embedding API: MeruDB, OpenOptions, ScanIterator, MirrorConfig |
The PyO3 bindings live in crates/merutable-python/ (structurally separate because Python extensions must be a cdylib).
## Storage tuning
The LSM tree uses level-aware Parquet tuning to serve both OLTP and OLAP workloads:
| Level | Row group | Page size | Encoding | Tuning biased for |
|---|---|---|---|---|
| L0 | 4 MiB | 8 KiB | PLAIN (all columns) | Rowstore — point lookups, memtable flush |
| L1 | 32 MiB | 32 KiB | Per-column (see below) | Warm — transitional |
| L2+ | 128 MiB | 128 KiB | Per-column (see below) | Columnstore — analytics scans |
Per-column encoding at L1+:
- `_merutable_ikey` (lookup key): PLAIN — zero-overhead decode for point lookups
- Int32/Int64: DELTA_BINARY_PACKED — optimal for sorted integer columns
- Float/Double: BYTE_STREAM_SPLIT — IEEE 754 byte-transposition
- ByteArray (strings): RLE_DICTIONARY — high compression for categorical data
- Boolean: RLE
L0 files carry both _merutable_ikey + _merutable_value (postcard blob for KV fast-path) and typed columns. L1+ files drop the blob and store only _merutable_ikey + typed columns — the analytical format external engines read.
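The level-aware choices above reduce to a simple selection function. A sketch that mirrors the table (the constants come from the table; the function itself is illustrative, not merutable's code):

```python
def parquet_tuning(level: int, column_type: str) -> dict:
    """Pick row-group size, page size, and encoding per level, per the
    storage-tuning table above (illustrative only)."""
    if level == 0:
        # Rowstore: everything PLAIN, small pages for point lookups.
        return {"row_group": "4 MiB", "page": "8 KiB", "encoding": "PLAIN"}
    encoding = {
        "ikey": "PLAIN",                # zero-overhead point-lookup decode
        "int": "DELTA_BINARY_PACKED",   # sorted integer columns
        "float": "BYTE_STREAM_SPLIT",   # IEEE 754 byte-transposition
        "string": "RLE_DICTIONARY",     # categorical data
        "bool": "RLE",
    }[column_type]
    if level == 1:
        return {"row_group": "32 MiB", "page": "32 KiB", "encoding": encoding}
    return {"row_group": "128 MiB", "page": "128 KiB", "encoding": encoding}

print(parquet_tuning(0, "int")["encoding"])  # → PLAIN
print(parquet_tuning(2, "int")["encoding"])  # → DELTA_BINARY_PACKED
```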
## Quick start
```rust
// Illustrative sketch: the original snippet lost its import paths, so the
// exact module paths and signatures below are assumptions based on the
// public API names (MeruDB, OpenOptions, FieldValue); see the crate docs.
use merutable::{MeruDB, OpenOptions};
use merutable::types::FieldValue;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Open (or create) the single logical table.
    let db = MeruDB::open("/tmp/events", OpenOptions::default()).await?;

    // KV write path: rows go through the WAL and memtable.
    db.put("user:1", vec![("name", FieldValue::from("alice"))]).await?;

    // Export an Iceberg v2 view for external engines (DuckDB, Spark, Trino).
    db.export_iceberg("/tmp/events-iceberg")?;
    Ok(())
}
```
## Interactive notebook
The lab/lab_merutable.ipynb notebook is a live, runnable showcase — open it on GitHub to see pre-rendered outputs, or run it locally for the full interactive experience:
```bash
# Command reconstructed from the notebook path above; adjust to your setup.
cd lab && jupyter lab lab_merutable.ipynb
```
The notebook covers: write/flush/inspect, compaction with Deletion Vectors, external analytical reads from DuckDB (SQL queries on merutable's Parquet files — zero ETL), acceleration structures (bloom filter + KvSparseIndex), and write/read performance benchmarks.
## Python bindings
merutable ships a PyO3 crate (merutable-python) that exposes the full API to Python:
```python
# Illustrative sketch: the original snippet lost its code lines, leaving only
# the comments. The method names below (open, put, get, put_batch, flush,
# compact, stats, open_replica, refresh, export_iceberg, close) are
# assumptions reconstructed from those comments; check the merutable-python
# docs for the exact signatures.
import merutable

db = merutable.MeruDB.open("/tmp/events")

db.put("user:1", {"id": 1, "name": "alice", "score": 95.5, "active": True})
row = db.get("user:1")  # {'id': 1, 'name': 'alice', 'score': 95.5, 'active': True}

# Batch writes — single WAL sync per batch, 100-1000× faster than individual puts
db.put_batch({f"user:{i}": {"id": i} for i in range(2, 1000)})

db.flush()    # → L0 Parquet file + new v{N}.metadata.json
db.compact()  # → L1 columnstore + Deletion Vectors (Puffin v3)
db.stats()    # includes cache hit/miss counters

# External analytical reads: register merutable's Iceberg metadata with DuckDB.
# Always go through the Iceberg manifest — raw-glob `read_parquet` picks
# up files still in the GC grace window and skips the MVCC dedup
# projection, producing wrong answers on any non-trivial workload.
# See docs/EXTERNAL_READS.md for the canonical projection.
db.export_iceberg("/tmp/events-iceberg")  # metadata.json + version-hint

# Read-only replica — opens same catalog, no WAL, no writes
replica = merutable.MeruDB.open_replica("/tmp/events")
replica.get("user:1")  # reads from Parquet files
replica.refresh()      # picks up new snapshots from the primary

# Hand the current snapshot to any Iceberg v2 reader
db.export_iceberg("/tmp/events-iceberg")  # writes metadata/v{N}.metadata.json
# Then, e.g.:
# from pyiceberg.table import StaticTable
# t = StaticTable.from_metadata("/tmp/events-iceberg/metadata/v1.metadata.json")
# t.schema()  # ← full Iceberg v2 schema, table_uuid, snapshot chain

# Graceful shutdown — flush + fsync + seal
db.close()  # writes are rejected after this; reads still work
```
Build with maturin:
```bash
# Assumes maturin is installed (pip install maturin).
cd crates/merutable-python && maturin develop --release
```
## License
Apache-2.0