skade
Skaði, the winter queen — she keeps the icebergs in order.
This repo is Skade: a fast, pure-Rust thin layer over iceberg-rust,
backed by a ridiculously fast skade-katalog.
- skade (skade/): the data plane. Arrow RecordBatch in, SQL out; gatling multi-core ingest, zero-copy recast, identity-partitioned + compressed writes, V3 tables.
- skade-katalog: the catalog. Pure-Rust, redb-backed, single-file ACID iceberg::Catalog. No SQL, no C deps.
skade-katalog

A pure-Rust iceberg::Catalog with cross-table transactions. A lock-free static search tree (Ragnar STree64) out front, continuously regenerated from a redb ACID backend. In-process — no network hop, no JVM, no JSON round-trip.
- table_exists: 2.0M ops/s, ~492× Nessie · ~677× Polaris (0.31 µs p50). Every core reads the front lock-free.
- Ingest saturates the box: 100% CPU, fully multicore. All cores encode Parquet in parallel; a gatling no-barrier pool keeps the writes overlapping, even over S3.
- ACID. Every mutation is one redb WriteTransaction. Atomic multi-table commits via RedbCatalog::atomic_release.
Full leaderboard (skade-katalog vs Nessie vs Polaris, every storage variant, all four capabilities) → Benchmarks.
Status
The Iceberg Catalog trait is fully implemented and tested against iceberg = "0.9.1". One gap: schema evolution (the upstream Transaction actions aren't public until iceberg-rust 0.10). See Known shortcomings.
skade — the data-plane companion crate
This repo also ships skade ("winter queen" — Skaði): fast Iceberg
table writing/reading + ergonomic DataFusion SQL on top of RedbCatalog. One
directory = catalog (catalog.redb) + warehouse; Arrow RecordBatch in, SQL
out (wh.sql("SELECT … FROM a JOIN b …")). It packages the writer stack,
read-to-Arrow primitive, schema bridges (incl. the unsigned widen/unwiden
reinterprets znippy needs), and the windowed-Parquet bulk-ingest helpers that
were proven in bench/. Like bench/, it is a detached cargo
workspace (skade/), so this crate's manifest and lockfile stay untouched.
See skade/README.md.
Read & write paths
Every call resolves left-to-right through these layers, falling through only on a
miss. Latest-state reads ride the L1 pointer mirror; time-travel
(snapshot-id) reads ride the Ragnar STree64 index. redb is always the source of
truth behind them — the in-memory layers are keyed by immutable, content-addressed
identifiers, so they can only be evicted, never stale.
| layer | key → value | type | populated |
|---|---|---|---|
| L1 pointer mirror | table_key → metadata_location | ArcSwap<imbl::HashMap<String, Arc<str>, foldhash>> | full scan at open, then write-through |
| L1.5 handle cache | metadata_location → Table | moka::Cache<String, Table> (capacity-bounded) | on load_table miss |
| L0 metadata cache | metadata_location → TableMetadata | moka::Cache<String, Arc<TableMetadata>> (byte-bounded, single-flight) | on metadata miss |
| Ragnar static index | snapshot_id → (table_key, metadata_location) | ArcSwapOption<STree64> | warm-built at open, rebuilt by bg compactor (≥1024 commits) |
| redb source of truth | tables · commits · namespaces · namespace_props · meta | Mutex<redb::Database> (ACID) | every committed write |
The same routing in full — including the time-travel, write, and redb-direct paths the diagram leaves out:
LATEST READS
  table_exists(id) ───────► L1 mirror ──hit──► true
                               └─miss─► redb `tables` ─────► bool
  resolve_metadata(id) ───► L1 mirror ──► loc ──► L0 cache ──hit──► Arc<TableMetadata>
  (skips Table::build)         └─miss─► redb         └─miss─► FileIO read + parse JSON
  load_table(id) ─────────► L1 mirror ──► loc ──► L1.5 cache ──hit──► Table (~100 ns clone)
                               └─miss─► redb         └─miss─► L0 cache ──► Table::build() + insert

TIME-TRAVEL READS (by snapshot_id: i64)
  load_table_at / resolve_metadata_at(id, sid)
      ──► Ragnar STree64 ──hit──► loc ──► L1.5 / L0 ──► Table / Arc<TableMetadata>
              └─miss (sid above cutoff)─► redb `commits` live tail ──► loc ──► …
  resolve_many(id, [sid; N])
      ──► Ragnar batch probe (1 pipelined pass) ──► redb `commits` (1 txn, misses only) ──► L0 per loc

WRITES (create · register · update · drop · rename)
  …──► FileIO write …/<uuid>.metadata.json ──► redb WriteTransaction { tables CAS · commits log · meta++ }
      ──commit──► L1 mirror write-through (insert / remove) ──► maybe rebuild Ragnar (background)
  update_table ──► group-commit: N concurrent commits coalesce into 1 redb txn / 1 fsync ← commit-burst lever

REDB-DIRECT (no cache layer)
  list_namespaces · get_namespace · namespace_exists · list_tables ──► redb read txn (range scan)
  create / update / drop_namespace ──────────────────────────────────► redb WriteTransaction
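The latest-read fall-through above can be sketched with standard-library types only. This is an illustration under stated assumptions, not the crate's implementation: `LayeredCatalog`, `mirror`, `source`, and `evict_mirror` are hypothetical names, a `RwLock<HashMap>` stands in for the lock-free `ArcSwap` mirror, and a second map stands in for the redb `tables` keyspace.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical two-layer lookup: an in-memory mirror in front of a
/// source of truth. Neither field uses the real skade-katalog types.
#[derive(Default)]
pub struct LayeredCatalog {
    /// Stand-in for the L1 pointer mirror: table_key -> metadata_location.
    mirror: RwLock<HashMap<String, Arc<str>>>,
    /// Stand-in for the redb `tables` keyspace.
    source: RwLock<HashMap<String, Arc<str>>>,
}

impl LayeredCatalog {
    /// Write path: commit to the source of truth first, then write through
    /// to the mirror so later reads hit the fast layer.
    pub fn commit(&self, table_key: &str, metadata_location: &str) {
        let loc: Arc<str> = Arc::from(metadata_location);
        self.source
            .write()
            .unwrap()
            .insert(table_key.to_string(), loc.clone());
        self.mirror
            .write()
            .unwrap()
            .insert(table_key.to_string(), loc);
    }

    /// Latest read: try the mirror, fall through to the source on a miss.
    pub fn resolve(&self, table_key: &str) -> Option<Arc<str>> {
        let cached = self.mirror.read().unwrap().get(table_key).cloned();
        if let Some(loc) = cached {
            return Some(loc);
        }
        self.source.read().unwrap().get(table_key).cloned()
    }

    /// Existence check rides the same fall-through.
    pub fn table_exists(&self, table_key: &str) -> bool {
        self.resolve(table_key).is_some()
    }

    /// Dropping a mirror entry only costs speed; the source still answers.
    pub fn evict_mirror(&self, table_key: &str) {
        self.mirror.write().unwrap().remove(table_key);
    }
}
```

Evicting an entry from the mirror changes only which layer answers, not the answer itself, which is the property the layered design relies on.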
Benchmarks
All numbers below are auto-filled by nornir docs render from the latest
nornir bench run (machine/cores/version in each header). Don't hand-edit inside
the generated regions — re-run the bench instead. Reproduce with the harness in
bench/ (cargo run --bin bench-containers -- up in bench/, then
nornir bench run skade-katalog from workspace_skade/).
Catalog read RPC — table_exists (storage-free, runs on every backend)
table_exists is a pure catalog RPC (no object storage), which makes it the
apples-to-apples comparison; Nessie/Polaris go through the same iceberg::Catalog REST client.
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | ops_sec | p50_us | p90_us | p99_us | mean_us | min_us |
|---|---|---|---|---|---|---|
| nessie_table_exists | 0.00476M | 194.89 | 249.07 | 436.37 | 209.60 | 148.78 |
| polaris_table_exists | 0.00206M | 437.06 | 599.43 | 1,176 | 483.85 | 341.26 |
| skade_katalog_embedded_table_exists | 2.03990M | 0.33 | 0.38 | 0.46 | 0.36 | 0.32 |
| skade_katalog_rest_table_exists | 0.02006M | 51.19 | 56.73 | 68.77 | 49.60 | 40.85 |
Embedded read fast paths (nornir-only)
resolve_metadata skips Table::build() (the lock-free L1+L0 floor);
load_table is the full Catalog trait path after the L1.5 handle cache.
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | ops_sec | p50_us | p90_us | p99_us | mean_us | min_us |
|---|---|---|---|---|---|---|
| skade_katalog_embedded_load_table | 0.943M | 0.83 | 0.86 | 4.99 | 0.91 | 0.79 |
| skade_katalog_embedded_resolve_metadata | 1.141M | 0.64 | 0.67 | 4.76 | 0.73 | 0.61 |
Data plane — single-writer / many-processor ingest
All cores encode Parquet; one writer streams + commits (data-pipe).
Backend-agnostic and storage-bound (shared client-side Parquet + S3), so rows/s converge
across backends — this lifts all boats, it is not a catalog lever. Nessie
runs the same shared RustFS S3 warehouse. Scan verifies the round-trip. The
nornir local-FS row runs two file destinations — _nvme (PCIe-4.0 NVMe) and
_ram (/dev/shm tmpfs) — to expose the pure storage floor.
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | rows_per_sec | files |
|---|---|---|
| data_pipe_nessie | 9.79M | 10 |
| data_pipe_skade_nvme | 34.24M | 10 |
| data_pipe_skade_ram | 31.13M | 10 |
| data_pipe_skade_s3 | 14.06M | 10 |
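The single-writer / many-processor shape can be sketched with scoped threads and a channel. This is a minimal sketch, not the skade pipeline: `EncodedChunk`, `encode`, and `ingest` are hypothetical names, and turning rows into little-endian bytes stands in for Parquet encoding.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for one encoded Parquet file.
pub struct EncodedChunk {
    pub chunk_id: usize,
    pub bytes: Vec<u8>,
}

/// Stand-in for the CPU-heavy encode step: 8 little-endian bytes per row.
fn encode(chunk_id: usize, rows: &[u64]) -> EncodedChunk {
    let bytes = rows.iter().flat_map(|r| r.to_le_bytes()).collect();
    EncodedChunk { chunk_id, bytes }
}

/// Encode every batch on its own worker thread while the calling thread
/// acts as the single writer. Returns (files written, total bytes).
pub fn ingest(batches: Vec<Vec<u64>>) -> (usize, usize) {
    let (tx, rx) = mpsc::channel::<EncodedChunk>();
    thread::scope(|s| {
        for (chunk_id, rows) in batches.iter().enumerate() {
            let tx = tx.clone();
            s.spawn(move || {
                // Workers never wait on each other; each sends when done.
                tx.send(encode(chunk_id, rows)).expect("writer hung up");
            });
        }
        // Drop the original sender so the loop below ends once every
        // worker has finished and dropped its clone.
        drop(tx);

        let mut files = 0;
        let mut total_bytes = 0;
        for chunk in rx {
            // Single writer: chunks are consumed in arrival order.
            files += 1;
            total_bytes += chunk.bytes.len();
        }
        (files, total_bytes)
    })
}
```

Because the writer consumes chunks in arrival order rather than waiting for all workers, encoding and writing overlap; a real pipeline would also bound the channel to cap memory.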
Commit-bursty — the catalog lever (group-commit)
Many small concurrent metadata commits — where skade-katalog coalesces commits into one redb txn/fsync. This is where the embedded catalog genuinely pulls ahead of a REST/JVM catalog (throughput and tail latency).
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | commits_per_sec | p50_us | p90_us | p99_us | mean_us | min_us |
|---|---|---|---|---|---|---|
| commit_burst_nessie | 1,032 | 5,945 | 48,676 | 308,507 | 24,391 | 2,444 |
| commit_burst_skade_nvme | 3,982 | 7,986 | 9,400 | 10,104 | 8,030 | 367.31 |
| commit_burst_skade_ram | 5,600 | 5,786 | 6,921 | 7,744 | 5,703 | 349.89 |
| commit_burst_skade_s3 | 3,081 | 10,328 | 11,803 | 13,703 | 10,319 | 2,292 |
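The coalescing idea behind group-commit can be sketched as a committer that blocks for one pending commit and then takes everything else already queued. This is a sketch under assumptions, not the crate's code: `PendingCommit`, `Store`, `submit`, `next_batch`, `apply_batch`, and `run_committer` are hypothetical names, and incrementing `txns` stands in for one redb transaction and its fsync.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical pending commit plus a channel to acknowledge the caller.
pub struct PendingCommit {
    pub table_key: String,
    pub metadata_location: String,
    /// Receives the id of the transaction that carried this commit.
    pub ack: Sender<u64>,
}

/// Stand-in for the durable store; `txns` counts transactions (fsyncs).
#[derive(Default)]
pub struct Store {
    pub tables: HashMap<String, String>,
    pub txns: u64,
}

/// Queue a commit and return the receiver the caller waits on.
pub fn submit(tx: &Sender<PendingCommit>, key: &str, loc: &str) -> Receiver<u64> {
    let (ack, done) = mpsc::channel();
    tx.send(PendingCommit {
        table_key: key.to_string(),
        metadata_location: loc.to_string(),
        ack,
    })
    .expect("committer stopped");
    done
}

/// Block for one commit, then drain whatever else is already queued.
pub fn next_batch(rx: &Receiver<PendingCommit>) -> Option<Vec<PendingCommit>> {
    let first = rx.recv().ok()?;
    let mut batch = vec![first];
    while let Ok(more) = rx.try_recv() {
        batch.push(more);
    }
    Some(batch)
}

/// Apply the whole batch as one transaction, then acknowledge every caller.
pub fn apply_batch(store: &mut Store, batch: Vec<PendingCommit>) {
    store.txns += 1;
    for c in &batch {
        store
            .tables
            .insert(c.table_key.clone(), c.metadata_location.clone());
    }
    for c in batch {
        // A caller that stopped waiting is not an error for the committer.
        let _ = c.ack.send(store.txns);
    }
}

/// Committer loop: runs until every sender has been dropped.
pub fn run_committer(rx: Receiver<PendingCommit>, store: &mut Store) {
    while let Some(batch) = next_batch(&rx) {
        apply_batch(store, batch);
    }
}
```

Under a burst the queue is rarely empty when the committer wakes, so batch size grows with load and per-commit fsync cost falls.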
TPC-H — analytical SQL (all 22 queries via DataFusion)
The full 8-table TPC-H schema loaded into Iceberg tables, queried through
DataFusion (iceberg-datafusion) — per-query latency for all 22 canonical
queries over the catalog's scan/manifest path. Authentic tpchgen data;
TPCH_SF sets the scale (small by default). Answer values aren't validated here
(that needs SF=1) — this measures plan + scan + execute latency end-to-end.
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | max_us | mean_us | min_us | ops_sec | p50_us | p90_us | p999_us | p99_us | rows |
|---|---|---|---|---|---|---|---|---|---|
| tpch_q01 | 25,604 | 21,208 | 18,551 | 47.15 | 20,886 | 25,604 | 25,604 | 25,604 | 4 |
| tpch_q02 | 44,228 | 42,709 | 41,896 | 23.41 | 42,435 | 44,228 | 44,228 | 44,228 | 4 |
| tpch_q03 | 20,312 | 19,356 | 17,786 | 51.66 | 19,494 | 20,312 | 20,312 | 20,312 | 138 |
| tpch_q04 | 15,695 | 15,192 | 14,492 | 65.82 | 15,115 | 15,695 | 15,695 | 15,695 | 5 |
| tpch_q05 | 51,879 | 51,056 | 50,395 | 19.59 | 50,957 | 51,879 | 51,879 | 51,879 | 5 |
| tpch_q06 | 4,139 | 3,850 | 3,666 | 259.72 | 3,817 | 4,139 | 4,139 | 4,139 | 1 |
| tpch_q07 | 39,286 | 37,982 | 36,533 | 26.33 | 37,649 | 39,286 | 39,286 | 39,286 | 4 |
| tpch_q08 | 60,772 | 59,150 | 57,838 | 16.91 | 58,860 | 60,772 | 60,772 | 60,772 | 2 |
| tpch_q09 | 60,964 | 57,787 | 53,757 | 17.31 | 57,911 | 60,964 | 60,964 | 60,964 | 173 |
| tpch_q10 | 32,870 | 32,594 | 32,327 | 30.68 | 32,610 | 32,870 | 32,870 | 32,870 | 399 |
| tpch_q11 | 16,629 | 15,518 | 14,823 | 64.44 | 15,137 | 16,629 | 16,629 | 16,629 | 359 |
| tpch_q12 | 19,659 | 19,124 | 18,247 | 52.29 | 19,326 | 19,659 | 19,659 | 19,659 | 2 |
| tpch_q13 | 14,732 | 14,137 | 13,681 | 70.74 | 13,998 | 14,732 | 14,732 | 14,732 | 33 |
| tpch_q14 | 7,848 | 7,429 | 7,126 | 134.62 | 7,398 | 7,848 | 7,848 | 7,848 | 1 |
| tpch_q15 | 18,803 | 17,261 | 16,595 | 57.94 | 16,784 | 18,803 | 18,803 | 18,803 | 1 |
| tpch_q16 | 25,185 | 23,537 | 22,865 | 42.49 | 23,175 | 25,185 | 25,185 | 25,185 | 296 |
| tpch_q17 | 28,684 | 27,725 | 26,519 | 36.07 | 27,962 | 28,684 | 28,684 | 28,684 | 1 |
| tpch_q18 | 38,520 | 37,450 | 36,285 | 26.70 | 37,376 | 38,520 | 38,520 | 38,520 | 2 |
| tpch_q19 | 14,961 | 13,854 | 13,308 | 72.18 | 13,566 | 14,961 | 14,961 | 14,961 | 1 |
| tpch_q20 | 28,227 | 27,712 | 27,216 | 36.09 | 27,738 | 28,227 | 28,227 | 28,227 | 1 |
| tpch_q21 | 46,422 | 45,390 | 43,762 | 22.03 | 46,199 | 46,422 | 46,422 | 46,422 | 1 |
| tpch_q22 | 14,853 | 13,396 | 12,107 | 74.65 | 13,574 | 14,853 | 14,853 | 14,853 | 7 |
TPC-H large warehouse (opt-in, big iron)
A full-scale run: ingest the 8-table warehouse at a large scale factor across all
cores (partitioned parallel ingest), then run all 22 queries over it — reporting
build vs query time, ingest rate, and the slowest query. Runs by default at a
moderate scale; TPCH_WAREHOUSE_SF tunes it (default 10 ≈ 60 M lineitem rows /
a few min; 100-200 for a 10-60 min big-iron workout), TPCH_WAREHOUSE=0 disables
it, TPCH_WAREHOUSE_PARTS sets ingest partitions (default = all cores).
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | build_s | ingest_rows | ingest_rows_per_sec | parts | query_s | result_rows | scale_factor | slowest_query | slowest_query_s | total_s |
|---|---|---|---|---|---|---|---|---|---|---|
| tpch_warehouse | 3.42 | 86.59M | 25.33M | 32 | 21.33 | 534,307 | 10 | 17 | 3.09 | 24.74 |
TPC-H across catalogs × storage (skade-katalog vs Nessie vs Polaris)
The same SF warehouse built and all 22 queries run over each (catalog × storage)
combination through DataFusion — the analytical-SQL path compared apples-to-apples
(not just the table_exists RPC). The matrix covers every combo that's possible:
nornir on file-NVMe, file-RAM, and S3; Nessie S3-only (it rejects a local
file warehouse); Polaris on file (its S3 credential vending 301s against the
RustFS endpoint — see bench/src/containers.rs). The REST targets need their
container + the shared RustFS S3 warehouse (bench-containers up all), else they
skip. TPCH_COMPARE_SF sets the scale (default 1.0 ≈ 8.66M rows).
Parallel scan patch: stock iceberg-datafusion 0.9.1 scans a table as a single DataFusion partition (one serial Parquet-decode stream), so on a 32-core box only ~1 core works and queries are ~7-8× slower than they should be. The repo vendors a patched iceberg-datafusion (bench/vendor/, relative-path [patch] used by both bench/ and skade/) that exposes one partition per file, so DataFusion decodes files across all cores. SF100 FILE query dropped ~1660s → ~197s (8.4×). These numbers reflect that patch; published skade builds get stock 0.9.1 until the patch is upstreamed.
At SF100 (866M rows) both S3-backed paths (skade_s3, nessie_s3) fail with
hyper: connection closed before message completed while finishing the Parquet
writer. This is an S3 write-robustness limit under sustained heavy PUT load
(rust-s3 / RustFS), not a catalog issue; the FILE-backed paths complete cleanly.
v0.4.11 · oden · 32 cores · 2026-06-12
| workload | build_s | q01_s | q02_s | q03_s | q04_s | q05_s | q06_s | q07_s | q08_s | q09_s | q10_s | q11_s | q12_s | q13_s | q14_s | q15_s | q16_s | q17_s | q18_s | q19_s | q20_s | q21_s | q22_s | query_s | rows | scale_factor | slowest_query | slowest_query_s | total_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tpch_cmp_nessie_s3 | 1.86 | 0.48 | 0.43 | 0.67 | 0.61 | 0.81 | 0.43 | 0.75 | 0.86 | 1.06 | 0.69 | 0.35 | 0.63 | 0.25 | 0.47 | 0.89 | 0.22 | 1.04 | 1.23 | 0.49 | 0.63 | 1.61 | 0.25 | 14.80 | 8.66M | 1 | 21 | 1.61 | 16.67 |
| tpch_cmp_polaris_file | 0.86 | 0.07 | 0.12 | 0.08 | 0.05 | 0.24 | 0.03 | 0.17 | 0.26 | 0.35 | 0.09 | 0.09 | 0.08 | 0.07 | 0.04 | 0.06 | 0.06 | 0.23 | 0.26 | 0.06 | 0.07 | 0.24 | 0.05 | 2.77 | 8.66M | 1 | 9 | 0.35 | 3.63 |
| tpch_cmp_skade_file_nvme | 0.46 | 0.06 | 0.11 | 0.06 | 0.03 | 0.23 | 0.02 | 0.17 | 0.27 | 0.36 | 0.08 | 0.07 | 0.07 | 0.06 | 0.02 | 0.04 | 0.05 | 0.21 | 0.24 | 0.05 | 0.07 | 0.22 | 0.03 | 2.51 | 8.66M | 1 | 9 | 0.36 | 2.97 |
| tpch_cmp_skade_file_ram | 0.43 | 0.06 | 0.11 | 0.06 | 0.04 | 0.21 | 0.02 | 0.16 | 0.26 | 0.34 | 0.08 | 0.07 | 0.07 | 0.06 | 0.02 | 0.04 | 0.04 | 0.21 | 0.22 | 0.05 | 0.06 | 0.23 | 0.03 | 2.46 | 8.66M | 1 | 9 | 0.34 | 2.89 |
| tpch_cmp_skade_s3 | 0.64 | 0.14 | 0.19 | 0.18 | 0.14 | 0.35 | 0.09 | 0.25 | 0.38 | 0.45 | 0.22 | 0.19 | 0.19 | 0.09 | 0.11 | 0.20 | 0.09 | 0.33 | 0.39 | 0.11 | 0.17 | 0.45 | 0.08 | 4.79 | 8.66M | 1 | 9 | 0.45 | 5.43 |
OSM GeoParquet ingest (data plane)
End-to-end ingest of authentic OSM data: a nodes.parquet produced by the
katana-osm / osm2geoparquet converter (OSM .osm/.bz2/.pbf → GeoParquet)
is read, its schema mapped to Iceberg, and ingested into a fresh RedbCatalog
table via the single-writer / many-processor pipeline, then scanned back to
verify. NVMe vs RAM destinations isolate storage cost. Point OSM_GEOPARQUET at
the file (OSM_MAX_ROWS caps rows); the osm_ingest_* benchers record a failure
if it's unset, like the container-backed targets.
(no bench results)
ZSTD decode — zstd-sys-rs vs the zstd crate
Aggregate (all-core) decompression throughput on a real OSM corpus (WKB geometry
bytes from the converted Europe extract, ~parquet-page-sized frames), decoding
the same frames two ways: the stock zstd crate (one-shot, allocates per frame)
and zstd-sys-rs's zero-copy path (a reused ZSTD_DCtx + reused output buffer).
Both link the same static libzstd 1.5.7, so this validates the bindings and the
zero-copy API on real data rather than chasing a codec difference. Needs
OSM_GEOPARQUET (ZSTD_CHUNK_KB sets frame size).
(no bench results)
Quickstart
use HashMap;
use Arc;
use LocalFsStorageFactory;
use ;
use ;
use RedbCatalogBuilder;
# async
Multi-table atomic commits
Iceberg's per-table Transaction::commit only gives single-table atomicity.
When a release logically spans several tables (e.g. publishing bench_runs,
dep_graph, and components from a single CI run), readers can otherwise
observe a half-published state.
redb gives every write transaction global atomicity across all of its
tables. RedbCatalog::atomic_release(commits) exploits this: each
TableCommit is staged (metadata blobs written to object storage), then a
single redb write transaction performs an optimistic-concurrency check on
every base pointer and either flips them all or none.
            prepare metadata files    single redb txn
TableCommit ──┐                    ┌─ check base = current ┐
TableCommit ──┼─► fileio.write_to ─┤  …for every table…    ├─► commit
TableCommit ──┘                    └─ swap pointer         ┘
If iceberg-rust later exposes TableCommit construction publicly, the
ergonomic story improves. Until then, atomic_release is most useful when
you build commits via crates that have direct access to internal helpers
(e.g. a nornir release driver).
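The check-then-swap step inside the single transaction can be sketched as follows. This is a minimal sketch, not `RedbCatalog::atomic_release` itself: `StagedCommit`, `Conflict`, `PointerStore`, and `release_all` are hypothetical names, and a `Mutex<HashMap>` stands in for one redb write transaction.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical staged commit: the pointer the writer started from and the
/// pointer it wants to publish.
pub struct StagedCommit {
    pub table_key: String,
    /// Expected current metadata location (None = table must not exist yet).
    pub base: Option<String>,
    pub new_location: String,
}

/// Returned when some table's pointer moved since the commit was staged.
#[derive(Debug, PartialEq)]
pub struct Conflict {
    pub table_key: String,
}

/// Stand-in for the `tables` keyspace; the mutex plays the role of the
/// single write transaction.
#[derive(Default)]
pub struct PointerStore {
    tables: Mutex<HashMap<String, String>>,
}

impl PointerStore {
    pub fn current(&self, table_key: &str) -> Option<String> {
        self.tables.lock().unwrap().get(table_key).cloned()
    }

    /// Flip every pointer or none of them.
    pub fn release_all(&self, commits: &[StagedCommit]) -> Result<(), Conflict> {
        let mut tables = self.tables.lock().unwrap();

        // Phase 1: optimistic-concurrency check on every base pointer.
        // Nothing has been written yet, so returning here changes nothing.
        for c in commits {
            if tables.get(&c.table_key) != c.base.as_ref() {
                return Err(Conflict {
                    table_key: c.table_key.clone(),
                });
            }
        }

        // Phase 2: all checks passed, so swap every pointer.
        for c in commits {
            tables.insert(c.table_key.clone(), c.new_location.clone());
        }
        Ok(())
    }
}
```

Running every check before the first write is what keeps a failed release from leaving some tables advanced and others not.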
Storage layout
Inside the redb file:
| keyspace | key | value |
|---|---|---|
| namespaces | <catalog>\x1f<ns_path> | empty marker |
| namespace_props | <catalog>\x1f<ns_path>\x1f<prop> | property value |
| tables | <catalog>\x1f<ns_path>\x1f<table> | metadata location |
Where <ns_path> is the dot-joined namespace ident ("a.b.c"). One redb
file may host multiple logical catalogs by reusing different <catalog>
prefixes; in practice you'll point each catalog at its own file.
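The key shapes in the table can be sketched as plain string builders. The function names below are hypothetical, and the `BTreeMap` prefix scan is only an assumption about how an ordered keyspace could list one namespace's tables; the crate's actual encoding code may differ.

```rust
use std::collections::BTreeMap;

/// ASCII unit separator between key components, as in the table above.
const SEP: char = '\x1f';

/// <catalog>\x1f<ns_path>, with the namespace ident dot-joined.
pub fn namespace_key(catalog: &str, ns: &[&str]) -> String {
    format!("{}{}{}", catalog, SEP, ns.join("."))
}

/// <catalog>\x1f<ns_path>\x1f<prop>
pub fn namespace_prop_key(catalog: &str, ns: &[&str], prop: &str) -> String {
    format!("{}{}{}", namespace_key(catalog, ns), SEP, prop)
}

/// <catalog>\x1f<ns_path>\x1f<table>
pub fn table_key(catalog: &str, ns: &[&str], table: &str) -> String {
    format!("{}{}{}", namespace_key(catalog, ns), SEP, table)
}

/// List the table names directly under one namespace with a prefix scan
/// over an ordered map (a stand-in for an ordered key-value table).
pub fn list_tables<'a>(
    index: &'a BTreeMap<String, String>,
    catalog: &str,
    ns: &[&str],
) -> Vec<&'a str> {
    let prefix = format!("{}{}", namespace_key(catalog, ns), SEP);
    index
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| &k[prefix.len()..])
        .collect()
}
```

Because the separator sorts below `.`, the tables of namespace `a.b` form one contiguous key range that excludes those of `a.b.c`, and a different `<catalog>` prefix never overlaps.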
Known shortcomings (0.1.3)
- Schema evolution is not callable yet. iceberg-rust 0.9.1 seals the TransactionAction trait (pub(crate)), so downstream code can't construct AddSchema/SetCurrentSchema. The on-disk metadata already supports schema evolution (only the writer API is sealed), so it works here unchanged once upstream exposes it (0.10). These Transaction actions do work today: update_table_properties, fast_append, replace_sort_order, update_location, update_statistics, upgrade_format_version.
- No orphan-file garbage collection. A commit that fails after writing metadata blobs to object storage but before the redb pointer-swap commits leaves those blobs behind: harmless but unreferenced. Reap them with Iceberg's standard orphan-file procedures; there is no built-in sweeper.
License
Licensed under Apache License, Version 2.0, matching the upstream Apache Iceberg project.