skade-katalog 0.1.8

The katalog under skade: an embedded, single-file ACID Apache Iceberg catalog (redb) with time-travel snapshots and atomic multi-table release commits — the Norns recording the world's icebergs.

skade

Skaði, the winter queen — she keeps the icebergs in order.

This repo is Skade: a fast, pure-Rust thin layer over iceberg-rust, backed by a ridiculously fast skade-katalog.

  • skade — the data plane: Arrow RecordBatch in, SQL out; gatling multi-core ingest, zero-copy recast, identity-partitioned + compressed writes, V3 tables. (skade/)
  • skade-katalog — the catalog: pure-Rust, redb-backed, single-file ACID iceberg::Catalog. No SQL, no C deps.

skade-katalog

The three Norns — Urðr (past), Verðandi (present), Skuld (future) — recording the world's icebergs in Catalogus Icebergorum / Glacius Type Index


A pure-Rust iceberg::Catalog with cross-table transactions. A lock-free static search tree (Ragnar STree64) out front, continuously regenerated from a redb ACID backend. In-process — no network hop, no JVM, no JSON round-trip.

  • table_exists — 2.0M ops/s, ~492× Nessie · ~677× Polaris (0.31 µs p50). Every core reads the front lock-free.
  • Ingest saturates the box — 100% CPU, fully multicore. All cores encode Parquet in parallel; a gatling no-barrier pool keeps the writes overlapping, even over S3.
  • ACID. Every mutation is one redb WriteTransaction. Atomic multi-table commits via RedbCatalog::atomic_release.

Full leaderboard (skade-katalog vs Nessie vs Polaris, every storage variant, all four capabilities) → Benchmarks.

Figure: table_exists catalog reads (ops/sec, log scale · higher is better)

Status

The Iceberg Catalog trait is fully implemented and tested against iceberg = "0.9.1". One gap: schema evolution (the upstream Transaction actions aren't public until iceberg-rust 0.10). See Known shortcomings.

skade — the data-plane companion crate

This repo also ships skade ("winter queen" — Skaði): fast Iceberg table writing/reading + ergonomic DataFusion SQL on top of RedbCatalog. One directory = catalog (catalog.redb) + warehouse; Arrow RecordBatch in, SQL out (wh.sql("SELECT … FROM a JOIN b …")). It packages the writer stack, read-to-Arrow primitive, schema bridges (incl. the unsigned widen/unwiden reinterprets znippy needs), and the windowed-Parquet bulk-ingest helpers that were proven in bench/. Like bench/, it is a detached cargo workspace (skade/), so this crate's manifest and lockfile stay untouched. See skade/README.md.

Read & write paths

Every call resolves left-to-right through these layers, falling through only on a miss. Latest-state reads ride the L1 pointer mirror; time-travel (snapshot-id) reads ride the Ragnar STree64 index. redb is always the source of truth behind them — the in-memory layers are keyed by immutable, content-addressed identifiers, so they can only be evicted, never stale.

Figure: read paths drawn as a circuit board. Each catalog function enters from the left and is wired through the in-memory cache layers (the L1 pointer mirror, L0 metadata cache, L1.5 handle cache, and the Ragnar STree64 snapshot index), falling through to the redb source-of-truth rail only on a miss.

| layer | key → value | type | populated |
|---|---|---|---|
| L1 pointer mirror | table_key → metadata_location | ArcSwap<imbl::HashMap<String, Arc<str>, foldhash>> | full scan at open, then write-through |
| L1.5 handle cache | metadata_location → Table | moka::Cache<String, Table> (capacity-bounded) | on load_table miss |
| L0 metadata cache | metadata_location → TableMetadata | moka::Cache<String, Arc<TableMetadata>> (byte-bounded, single-flight) | on metadata miss |
| Ragnar static index | snapshot_id → (table_key, metadata_location) | ArcSwapOption<STree64> | warm-built at open, rebuilt by bg compactor (≥1024 commits) |
| redb source of truth | tables · commits · namespaces · namespace_props · meta | Mutex<redb::Database> (ACID) | every committed write |
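
The fall-through behaviour of the latest-state path can be illustrated with a small std-only sketch. It is not the crate's implementation: `LayeredCatalog`, `Served`, `locate`, `commit` and `evict` are hypothetical names, an `RwLock<HashMap>` stands in for the lock-free ArcSwap mirror, and a `BTreeMap` stands in for the redb `tables` keyspace.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::{Arc, RwLock};

/// Which layer answered a lookup (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
pub enum Served {
    Mirror,
    Store,
    Absent,
}

/// Hypothetical two-layer reader. `store` stands in for the redb `tables`
/// keyspace; `mirror` stands in for the L1 pointer mirror.
pub struct LayeredCatalog {
    mirror: RwLock<HashMap<String, Arc<str>>>,
    store: RwLock<BTreeMap<String, Arc<str>>>,
}

impl LayeredCatalog {
    /// "Full scan at open": copy every pointer from the store into the mirror.
    pub fn open(tables: BTreeMap<String, Arc<str>>) -> Self {
        let mirror = tables
            .iter()
            .map(|(k, v)| (k.clone(), Arc::clone(v)))
            .collect();
        Self {
            mirror: RwLock::new(mirror),
            store: RwLock::new(tables),
        }
    }

    /// Resolve table_key -> metadata_location, left to right through the layers.
    pub fn locate(&self, key: &str) -> (Option<Arc<str>>, Served) {
        if let Some(loc) = self.mirror.read().unwrap().get(key) {
            return (Some(Arc::clone(loc)), Served::Mirror);
        }
        // Mirror miss: fall through to the source of truth.
        match self.store.read().unwrap().get(key) {
            Some(loc) => (Some(Arc::clone(loc)), Served::Store),
            None => (None, Served::Absent),
        }
    }

    /// Write path: commit to the store first, then write through to the mirror.
    pub fn commit(&self, key: &str, location: &str) {
        let loc: Arc<str> = Arc::from(location);
        self.store
            .write()
            .unwrap()
            .insert(key.to_string(), Arc::clone(&loc));
        self.mirror.write().unwrap().insert(key.to_string(), loc);
    }

    /// Dropping a mirror entry only costs speed: the next read falls through.
    pub fn evict(&self, key: &str) {
        self.mirror.write().unwrap().remove(key);
    }
}
```

In this sketch a read after `commit` is answered by the mirror, and a read after `evict` is answered by the store with the same location, which is the "evicted, never stale" property described above.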

The same routing in full — including the time-travel, write, and redb-direct paths the diagram leaves out:

LATEST READS
  table_exists(id) ───────► L1 mirror ──hit──► true
                                └─miss─► redb `tables` ─────► bool

  resolve_metadata(id) ───► L1 mirror ──► loc ──► L0 cache ──hit──► Arc<TableMetadata>
       (skips Table::build)     └─miss─► redb         └─miss─► FileIO read + parse JSON

  load_table(id) ─────────► L1 mirror ──► loc ──► L1.5 cache ──hit──► Table  (~100 ns clone)
                                └─miss─► redb          └─miss─► L0 cache ──► Table::build() + insert

TIME-TRAVEL READS  (by snapshot_id: i64)
  load_table_at / resolve_metadata_at(id, sid)
        ──► Ragnar STree64 ──hit──► loc ──► L1.5 / L0 ──► Table / Arc<TableMetadata>
                 └─miss (sid above cutoff)─► redb `commits` live tail ──► loc ──► …

  resolve_many(id, [sid; N])
        ──► Ragnar batch probe (1 pipelined pass) ──► redb `commits` (1 txn, misses only) ──► L0 per loc

WRITES  (create · register · update · drop · rename)
  …──► FileIO write …/<uuid>.metadata.json ──► redb WriteTransaction { tables CAS · commits log · meta++ }
        ──commit──► L1 mirror write-through (insert / remove) ──► maybe rebuild Ragnar (background)

  update_table ──► group-commit: N concurrent commits coalesce into 1 redb txn / 1 fsync   ← commit-burst lever

REDB-DIRECT  (no cache layer)
  list_namespaces · get_namespace · namespace_exists · list_tables ──► redb read txn (range scan)
  create / update / drop_namespace ──────────────────────────────────► redb WriteTransaction
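
The time-travel path (static index first, live commit tail on a miss) can be sketched in the same spirit. This is a simplified model under stated assumptions: a sorted `Vec` with binary search stands in for the Ragnar STree64, a `BTreeMap` stands in for the redb `commits` log, the rebuild runs inline rather than in a background compactor, and snapshot ids are assumed unique. `SnapshotIndex`, `Hit`, `record` and `resolve` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Which structure answered a snapshot lookup (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Hit {
    Frozen,
    Tail,
}

/// Hypothetical snapshot resolver: an immutable sorted array in front of a
/// small live tail of commits made since the last rebuild.
pub struct SnapshotIndex {
    frozen: Vec<(i64, String)>, // sorted by snapshot_id
    tail: BTreeMap<i64, String>,
    rebuild_threshold: usize,
}

impl SnapshotIndex {
    pub fn new(rebuild_threshold: usize) -> Self {
        Self {
            frozen: Vec::new(),
            tail: BTreeMap::new(),
            rebuild_threshold,
        }
    }

    /// Record a commit in the live tail; fold the tail into the frozen
    /// index once enough commits have accumulated.
    pub fn record(&mut self, snapshot_id: i64, location: &str) {
        self.tail.insert(snapshot_id, location.to_string());
        if self.tail.len() >= self.rebuild_threshold {
            self.rebuild();
        }
    }

    /// Probe the frozen index first; consult the tail only on a miss.
    pub fn resolve(&self, snapshot_id: i64) -> Option<(&str, Hit)> {
        if let Ok(i) = self
            .frozen
            .binary_search_by_key(&snapshot_id, |(sid, _)| *sid)
        {
            return Some((self.frozen[i].1.as_str(), Hit::Frozen));
        }
        self.tail
            .get(&snapshot_id)
            .map(|loc| (loc.as_str(), Hit::Tail))
    }

    /// Merge the tail into the frozen array and restore sort order.
    fn rebuild(&mut self) {
        self.frozen.extend(std::mem::take(&mut self.tail));
        self.frozen.sort_by_key(|(sid, _)| *sid);
    }
}
```

With `SnapshotIndex::new(1024)` the rebuild cadence would mirror the ≥1024-commit threshold quoted for the compactor; a smaller threshold makes the two paths easy to observe in a test.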

Benchmarks

All numbers below are auto-filled by nornir docs render from the latest nornir bench run (machine/cores/version in each header). Don't hand-edit inside the generated regions — re-run the bench instead. Reproduce with the harness in bench/ (cargo run --bin bench-containers -- up in bench/, then nornir bench run skade-katalog from workspace_skade/).

Catalog read RPC — table_exists (storage-free, runs on every backend)

table_exists is a pure catalog RPC (no object storage), the apples-to-apples comparable; Nessie/Polaris go through the same iceberg::Catalog REST client.

v0.4.11 · oden · 32 cores · 2026-06-12

workload ops_sec p50_us p90_us p99_us mean_us min_us
nessie_table_exists 0.00476M 194.89 249.07 436.37 209.60 148.78
polaris_table_exists 0.00206M 437.06 599.43 1,176 483.85 341.26
skade_katalog_embedded_table_exists 2.03990M 0.33 0.38 0.46 0.36 0.32
skade_katalog_rest_table_exists 0.02006M 51.19 56.73 68.77 49.60 40.85

Embedded read fast paths (nornir-only)

resolve_metadata skips Table::build() (the lock-free L1+L0 floor); load_table is the full Catalog trait path after the L1.5 handle cache.

v0.4.11 · oden · 32 cores · 2026-06-12

workload ops_sec p50_us p90_us p99_us mean_us min_us
skade_katalog_embedded_load_table 0.943M 0.83 0.86 4.99 0.91 0.79
skade_katalog_embedded_resolve_metadata 1.141M 0.64 0.67 4.76 0.73 0.61

Data plane — single-writer / many-processor ingest

All cores encode Parquet; one writer streams + commits (data-pipe). Backend-agnostic and storage-bound (shared client-side Parquet + S3), so rows/s converge across backends: this lifts all boats, it is not a catalog lever. Nessie runs the same shared RustFS S3 warehouse. Scan verifies the round-trip. The nornir local-FS row runs two file destinations, _nvme (PCIe-4.0 NVMe) and _ram (/dev/shm tmpfs), to expose the pure storage floor.

v0.4.11 · oden · 32 cores · 2026-06-12

workload rows_per_sec files
data_pipe_nessie 9.79M 10
data_pipe_skade_nvme 34.24M 10
data_pipe_skade_ram 31.13M 10
data_pipe_skade_s3 14.06M 10
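
The single-writer / many-processor shape can be sketched with std threads and a channel. This is only an illustration of the pattern, not the skade pipeline: `encode` is a hypothetical stand-in for Parquet encoding (it just serialises integers), `ingest` is a made-up name, and the "writer" collects encoded chunks in memory instead of streaming them to storage.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the CPU-heavy encode step.
pub fn encode(batch: &[u64]) -> Vec<u8> {
    batch.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encode `batches` on `workers` threads while a single writer (the calling
/// thread) consumes results as they arrive. Returns (batch index, bytes) in
/// arrival order.
pub fn ingest(batches: &[Vec<u64>], workers: usize) -> Vec<(usize, Vec<u8>)> {
    let (tx, rx) = mpsc::channel();
    // Shared cursor: each worker claims the next unprocessed batch, so no
    // worker waits at a barrier for the others.
    let next = AtomicUsize::new(0);

    thread::scope(|s| {
        for _ in 0..workers.max(1) {
            let tx = tx.clone();
            let next = &next;
            s.spawn(move || loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(batch) = batches.get(i) else { break };
                if tx.send((i, encode(batch))).is_err() {
                    break;
                }
            });
        }
        // Drop the original sender so the writer loop ends once every
        // worker has finished and hung up.
        drop(tx);

        // The single writer: drains encoded chunks while workers keep encoding.
        let mut written = Vec::new();
        for item in rx {
            written.push(item);
        }
        written
    })
}
```

Because encoding and writing overlap, throughput in this shape is bounded by whichever side is slower, which matches the observation above that the rows/s figures are storage-bound rather than catalog-bound.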

Commit-bursty — the catalog lever (group-commit)

Many small concurrent metadata commits — where skade-katalog coalesces commits into one redb txn/fsync. This is where the embedded catalog genuinely pulls ahead of a REST/JVM catalog (throughput and tail latency).

v0.4.11 · oden · 32 cores · 2026-06-12

workload commits_per_sec p50_us p90_us p99_us mean_us min_us
commit_burst_nessie 1,032 5,945 48,676 308,507 24,391 2,444
commit_burst_skade_nvme 3,982 7,986 9,400 10,104 8,030 367.31
commit_burst_skade_ram 5,600 5,786 6,921 7,744 5,703 349.89
commit_burst_skade_s3 3,081 10,328 11,803 13,703 10,319 2,292
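
The group-commit idea (many queued commits, one transaction, one fsync) can be reduced to a few lines. The sketch below is an assumption-laden model, not the catalog's code: `GroupCommitter`, `Commit`, `submit` and `flush` are hypothetical names, a counter stands in for the fsync, and the flush is triggered explicitly instead of by a leader thread.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical pending pointer update for one table.
pub struct Commit {
    pub table: String,
    pub new_location: String,
}

/// Stand-in for durable state: table pointers plus an fsync counter.
#[derive(Default)]
struct Durable {
    pointers: HashMap<String, String>,
    fsyncs: u64,
}

/// Hypothetical group committer: callers enqueue, one flush applies the
/// whole queue as a single "transaction".
#[derive(Default)]
pub struct GroupCommitter {
    pending: Mutex<Vec<Commit>>,
    durable: Mutex<Durable>,
}

impl GroupCommitter {
    /// Called concurrently by many writers; only touches the queue.
    pub fn submit(&self, commit: Commit) {
        self.pending.lock().unwrap().push(commit);
    }

    /// Drain everything queued so far and apply it in one batch.
    /// Returns the number of commits coalesced into this batch.
    pub fn flush(&self) -> usize {
        let batch = std::mem::take(&mut *self.pending.lock().unwrap());
        if batch.is_empty() {
            return 0;
        }
        let coalesced = batch.len();
        let mut durable = self.durable.lock().unwrap();
        for commit in batch {
            durable.pointers.insert(commit.table, commit.new_location);
        }
        durable.fsyncs += 1; // one sync for the whole batch
        coalesced
    }

    pub fn fsyncs(&self) -> u64 {
        self.durable.lock().unwrap().fsyncs
    }

    pub fn pointer(&self, table: &str) -> Option<String> {
        self.durable.lock().unwrap().pointers.get(table).cloned()
    }
}
```

Eight concurrent `submit` calls followed by one `flush` cost a single sync in this model; the per-commit cost of the durable write is amortised across the burst, which is the lever the table above measures.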

TPC-H — analytical SQL (all 22 queries via DataFusion)

The full 8-table TPC-H schema loaded into Iceberg tables, queried through DataFusion (iceberg-datafusion) — per-query latency for all 22 canonical queries over the catalog's scan/manifest path. Authentic tpchgen data; TPCH_SF sets the scale (small by default). Answer values aren't validated here (that needs SF=1) — this measures plan + scan + execute latency end-to-end.

v0.4.11 · oden · 32 cores · 2026-06-12

workload max_us mean_us min_us ops_sec p50_us p90_us p999_us p99_us rows
tpch_q01 25,604 21,208 18,551 47.15 20,886 25,604 25,604 25,604 4
tpch_q02 44,228 42,709 41,896 23.41 42,435 44,228 44,228 44,228 4
tpch_q03 20,312 19,356 17,786 51.66 19,494 20,312 20,312 20,312 138
tpch_q04 15,695 15,192 14,492 65.82 15,115 15,695 15,695 15,695 5
tpch_q05 51,879 51,056 50,395 19.59 50,957 51,879 51,879 51,879 5
tpch_q06 4,139 3,850 3,666 259.72 3,817 4,139 4,139 4,139 1
tpch_q07 39,286 37,982 36,533 26.33 37,649 39,286 39,286 39,286 4
tpch_q08 60,772 59,150 57,838 16.91 58,860 60,772 60,772 60,772 2
tpch_q09 60,964 57,787 53,757 17.31 57,911 60,964 60,964 60,964 173
tpch_q10 32,870 32,594 32,327 30.68 32,610 32,870 32,870 32,870 399
tpch_q11 16,629 15,518 14,823 64.44 15,137 16,629 16,629 16,629 359
tpch_q12 19,659 19,124 18,247 52.29 19,326 19,659 19,659 19,659 2
tpch_q13 14,732 14,137 13,681 70.74 13,998 14,732 14,732 14,732 33
tpch_q14 7,848 7,429 7,126 134.62 7,398 7,848 7,848 7,848 1
tpch_q15 18,803 17,261 16,595 57.94 16,784 18,803 18,803 18,803 1
tpch_q16 25,185 23,537 22,865 42.49 23,175 25,185 25,185 25,185 296
tpch_q17 28,684 27,725 26,519 36.07 27,962 28,684 28,684 28,684 1
tpch_q18 38,520 37,450 36,285 26.70 37,376 38,520 38,520 38,520 2
tpch_q19 14,961 13,854 13,308 72.18 13,566 14,961 14,961 14,961 1
tpch_q20 28,227 27,712 27,216 36.09 27,738 28,227 28,227 28,227 1
tpch_q21 46,422 45,390 43,762 22.03 46,199 46,422 46,422 46,422 1
tpch_q22 14,853 13,396 12,107 74.65 13,574 14,853 14,853 14,853 7

TPC-H large warehouse (opt-in, big iron)

A full-scale run: ingest the 8-table warehouse at a large scale factor across all cores (partitioned parallel ingest), then run all 22 queries over it — reporting build vs query time, ingest rate, and the slowest query. Runs by default at a moderate scale; TPCH_WAREHOUSE_SF tunes it (default 10 ≈ 60 M lineitem rows / a few min; 100-200 for a 10-60 min big-iron workout), TPCH_WAREHOUSE=0 disables it, TPCH_WAREHOUSE_PARTS sets ingest partitions (default = all cores).

v0.4.11 · oden · 32 cores · 2026-06-12

workload build_s ingest_rows ingest_rows_per_sec parts query_s result_rows scale_factor slowest_query slowest_query_s total_s
tpch_warehouse 3.42 86.59M 25.33M 32 21.33 534,307 10 17 3.09 24.74

TPC-H across catalogs × storage (skade-katalog vs Nessie vs Polaris)

The same SF warehouse built and all 22 queries run over each (catalog × storage) combination through DataFusion — the analytical-SQL path compared apples-to-apples (not just the table_exists RPC). The matrix covers every combo that's possible: nornir on file-NVMe, file-RAM, and S3; Nessie S3-only (it rejects a local file warehouse); Polaris on file (its S3 credential vending 301s against the RustFS endpoint — see bench/src/containers.rs). The REST targets need their container + the shared RustFS S3 warehouse (bench-containers up all), else they skip. TPCH_COMPARE_SF sets the scale (default 1.0 ≈ 8.66M rows).

Parallel scan patch: stock iceberg-datafusion 0.9.1 scans a table as a single DataFusion partition (one serial Parquet-decode stream), so on a 32-core box only ~1 core works and queries are ~7-8× slower than they should be. The repo vendors a patched iceberg-datafusion (bench/vendor/, relative-path [patch] used by both bench/ and skade/) that exposes one partition per file, so DataFusion decodes files across all cores. SF100 FILE query dropped ~1660s → ~197s (8.4×). These numbers reflect that patch; published skade builds get stock 0.9.1 until the patch is upstreamed.

At SF100 (866M rows) both S3-backed paths (skade_s3, nessie_s3) fail with "hyper: connection closed before message completed" while finishing the Parquet writer. This is an S3 write-robustness limit under sustained heavy PUT load (rust-s3 / RustFS), not a catalog issue; the FILE-backed paths complete cleanly.

v0.4.11 · oden · 32 cores · 2026-06-12

workload build_s q01_s q02_s q03_s q04_s q05_s q06_s q07_s q08_s q09_s q10_s q11_s q12_s q13_s q14_s q15_s q16_s q17_s q18_s q19_s q20_s q21_s q22_s query_s rows scale_factor slowest_query slowest_query_s total_s
tpch_cmp_nessie_s3 1.86 0.48 0.43 0.67 0.61 0.81 0.43 0.75 0.86 1.06 0.69 0.35 0.63 0.25 0.47 0.89 0.22 1.04 1.23 0.49 0.63 1.61 0.25 14.80 8.66M 1 21 1.61 16.67
tpch_cmp_polaris_file 0.86 0.07 0.12 0.08 0.05 0.24 0.03 0.17 0.26 0.35 0.09 0.09 0.08 0.07 0.04 0.06 0.06 0.23 0.26 0.06 0.07 0.24 0.05 2.77 8.66M 1 9 0.35 3.63
tpch_cmp_skade_file_nvme 0.46 0.06 0.11 0.06 0.03 0.23 0.02 0.17 0.27 0.36 0.08 0.07 0.07 0.06 0.02 0.04 0.05 0.21 0.24 0.05 0.07 0.22 0.03 2.51 8.66M 1 9 0.36 2.97
tpch_cmp_skade_file_ram 0.43 0.06 0.11 0.06 0.04 0.21 0.02 0.16 0.26 0.34 0.08 0.07 0.07 0.06 0.02 0.04 0.04 0.21 0.22 0.05 0.06 0.23 0.03 2.46 8.66M 1 9 0.34 2.89
tpch_cmp_skade_s3 0.64 0.14 0.19 0.18 0.14 0.35 0.09 0.25 0.38 0.45 0.22 0.19 0.19 0.09 0.11 0.20 0.09 0.33 0.39 0.11 0.17 0.45 0.08 4.79 8.66M 1 9 0.45 5.43

OSM GeoParquet ingest (data plane)

End-to-end ingest of authentic OSM data: a nodes.parquet produced by the katana-osm / osm2geoparquet converter (OSM .osm/.bz2/.pbf → GeoParquet) is read, its schema mapped to Iceberg, and ingested into a fresh RedbCatalog table via the single-writer / many-processor pipeline, then scanned back to verify. NVMe vs RAM destinations isolate storage cost. Point OSM_GEOPARQUET at the file (OSM_MAX_ROWS caps rows); the osm_ingest_* benchers record a failure if it's unset, like the container-backed targets.

(no bench results)

ZSTD decode — zstd-sys-rs vs the zstd crate

Aggregate (all-core) decompression throughput on a real OSM corpus (WKB geometry bytes from the converted Europe extract, ~parquet-page-sized frames), decoding the same frames two ways: the stock zstd crate (one-shot, allocates per frame) and zstd-sys-rs's zero-copy path (a reused ZSTD_DCtx + reused output buffer). Both link the same static libzstd 1.5.7, so this validates the bindings and the zero-copy API on real data rather than chasing a codec difference. Needs OSM_GEOPARQUET (ZSTD_CHUNK_KB sets frame size).

(no bench results)

Quickstart

use std::collections::HashMap;
use std::sync::Arc;

use iceberg::io::LocalFsStorageFactory;
use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use iceberg::{Catalog, CatalogBuilder, NamespaceIdent, TableCreation};
use skade_katalog::RedbCatalogBuilder;

# async fn run() -> anyhow::Result<()> {
// Open the embedded catalog: one redb file plus a warehouse root for table data.
let catalog = RedbCatalogBuilder::default()
    .db_path("/var/lib/nornir/catalog.redb")
    .warehouse_location("file:///var/lib/nornir/warehouse")
    .with_storage_factory(Arc::new(LocalFsStorageFactory))
    .load("nornir", HashMap::new())
    .await?;

// Create the namespace that will hold the table.
let ns = NamespaceIdent::new("bench".to_string());
catalog.create_namespace(&ns, HashMap::new()).await?;

// Two required columns, each with its own field id.
let schema = Schema::builder()
    .with_schema_id(0)
    .with_fields(vec![
        NestedField::required(1, "run_id", Type::Primitive(PrimitiveType::String)).into(),
        NestedField::required(2, "ops_sec", Type::Primitive(PrimitiveType::Double)).into(),
    ])
    .build()?;

// Create the table in that namespace and get back its Table handle.
let table = catalog
    .create_table(
        &ns,
        TableCreation::builder()
            .name("bench_runs".to_string())
            .schema(schema)
            .build(),
    )
    .await?;
# Ok(()) }

Multi-table atomic commits

Iceberg's per-table Transaction::commit only gives single-table atomicity. When a release logically spans several tables (e.g. publishing bench_runs, dep_graph, and components from a single CI run), readers can otherwise observe a half-published state.

redb gives every write transaction global atomicity across all of its tables. RedbCatalog::atomic_release(commits) exploits this: each TableCommit is staged (metadata blobs written to object storage), then a single redb write transaction performs an optimistic-concurrency check on every base pointer and either flips them all or none.

       prepare metadata files       single redb txn
TableCommit ──┐                    ┌─ check base = current ┐
TableCommit ──┼─► fileio.write_to ─┤  …for every table…    ├─► commit
TableCommit ──┘                    └─ swap pointer         ┘

If iceberg-rust later exposes TableCommit construction publicly, the ergonomic story improves. Until then, atomic_release is most useful when you build commits via crates that have direct access to internal helpers (e.g. a nornir release driver).
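
The check-then-swap step can be shown in isolation. The sketch below is a std-only model, not the `RedbCatalog::atomic_release` implementation: `StagedCommit`, `Conflict`, `Pointers` and `release_all` are hypothetical names, a `Mutex<HashMap>` plays the role of the single redb write transaction, and each table is assumed to appear at most once per release.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical staged commit: the metadata file is already written, only
/// the pointer swap remains.
pub struct StagedCommit {
    pub table: String,
    /// Location the commit was built against (None means the table is new).
    pub base: Option<String>,
    pub new_location: String,
}

/// Returned when a base pointer no longer matches the current one.
#[derive(Debug, PartialEq)]
pub struct Conflict {
    pub table: String,
}

/// Stand-in for the `tables` keyspace; holding the lock models one
/// write transaction.
#[derive(Default)]
pub struct Pointers(Mutex<HashMap<String, String>>);

impl Pointers {
    /// Flip every pointer or none of them.
    pub fn release_all(&self, commits: &[StagedCommit]) -> Result<(), Conflict> {
        let mut current = self.0.lock().unwrap();

        // Phase 1: optimistic-concurrency check on every base pointer.
        for c in commits {
            if current.get(&c.table) != c.base.as_ref() {
                return Err(Conflict {
                    table: c.table.clone(),
                });
            }
        }

        // Phase 2: all checks passed, so swap every pointer.
        for c in commits {
            current.insert(c.table.clone(), c.new_location.clone());
        }
        Ok(())
    }

    pub fn current(&self, table: &str) -> Option<String> {
        self.0.lock().unwrap().get(table).cloned()
    }
}
```

Running all checks before the first write is what keeps a failed release from leaving a half-published state: a stale base on any one table returns `Conflict` with every pointer unchanged.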

Storage layout

Inside the redb file:

| keyspace | key | value |
|---|---|---|
| namespaces | <catalog>\x1f<ns_path> | empty marker |
| namespace_props | <catalog>\x1f<ns_path>\x1f<prop> | property value |
| tables | <catalog>\x1f<ns_path>\x1f<table> | metadata location |

Here <ns_path> is the dot-joined namespace ident ("a.b.c"). One redb file can host multiple logical catalogs under different <catalog> prefixes; in practice you'll point each catalog at its own file.
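
To make the key layout concrete, here is a small sketch of how such keys could be built and range-scanned. It assumes an ordered `BTreeMap<String, String>` as a stand-in for the redb `tables` keyspace; `table_key`, `ns_prefix` and `list_tables` are hypothetical helpers, not functions exported by the crate.

```rust
use std::collections::BTreeMap;

/// ASCII unit separator used between key components.
const SEP: &str = "\x1f";

/// Build a `tables` key: <catalog>\x1f<ns_path>\x1f<table>.
pub fn table_key(catalog: &str, ns: &[&str], table: &str) -> String {
    let ns_path = ns.join(".");
    [catalog, ns_path.as_str(), table].join(SEP)
}

/// Common prefix of every table key in one namespace.
fn ns_prefix(catalog: &str, ns: &[&str]) -> String {
    let ns_path = ns.join(".");
    let mut prefix = [catalog, ns_path.as_str()].join(SEP);
    prefix.push_str(SEP);
    prefix
}

/// List table names in a namespace with one ordered range scan.
pub fn list_tables<'a>(
    tables: &'a BTreeMap<String, String>,
    catalog: &str,
    ns: &[&str],
) -> Vec<&'a str> {
    let prefix = ns_prefix(catalog, ns);
    tables
        .range(prefix.clone()..)
        // Keys are sorted, so the scan can stop at the first non-matching key.
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(key, _)| &key[prefix.len()..])
        .collect()
}
```

Because the prefix ends with the separator, a scan of namespace "a" does not pick up tables in "a.b": the byte after "a" in those keys is "." rather than \x1f.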

Known shortcomings (0.1.3)

  • Schema evolution is not callable yet. iceberg-rust 0.9.1 seals the TransactionAction trait (pub(crate)), so downstream code can't construct AddSchema / SetCurrentSchema. The on-disk metadata already supports schema evolution — only the writer API is sealed — so it works here unchanged once upstream exposes it (0.10). These Transaction actions do work today: update_table_properties, fast_append, replace_sort_order, update_location, update_statistics, upgrade_format_version.

  • No orphan-file garbage collection. A commit that fails after writing metadata blobs to object storage but before the redb pointer-swap commits leaves those blobs behind — harmless but unreferenced. Reap them with Iceberg's standard orphan-file procedures; there is no built-in sweeper.
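
The selection step of an external sweeper might look roughly like the sketch below. It is an illustration only: `StoredFile` and `orphan_candidates` are hypothetical names, the file listing and the set of referenced locations are assumed to be gathered elsewhere, and the minimum-age guard is an assumption borrowed from common orphan-file practice so that blobs belonging to an in-flight commit are not selected.

```rust
use std::collections::HashSet;
use std::time::{Duration, SystemTime};

/// Hypothetical listing entry for one metadata blob in object storage.
pub struct StoredFile {
    pub path: String,
    pub modified: SystemTime,
}

/// Return paths that no catalog pointer references and that are older than
/// `min_age`. Nothing is deleted here; the caller decides what to do.
pub fn orphan_candidates<'a>(
    stored: &'a [StoredFile],
    referenced: &HashSet<String>,
    now: SystemTime,
    min_age: Duration,
) -> Vec<&'a str> {
    stored
        .iter()
        .filter(|f| !referenced.contains(&f.path))
        // Files with a modification time in the future are skipped.
        .filter(|f| {
            now.duration_since(f.modified)
                .map_or(false, |age| age >= min_age)
        })
        .map(|f| f.path.as_str())
        .collect()
}
```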

License

Licensed under Apache License, Version 2.0, matching the upstream Apache Iceberg project.