Hexane
Hexane is a columnar compression library that implements the encoding described in the Automerge Binary Format specification. It stores sequences of typed values using run-length encoding (RLE), delta encoding, and other compact representations, organized in a slab-based B-tree structure for efficient random access and in-place modification.
It was originally designed to serve Automerge's internal storage needs. Introducing it reduced Automerge's memory footprint to roughly 1% of the previous implementation while keeping performance equal to or better than the old codebase for most operations.
Note: The public API is still maturing. A v1.0 redesign is planned that will give
ColumnDataa moreVec-like interface and remove some ergonomic rough edges. See Planned v1.0 Changes for details.
Data Model
A ColumnData<C> stores a sequence of Option<T> values, where T is determined by the
cursor type C. None represents a null entry and is compressed efficiently as a null run.
Internally, data is held in a sequence of slabs — immutable, Arc-wrapped byte buffers
containing compressed runs of values. The slabs are organized in a SpanTree (a balanced
B-tree keyed by cumulative item count) that supports O(log n) positional seek, insert, and
splice. Each slab carries pre-computed metadata (item count, accumulator sum, min/max) so that
range queries and accumulator-based navigation can skip over slabs without decoding them.
Encoding Types
The cursor type parameter of ColumnData<C> determines the encoding:
| Cursor type | Item type | Encoding |
|---|---|---|
UIntCursor |
u64 |
RLE with unsigned LEB128 values |
IntCursor |
i64 |
RLE with signed LEB128 values |
StrCursor |
str |
RLE with length-prefixed UTF-8 |
ByteCursor |
[u8] |
RLE with length-prefixed bytes |
BooleanCursor |
bool |
Boolean run-length encoding |
DeltaCursor |
i64 |
Delta-encoded signed integers |
RawCursor |
[u8] |
Uncompressed raw bytes |
For custom types, use RleCursor<B, T> directly where T: Packable and B is the maximum
slab size in bytes.
Quick Start
use ;
use Cow;
// splice(index, delete_count, insert_iter) is the general-purpose mutation method.
let mut col: = new;
col.splice;
// Collect from an iterator — works with T or Option<T> interchangeably
let col2: = .into_iter.collect;
assert_eq!;
// Nullable columns: mix Some and None in the input
let col3: = .into_iter.collect;
assert_eq!;
Reading Values
All read methods return Option<Cow<'_, C::Item>>:
Some(Cow::Owned(value))— an in-bounds non-null value (copy types likeu64,i64)Some(Cow::Borrowed(&value))— an in-bounds non-null value that borrows from the slab (used byStrCursorandByteCursorto avoid copying)None— a null entry
get(index) returns an extra outer Option that is None when the index is out of bounds.
use ;
use Cow;
let mut col: = new;
col.splice;
// Random access — O(log n) seek + O(B) decode where B is runs-per-slab (small)
assert_eq!;
assert_eq!; // out of bounds returns outer None
// Sequential iteration — most efficient; decoder state is amortized across a slab
let first_three: = col.iter.take.collect;
assert_eq!;
// Ranged iteration — O(log n) seek to start, then O(range) decode
let mid: = col.iter_range.collect;
assert_eq!;
StrCursor and ByteCursor borrow directly from the internal slab where possible:
use ;
use Cow;
let col: = .into_iter.collect;
assert_eq!; // zero-copy
Serialization
use ;
let mut col: = new;
col.splice;
// save() serializes to a new Vec<u8>
let bytes: = col.save;
// save_to() appends to an existing buffer and returns the byte range written
let mut buf = vec!;
let range = col.save_to;
// load() deserializes; returns PackError if the data is malformed
let col2: = load.unwrap;
assert_eq!;
// load_unless_empty() treats an empty byte slice as a column of `len` null values
let col3: = load_unless_empty.unwrap;
assert_eq!;
// load_with() validates each decoded value with a callback; returns PackError on failure
let col4 = load_with.unwrap;
Direct Encoding
For one-shot encoding without ColumnData, cursor types expose encode and
encode_unless_empty class methods that write directly to a byte buffer:
use ;
let mut buf = vec!;
let word_range = encode;
assert_eq!;
// encode_unless_empty writes nothing if all values are null/false
let bool_range = encode_unless_empty;
assert_eq!; // nothing written
// For incremental encoding, use Encoder directly
let mut encoder: = default;
for word in
let _range = encoder.save_to;
Advanced: Accumulators (Acc)
Several ColumnData and iterator methods deal with an Acc accumulator. Acc is a
monotonically non-decreasing u64 — the cumulative sum of per-item Agg values. The
aggregate assigned to each item depends on the cursor type:
| Cursor | agg(item) |
Meaning of Acc |
|---|---|---|
UIntCursor |
the item value (clamped to u32) |
cumulative sum of values |
IntCursor |
the item value if it fits in u32 |
cumulative sum of values |
BooleanCursor |
1 for true, 0 for false |
count of true entries |
StrCursor |
0 | unused |
ByteCursor |
0 | unused |
DeltaCursor |
the absolute item value (if fits in u32) |
absolute position tracking |
get_acc(index) returns the accumulated sum of all items before position index:
use ;
let col: = from;
assert_eq!;
assert_eq!;
assert_eq!;
get_with_acc(index) returns a ColGroupItem { acc, pos, item } bundling the pre-item
accumulator with the value.
iter().with_acc() returns a ColGroupIter that emits ColGroupItem values.
iter().advance_acc_by(n) skips forward until the cumulative accumulator has grown by n,
returning the number of items consumed. This enables O(log n + B) lookup by acc value:
use ;
// Column: 0, 1, 1, 0, 1, 1, 0 — acc grows at positions 1, 2, 4, 5
let col: = from;
let mut iter = col.iter;
let items_consumed = iter.advance_acc_by; // skip until acc has grown by 2
// items_consumed == 4 (items 0,1,2 consumed, item 3 is the first with acc sum >= 2)
Advanced: Ordered Value Lookup
For columns with sorted values, scope_to_value and the iterator's seek_to_value efficiently
locate the contiguous range of a specific value using B-tree binary search plus a linear scan:
use ;
// Values must be sorted for correct results
let col: = .into_iter.collect;
let range = col.scope_to_value;
assert_eq!;
// Restrict search to a sub-range
let range = col.scope_to_value;
assert_eq!;
iter().seek_to_value(value, range) does the same and leaves the iterator positioned at
the start of the found range, ready to read further.
Advanced: Iterator Suspension and Resumption
An iterator can be suspended and later resumed from the exact same position, provided
ColumnData has not been mutated since the suspend:
use ;
let col: = .into_iter.collect;
let mut iter = col.iter;
iter.advance_by; // skip items 0 and 1
// Snapshot the iterator position
let state = iter.suspend;
// Resume — returns Err if `col` was mutated after the suspend
let mut resumed = state.try_resume.unwrap;
assert_eq!;
Performance Characteristics
- Random access (
get): O(log n) B-tree lookup + O(B) slab scan, where B is the number of encoded runs in a slab (bounded and small by design). - Sequential access (
iter,iter_range): O(log n) initial seek; amortized O(1) per item — the decode state is carried across items within a slab. - Modification (
splice,push,extend): O(log n) B-tree lookup plus an O(slab-size) slab rewrite. The affected slab is replaced entirely; unaffected slabs are untouched. - Clone: O(number of slabs) — slabs are
Arc-wrapped, so cloning shares byte data. - Serialization (
save): O(n) — all slabs are read and re-encoded into a single buffer.
For bulk reads, strongly prefer iter() or iter_range() over repeated get() calls.
For bulk inserts at a known position, a single splice() is more efficient than many push().
Caveats
Clone for rollback: Because slabs are Arc-wrapped, cloning a ColumnData is cheap
(O(slab count)). Automerge exploits this by cloning before each transaction and dropping the
clone on rollback. This behaviour will likely be removed in v1.0 in favour of an explicit
inverse-splice rollback.
Null semantics: is_empty() considers a column "empty" if every value is None (or
false for BooleanCursor). This affects save_to_unless_empty() and friends.
Slab size constant: The B const parameter in RleCursor<B, T> controls the maximum
number of items per slab. Smaller values reduce the per-operation cost but increase B-tree
overhead. The defaults (64 or 128) are a reasonable starting point.
Planned v1.0 Changes
- A
Vec-like subscript API:col[a..b]as shorthand forcol.iter_range(a..b) - Remove pervasive
Cowreturn types in favour of direct&T/ ownedT - A cleaner
DeltaCursorprimitive based onDelta { abs: i64, step: i64 } - Distinct cursor types for
Rle<u64>,Rle<Option<u64>>, etc. - Normalized min/max metadata (currently computed for all slabs, only needed for some)
- Transaction rollback via inverse splice rather than
Arc-shared clone - A SortedColumnData type so features like
seek_to_valueare not available on datasets that cant support it