masstree 0.1.5

A high-performance concurrent ordered map (trie of B+trees) - alpha release, not production ready

A high-performance concurrent ordered map for Rust, supporting variable-length keys. This is an experimental branch implementing a major divergence from the original C++ Masstree: WIDTH=24 (vs. the original WIDTH=15). This requires an AtomicU128 for the permutation field, since 24 slots × 5 bits plus a 5-bit size field = 125 bits, which no longer fits in a u64.

The current Rust implementation releases the left leaf lock BEFORE calling propagate_split_generic, breaking the C++ hand-over-hand invariant:

Current Rust (BROKEN):
  handle_leaf_split_generic:
    lock.mark_split()
    split_into()
    link_sibling()
    drop(lock)                   <- WRONG: released too early
    propagate_split_generic()    <- left is unlocked during propagation

C++ Pattern (CORRECT):
  make_split:
    while (true) {
      // n and child are BOTH locked throughout
      p = n->locked_parent()
      // ... insert into parent ...
      // hand-over-hand unlock at END of iteration
      if (n != n_) n->unlock()
      if (child != n_) child->unlock()
    }
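
In Rust terms, the invariant can be illustrated with plain mutex guards. This is a toy sketch only: Mutex stands in for the optimistic node locks, and none of these names are the crate's API. The point is guard lifetime, not the locking primitive: both children's guards must survive until AFTER the parent insert.

```rust
use std::sync::{Mutex, MutexGuard};

// Toy illustration of the hand-over-hand invariant. Vec<u64> stands in
// for a node's key array.
fn propagate_split(
    left: MutexGuard<'_, Vec<u64>>,
    right: MutexGuard<'_, Vec<u64>>,
    parent: &Mutex<Vec<u64>>,
) {
    // The parent is locked while BOTH children are still held.
    let mut p = parent.lock().unwrap();
    p.push(right[0]); // insert the split key into the parent
    // Hand-over-hand: children unlock only now, at the END of the step.
    drop(left);
    drop(right);
}

fn main() {
    let left = Mutex::new(vec![1, 2]);
    let right = Mutex::new(vec![3, 4]);
    let parent = Mutex::new(Vec::new());
    propagate_split(left.lock().unwrap(), right.lock().unwrap(), &parent);
    assert_eq!(parent.lock().unwrap()[0], 3);
}
```

Because `propagate_split` takes the guards by value, the borrow checker itself enforces that the caller cannot release them early, which is the structural fix the broken version above lacks.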

Why WIDTH=24?

Larger leaf nodes mean fewer splits under concurrent writes. Benchmarks show:

Workload                      WIDTH=15    WIDTH=24   Improvement
16 threads, high contention   17-88ms     6-29ms     ~3x faster
32 threads, high contention   46-235ms    11-70ms    ~3-4x faster
Single-threaded insert        7.8-9.5ms   13-20ms    slower (larger nodes)

Trade-off: WIDTH=24 excels at concurrent writes with contention but is slower for single-threaded workloads due to larger node sizes.

Branch Name

Named atomic_abstraction because the implementation uses traits (TreeLeaf, TreePermutation, LayerCapableLeaf) to abstract over the atomic type. This makes it easier to generalize to other widths/atomic types in the future.

What This Is

A Rust implementation of the Masstree algorithm (Mao, Kohler, Morris; EuroSys 2012). Masstree is a trie of B+trees designed for high-throughput concurrent access with variable-length keys.

When to Use MassTree

Use Case                  Best Choice    Why
Integer keys (u64)        Congee (ART)   2x faster, specialized for integers
Variable-length keys      MassTree       Congee can't do this
String keys               MassTree       Congee can't do this
Unordered, any keys       DashMap        Simpler, no ordering overhead
Single-threaded ordered   BTreeMap       No concurrency overhead

MassTree's niche: When you need a concurrent ordered map with arbitrary byte-sequence keys and high read throughput.

Why MassTree? Theoretical Strengths

1. Variable-Length Keys with Shared Prefixes

MassTree is a trie of B+trees. Each 8-byte slice of a key is handled by a separate B+tree layer.

Key: "users/alice/profile"
     Layer 0: "users/al" → Layer 1: "ice/prof" → Layer 2: "ile\0\0\0\0\0"

Why this wins:

  • Keys with common prefixes share tree nodes (cache efficient)
  • No key comparison beyond the current 8-byte slice
  • O(key_length/8) layers, each with O(log n) B+tree lookup

Others can't match this because:

  • B+trees compare full keys at every node
  • Skip lists compare full keys at every level
  • Hash maps don't preserve order and can't do range scans
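
The 8-byte slicing above can be sketched directly. This helper is illustrative only, not part of the crate's API; it shows how a key decomposes into the zero-padded slices that each layer indexes:

```rust
/// Split a byte key into the zero-padded 8-byte slices handled by
/// successive trie layers (illustrative helper, not the crate's API).
fn key_slices(key: &[u8]) -> Vec<[u8; 8]> {
    key.chunks(8)
        .map(|chunk| {
            let mut slice = [0u8; 8]; // pad the final chunk with \0
            slice[..chunk.len()].copy_from_slice(chunk);
            slice
        })
        .collect()
}

fn main() {
    // "users/alice/profile" decomposes into three layers:
    // "users/al", "ice/prof", "ile" + five \0 bytes
    let slices = key_slices(b"users/alice/profile");
    assert_eq!(slices.len(), 3);
    assert_eq!(&slices[0], b"users/al");
    assert_eq!(&slices[2], b"ile\0\0\0\0\0");
}
```

(The real structure also tracks the true key length so that "abc" and "abc\0" remain distinct keys; the padding here is only for the per-layer fixed-width comparison.)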

2. Lock-Free Reads with Version Validation

Readers never acquire locks. They:

  1. Read version number
  2. Do work
  3. Validate version unchanged
  4. Retry if changed

Why this wins:

  • Zero contention on read-heavy workloads
  • No cache line bouncing from atomic increments (unlike Arc refcounts)
  • Readers scale linearly with cores
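
The four-step protocol above can be sketched with a version counter. This is a toy seqlock-style sketch, not the crate's internals (the real nodes carry richer version bits for splits and inserts); odd versions mark a write in progress:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU64, Ordering};

// Toy versioned slot: even version = stable, odd = write in progress.
struct VersionedSlot {
    version: AtomicU64,
    value: UnsafeCell<u64>,
}
unsafe impl Sync for VersionedSlot {}

impl VersionedSlot {
    fn read(&self) -> u64 {
        loop {
            // 1. Read the version; spin while a write is in progress.
            let v1 = self.version.load(Ordering::Acquire);
            if v1 % 2 == 1 {
                continue;
            }
            // 2. Do the work (read the payload).
            let value = unsafe { *self.value.get() };
            // 3. Validate the version is unchanged; 4. retry otherwise.
            if self.version.load(Ordering::Acquire) == v1 {
                return value;
            }
        }
    }
}

fn main() {
    let slot = VersionedSlot {
        version: AtomicU64::new(0),
        value: UnsafeCell::new(7),
    };
    assert_eq!(slot.read(), 7);
}
```
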

3. Fine-Grained Per-Node Locking

Writers only lock the specific leaf node being modified, not the whole tree or path.

Hot key advantage: When many threads hit the same key:

  • MassTree: Only that leaf locks, other leaves unaffected
  • Global locks: Everyone waits

4. Range Scans on Ordered Data

MassTree maintains sorted order within each layer's B+tree. Hash maps can't do this at all. Skip lists can, but with worse cache locality.

Where MassTree is Theoretically Optimal

Workload                         Why MassTree Wins
Read-heavy (>90% reads)          Lock-free reads scale perfectly
Hot keys                         Per-node locking isolates contention
Long keys with shared prefixes   Trie structure shares work
Mixed key lengths                Layer structure adapts naturally
Range scans needed               Ordered B+tree leaves

Where Others Win Theoretically

Workload                           Better Choice   Why
Pure hash lookups                  DashMap         O(1) vs O(log n)
Write-dominated, no order needed   DashMap         Simpler write path
Integer keys only                  Congee (ART)    Direct byte indexing, no comparisons

Status: Alpha

Not production ready. Core functionality works; APIs may change.

Feature             Status
Concurrent get      Lock-free, version-validated
Concurrent insert   Works (CAS + locked fallback)
Split propagation   Leaf and internode splits
Memory safety       Miri strict provenance clean
Range scans         Planned
Deletion            Planned

312+ tests passing (unit, property-based; some concurrent stress tests are flaky)

Benchmarks

All benchmarks run on the same hardware with --features mimalloc. Median throughput at 32 threads.

The Honest Comparison: MassTree vs Congee (ART)

For 8-byte integer keys, Congee wins decisively:

Structure          32 threads      Scaling (1→32)
Congee (ART)       291.8 Mitem/s   7.8x
MassTree           143.1 Mitem/s   6.6x
skiplist_guarded   88.0 Mitem/s    7.2x
DashMap            75.2 Mitem/s    2.7x

Why ART is faster: Adaptive Radix Trees index directly by key bytes (no comparisons). For fixed-size integer keys, this is optimal. MassTree's B+tree layers add overhead that only pays off with variable-length keys.

Where MassTree Wins

32-byte keys (Congee cannot participate — only supports usize):

Structure          32 threads     vs MassTree
MassTree           108 Mitem/s
DashMap            70.5 Mitem/s   1.5x slower
skiplist_guarded   40.5 Mitem/s   2.7x slower
skipmap            37.1 Mitem/s   2.9x slower
indexset           37.2 Mitem/s   2.9x slower

vs Lock-wrapped BTreeMap (8-byte keys, 32 threads):

Structure          Throughput    vs MassTree
MassTree           143 Mitem/s
RwLock<BTreeMap>   36 Mitem/s    4x slower
Mutex<BTreeMap>    4 Mitem/s     35x slower

vs Original C++ Masstree

Direct comparison using the same methodology as the C++ reference implementation's rw1 test: 10M random i32 keys, shuffled read access, value verification.

Threads   C++ Masstree   Rust MassTree   Difference
1         1.83 Mops/s    2.12 Mops/s     +16%
2         3.38 Mops/s    4.15 Mops/s     +23%
4         5.71 Mops/s    8.02 Mops/s     +40%
8         9.00 Mops/s    14.05 Mops/s    +56%
16        11.79 Mops/s   15.29 Mops/s    +30%
32        13.42 Mops/s   18.04 Mops/s    +34%

Scaling: C++ 7.3x vs Rust 8.5x (1→32 threads).

Why we're faster: We're simpler, not more optimized. The C++ implementation supports features we lack (deletion, range scans, variable-length key suffixes), and each feature adds overhead. We also benefit from seize's Hyaline memory reclamation (simpler than C++'s epoch-based RCU) and mimalloc.

Caveats: This is one specific workload (read-heavy, shuffled access). The C++ implementation has been battle-tested for over a decade. We haven't done serious optimization work yet (no SIMD in hot path). These numbers may not generalize to your use case.

To reproduce:

# Rust
just apples

# C++ (from reference/ directory)
# Just clone the original repo: https://github.com/kohler/masstree-beta
./configure && make
./mttest -j32 -l 10000000 rw1

Benchmark Caveats

These numbers are best-case for MassTree's design:

  • Read benchmarks use get_ref(&key, &guard) with a guard pinned once per thread
  • DashMap's get() creates a guard per call (no batch API available)
  • crossbeam-skiplist::SkipMap pins/unpins internally per call
  • Write-heavy benchmarks not shown — MassTree has known overhead for layer creation

Summary

Scenario                                 Winner
Integer keys, any thread count           Congee (2x faster)
Variable-length keys, high concurrency   MassTree
String keys, high concurrency            MassTree
Unordered hash map needs                 DashMap
Single-threaded ordered map              std::collections::BTreeMap

Quick Start

use masstree::MassTree24;
use std::sync::Arc;
use std::thread;

let tree = Arc::new(MassTree24::<u64>::new());

let handles: Vec<_> = (0..8).map(|i| {
    let tree = Arc::clone(&tree);
    thread::spawn(move || {
        let guard = tree.guard();

        // Insert with fine-grained locking
        tree.insert_with_guard(b"user:1000", i, &guard).unwrap();

        // Lock-free read
        if let Some(value) = tree.get_ref(b"user:1000", &guard) {
            println!("Got: {}", *value);
        }
    })
}).collect();

for h in handles { h.join().unwrap(); }

Note: MassTree24 is the WIDTH=24 variant. MassTree<V> (WIDTH=15) is also available for compatibility.

Use mimalloc for Best Performance

#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

How It Works

Trie of B+trees: Keys are split into 8-byte slices. Each slice navigates a B+tree layer. Longer keys chain through layers.

Key: "hello world!"

Layer 0: "hello wo" → B+tree lookup
                ↓
Layer 1: "rld!\0\0\0\0" → B+tree lookup → Value

Optimistic concurrency: Readers check version numbers before and after reading. If a concurrent write occurred, they retry. No locks acquired for reads.

Fine-grained writes: Writers lock only the target leaf node. CAS fast-path for simple inserts avoids locking entirely when possible.
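
The CAS fast path can be sketched as a single compare_exchange on a packed word. Illustrative only: the name `insert_slot` and the u64 word are placeholders, not the crate's API (the WIDTH=24 variant uses a u128 permutation).

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Fast-path sketch: snapshot the packed permutation word, compute the
/// updated word, and publish it with one compare_exchange. On contention
/// the caller falls back to the locked path instead of spinning.
fn insert_slot(perm: &AtomicU64, update: impl Fn(u64) -> u64) -> bool {
    let snapshot = perm.load(Ordering::Acquire);
    let updated = update(snapshot);
    perm.compare_exchange(snapshot, updated, Ordering::AcqRel, Ordering::Acquire)
        .is_ok() // true = published without taking any lock
}

fn main() {
    let perm = AtomicU64::new(0b0011);
    // With no concurrent writer, the fast path succeeds.
    assert!(insert_slot(&perm, |p| p + 1));
}
```
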

B-link trees: Split operations use sibling pointers for lock-free traversal, allowing reads to proceed during structural modifications.

Divergences from C++ Masstree

This is not a direct port. Key differences from the original C++ implementation:

Aspect               C++ Masstree              Rust MassTree
Memory Reclamation   Epoch-Based RCU           Hyaline via seize
Node Width           15 slots (u64 permuter)   24 slots (u128 permuter)

Why Hyaline over Epoch-Based RCU?

The C++ implementation uses epoch-based reclamation where readers announce entry/exit from critical sections and memory is freed after all readers from an epoch have departed.

We use seize's Hyaline scheme which:

  • Has simpler API (just Guard and retire)
  • Provides better worst-case latency (no epoch advancement delays)
  • Handles nested critical sections naturally
  • Is the state-of-the-art for Rust concurrent data structures

WIDTH=24 (Novel Optimization)

The original C++ implementation uses WIDTH=15 because the permutation encoding fits in a u64 (15 slots × 4 bits + a 4-bit size field = 64 bits).

This branch (atomic_abstraction) implements WIDTH=24 using AtomicU128 via portable-atomic:

  • 24 slots × 5 bits + 5 bits size = 125 bits (fits in u128)
  • 60% more capacity per node = fewer splits under concurrent writes
  • ~3x faster for high-contention concurrent writes (see benchmarks at top)

This optimization wasn't practical for the 2012 C++ implementation but is feasible now with modern 128-bit atomic support.
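
The 125-bit packing can be illustrated with plain u128 bit arithmetic. The field layout below (size in the low 5 bits, then 24 consecutive 5-bit slot indices) is an assumption for this sketch, not necessarily the crate's exact encoding:

```rust
// Assumed layout: bits 0..5 = size, bits 5*(i+1)..5*(i+2) = slot index i.
// 5 + 24*5 = 125 bits, which fits in a u128.
const SLOT_BITS: u32 = 5;
const MASK: u128 = (1 << SLOT_BITS) - 1; // 0b11111

fn size(perm: u128) -> u8 {
    (perm & MASK) as u8
}

fn slot(perm: u128, i: u32) -> u8 {
    ((perm >> (SLOT_BITS * (i + 1))) & MASK) as u8
}

fn set_slot(perm: u128, i: u32, v: u8) -> u128 {
    let shift = SLOT_BITS * (i + 1);
    (perm & !(MASK << shift)) | ((v as u128 & MASK) << shift)
}

fn main() {
    // One live entry whose record lives in physical slot 17.
    let perm = set_slot(0, 0, 17) | 1;
    assert_eq!(size(perm), 1);
    assert_eq!(slot(perm, 0), 17);
}
```

In the tree itself the whole word is stored in an AtomicU128 (via portable-atomic), so a writer can publish a reordered permutation in a single atomic store or CAS.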

Limitations

  1. No deletion — keys cannot be removed (planned)
  2. No range scans — ordered iteration not yet implemented (planned)
  3. Integer keys — use Congee instead, it's 2x faster
  4. Max key length — 256 bytes
  5. Memory — nodes freed only when tree drops (full reclamation planned)

Running Benchmarks

# WIDTH=24 concurrent benchmarks
cargo bench --bench concurrent_maps24 --features mimalloc

# Full comparison with other structures
cargo bench --bench concurrent_maps --features mimalloc

# Just the throughput scaling benchmarks
cargo bench --bench concurrent_maps --features mimalloc -- 08a_read_scaling
cargo bench --bench concurrent_maps --features mimalloc -- 08b_read_scaling

# vs lock-wrapped BTreeMap
cargo bench --bench lock_comparison --features mimalloc

WIDTH=24 Concurrent Writes (50k ops/thread)

Threads   Fastest   Slowest   Median
1         6.56ms    10.18ms   7.16ms
2         8.12ms    15.96ms   12.14ms
4         13.49ms   16.9ms    14.22ms
8         21.47ms   30.35ms   24.31ms
16        40.37ms   954.8ms   42.16ms
32        78.54ms   983ms     82.64ms

Median times scale well. Occasional outliers at 16+ threads due to contention.

References

  • Y. Mao, E. Kohler, R. Morris. Cache Craftiness for Fast Multicore Key-Value Storage. EuroSys 2012.
  • Original C++ implementation: https://github.com/kohler/masstree-beta

License

MIT

Disclaimer

Independent Rust implementation. Not affiliated with or endorsed by the original Masstree authors.