# masstree

A high-performance concurrent ordered map for Rust, supporting variable-length keys. This is an experimental branch implementing a major divergence from the original C++ Masstree: WIDTH=24 (vs. the original WIDTH=15). This requires `AtomicU128` for the permutation field (24 slots × 5 bits = 120 bits, plus a 5-bit size field).
The current Rust implementation releases the left leaf lock BEFORE calling propagate_split_generic, breaking the C++ hand-over-hand invariant:
Current Rust (BROKEN):

```text
handle_leaf_split_generic:
    lock.mark_split()
    split_into()
    link_sibling()
    drop(lock)                  // WRONG: released too early
    propagate_split_generic()   // left leaf is unlocked during propagation
```
C++ Pattern (CORRECT):

```cpp
make_split:
    while (true) {
        // n and child are BOTH locked throughout
        p = n->locked_parent()
        // ... insert into parent ...
        // hand-over-hand unlock at END of iteration
        if (n != n_) n->unlock()
        if (child != n_) child->unlock()
    }
```
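The invariant is ordinary lock coupling: acquire the next lock before releasing the current one. A minimal illustration with `std::sync::Mutex` and a hypothetical node type (not this crate's actual leaf layout):

```rust
use std::sync::Mutex;

// Hypothetical minimal node; the real leaf layout is much richer.
struct Node {
    keys: Vec<u64>,
}

// Lock coupling: the freshly split child stays locked until AFTER
// its parent has been locked and the separator key inserted, so no
// window exists where neither level is protected.
fn propagate_split(child: &Mutex<Node>, parent: &Mutex<Node>, split_key: u64) {
    let child_guard = child.lock().unwrap();       // held since the split
    let mut parent_guard = parent.lock().unwrap(); // lock parent WHILE child is held
    parent_guard.keys.push(split_key);             // insert separator into parent
    drop(child_guard);                             // only now release the child
    drop(parent_guard);                            // then the parent
}

fn main() {
    let child = Mutex::new(Node { keys: vec![50] });
    let parent = Mutex::new(Node { keys: vec![10] });
    propagate_split(&child, &parent, 42);
    assert_eq!(parent.lock().unwrap().keys, vec![10, 42]);
}
```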
## Why WIDTH=24?
Larger leaf nodes mean fewer splits under concurrent writes. Benchmarks show:
| Workload | WIDTH=15 | WIDTH=24 | Improvement |
|---|---|---|---|
| 16 threads, high contention | 17-88ms | 6-29ms | ~3x faster |
| 32 threads, high contention | 46-235ms | 11-70ms | ~3-4x faster |
| Single-threaded insert | 7.8-9.5ms | 13-20ms | slower (larger nodes) |
Trade-off: WIDTH=24 excels at concurrent writes with contention but is slower for single-threaded workloads due to larger node sizes.
## Branch Name

Named `atomic_abstraction` because the implementation uses traits (`TreeLeaf`, `TreePermutation`, `LayerCapableLeaf`) to abstract over the atomic type. This makes it easier to generalize to other widths and atomic types in the future.
## What This Is
A Rust implementation of the Masstree algorithm (Mao, Kohler, Morris — EUROSYS 2012). MassTree is a trie of B+trees designed for high-throughput concurrent access with variable-length keys.
## When to Use MassTree
| Use Case | Best Choice | Why |
|---|---|---|
| Integer keys (u64) | Congee (ART) | 2x faster, specialized for integers |
| Variable-length keys | MassTree | Congee can't do this |
| String keys | MassTree | Congee can't do this |
| Unordered, any keys | DashMap | Simpler, no ordering overhead |
| Single-threaded ordered | BTreeMap | No concurrency overhead |
MassTree's niche: When you need a concurrent ordered map with arbitrary byte-sequence keys and high read throughput.
## Why MassTree? Theoretical Strengths
### 1. Variable-Length Keys with Shared Prefixes
MassTree is a trie of B+trees. Each 8-byte slice of a key is handled by a separate B+tree layer.
```text
Key: "users/alice/profile"
Layer 0: "users/al" → Layer 1: "ice/prof" → Layer 2: "ile\0\0\0\0\0"
```
Why this wins:
- Keys with common prefixes share tree nodes (cache efficient)
- No key comparison beyond the current 8-byte slice
- O(key_length/8) layers, each with O(log n) B+tree lookup
Others can't match this because:
- B+trees compare full keys at every node
- Skip lists compare full keys at every level
- Hash maps don't preserve order and can't do range scans
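The 8-byte slicing above can be sketched with a small helper (the function name is illustrative, not this crate's API):

```rust
// Split a key into 8-byte layer slices, zero-padding the last one,
// as in the "users/alice/profile" example above.
fn key_slices(key: &[u8]) -> Vec<[u8; 8]> {
    key.chunks(8)
        .map(|c| {
            let mut slice = [0u8; 8];
            slice[..c.len()].copy_from_slice(c);
            slice
        })
        .collect()
}

fn main() {
    let slices = key_slices(b"users/alice/profile");
    assert_eq!(slices.len(), 3);          // three layers for a 19-byte key
    assert_eq!(&slices[0], b"users/al");  // layer 0
    assert_eq!(&slices[1], b"ice/prof");  // layer 1
    assert_eq!(&slices[2], b"ile\0\0\0\0\0"); // layer 2, zero-padded
}
```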
### 2. Lock-Free Reads with Version Validation
Readers never acquire locks. They:

1. read the version number,
2. do the work,
3. validate the version is unchanged,
4. retry if it changed.
Why this wins:
- Zero contention on read-heavy workloads
- No cache line bouncing from atomic increments (unlike Arc refcounts)
- Readers scale linearly with cores
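A minimal seqlock-style sketch of this protocol, assuming a single-word payload (the real node layout and version encoding are richer):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Even version = node stable; odd version = a writer is mid-update.
struct VersionedCell {
    version: AtomicU64,
    value: AtomicU64, // stand-in for the node's real payload
}

impl VersionedCell {
    fn new(value: u64) -> Self {
        Self { version: AtomicU64::new(0), value: AtomicU64::new(value) }
    }

    // Lock-free read: snapshot version, read, validate, retry on change.
    fn read(&self) -> u64 {
        loop {
            let v1 = self.version.load(Ordering::Acquire);
            if v1 % 2 == 1 {
                continue; // writer in progress: wait for a stable version
            }
            let value = self.value.load(Ordering::Acquire);
            if self.version.load(Ordering::Acquire) == v1 {
                return value; // version unchanged: the read is consistent
            }
            // a write intervened: retry
        }
    }

    // Writer: bump to odd, mutate, bump back to even.
    fn write(&self, value: u64) {
        self.version.fetch_add(1, Ordering::AcqRel);
        self.value.store(value, Ordering::Release);
        self.version.fetch_add(1, Ordering::Release);
    }
}

fn main() {
    let cell = VersionedCell::new(7);
    assert_eq!(cell.read(), 7);
    cell.write(9);
    assert_eq!(cell.read(), 9);
}
```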
### 3. Fine-Grained Per-Node Locking
Writers only lock the specific leaf node being modified, not the whole tree or path.
Hot key advantage: When many threads hit the same key:
- MassTree: Only that leaf locks, other leaves unaffected
- Global locks: Everyone waits
### 4. Range Scans on Ordered Data
MassTree maintains sorted order within each layer's B+tree. Hash maps can't do this at all. Skip lists can, but with worse cache locality.
## Where MassTree is Theoretically Optimal
| Workload | Why MassTree Wins |
|---|---|
| Read-heavy (>90% reads) | Lock-free reads scale perfectly |
| Hot keys | Per-node locking isolates contention |
| Long keys with shared prefixes | Trie structure shares work |
| Mixed key lengths | Layer structure adapts naturally |
| Range scans needed | Ordered B+tree leaves |
## Where Others Win Theoretically
| Workload | Better Choice | Why |
|---|---|---|
| Pure hash lookups | DashMap | O(1) vs O(log n) |
| Write-dominated, no order needed | DashMap | Simpler write path |
| Integer keys only | Congee (ART) | Direct byte indexing, no comparisons |
## Status: Alpha
Not production ready. Core functionality works; APIs may change.
| Feature | Status |
|---|---|
| Concurrent get | Lock-free, version-validated |
| Concurrent insert | Works (CAS + locked fallback) |
| Split propagation | Leaf and internode splits |
| Memory safety | Miri strict provenance clean |
| Range scans | Planned |
| Deletion | Planned |
312+ tests pass (unit and property-based; some concurrent stress tests are flaky).
## Benchmarks

All benchmarks were run on the same hardware with `--features mimalloc`. Numbers are median throughput at 32 threads.
### The Honest Comparison: MassTree vs Congee (ART)
For 8-byte integer keys, Congee wins decisively:
| Structure | 32 threads | Scaling (1→32) |
|---|---|---|
| Congee (ART) | 291.8 Mitem/s | 7.8x |
| MassTree | 143.1 Mitem/s | 6.6x |
| skiplist_guarded | 88.0 Mitem/s | 7.2x |
| DashMap | 75.2 Mitem/s | 2.7x |
Why ART is faster: Adaptive Radix Trees index directly by key bytes (no comparisons). For fixed-size integer keys, this is optimal. MassTree's B+tree layers add overhead that only pays off with variable-length keys.
### Where MassTree Wins

**32-byte keys** (Congee cannot participate: it only supports `usize` keys):
| Structure | 32 threads | vs MassTree |
|---|---|---|
| MassTree | 108 Mitem/s | — |
| DashMap | 70.5 Mitem/s | 1.5x slower |
| skiplist_guarded | 40.5 Mitem/s | 2.7x slower |
| skipmap | 37.1 Mitem/s | 2.9x slower |
| indexset | 37.2 Mitem/s | 2.9x slower |
**vs Lock-wrapped BTreeMap** (8-byte keys, 32 threads):
| Structure | Throughput | vs MassTree |
|---|---|---|
| MassTree | 143 Mitem/s | — |
| RwLock<BTreeMap> | 36 Mitem/s | 4x slower |
| Mutex<BTreeMap> | 4 Mitem/s | 35x slower |
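For reference, the lock-wrapped baselines above are just the standard-library map behind a single global lock. This sketch uses hypothetical parameters, not the actual benchmark harness:

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Baseline from the table: one global RwLock around a std BTreeMap,
// so every writer serializes on the same lock.
fn concurrent_fill(threads: u64, per_thread: u64) -> usize {
    let map = Arc::new(RwLock::new(BTreeMap::new()));
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..per_thread {
                    // each insert takes the single write lock
                    map.write().unwrap().insert(t * per_thread + i, i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let len = map.read().unwrap().len();
    len
}

fn main() {
    assert_eq!(concurrent_fill(4, 1000), 4000);
}
```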
### vs Original C++ Masstree

Direct comparison using the same methodology as the C++ reference implementation's `rw1` test: 10M random i32 keys, shuffled read access, value verification.
| Threads | C++ Masstree | Rust MassTree | Difference |
|---|---|---|---|
| 1 | 1.83 Mops/s | 2.12 Mops/s | +16% |
| 2 | 3.38 Mops/s | 4.15 Mops/s | +23% |
| 4 | 5.71 Mops/s | 8.02 Mops/s | +40% |
| 8 | 9.00 Mops/s | 14.05 Mops/s | +56% |
| 16 | 11.79 Mops/s | 15.29 Mops/s | +30% |
| 32 | 13.42 Mops/s | 18.04 Mops/s | +34% |
Scaling: C++ 7.3x vs Rust 8.5x (1→32 threads).
Why we're faster: We're simpler, not more optimized. The C++ implementation supports features we lack (deletion, range scans, variable-length key suffixes), and each feature adds overhead. We also benefit from seize's Hyaline memory reclamation (simpler than C++'s epoch-based RCU) and mimalloc.
Caveats: This is one specific workload (read-heavy, shuffled access). The C++ implementation has been battle-tested for over a decade. We haven't done serious optimization work yet (no SIMD in hot path). These numbers may not generalize to your use case.
To reproduce:

```sh
# Rust
# C++ (from the reference/ directory)
# Just clone the original repo: https://github.com/kohler/masstree-beta
```
### Benchmark Caveats

These numbers are best-case for MassTree's design:

- Read benchmarks use `get_ref(&key, &guard)` with a guard pinned once per thread
- DashMap's `get()` creates a guard per call (no batch API available)
- `crossbeam-skiplist::SkipMap` pins/unpins internally per call
- Write-heavy benchmarks are not shown; MassTree has known overhead for layer creation
## Summary
| Scenario | Winner |
|---|---|
| Integer keys, any thread count | Congee (2x faster) |
| Variable-length keys, high concurrency | MassTree |
| String keys, high concurrency | MassTree |
| Unordered hash map needs | DashMap |
| Single-threaded ordered map | std::collections::BTreeMap |
## Quick Start

```rust
// Crate path and value type here are illustrative; see the crate docs for exact names.
use masstree::MassTree24;
use std::sync::Arc;
use std::thread;

fn main() {
    let tree: Arc<MassTree24<u64>> = Arc::new(MassTree24::new());
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let tree = Arc::clone(&tree);
            thread::spawn(move || {
                // ... per-thread inserts/reads on `tree` ...
                let _ = &tree;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```
Note: `MassTree24` is the WIDTH=24 variant. `MassTree<V>` (WIDTH=15) is also available for compatibility.
### Use mimalloc for Best Performance

```rust
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```
## How It Works
Trie of B+trees: Keys are split into 8-byte slices. Each slice navigates a B+tree layer. Longer keys chain through layers.
```text
Key: "hello world!"
Layer 0: "hello wo" → B+tree lookup
              ↓
Layer 1: "rld!\0\0\0\0" → B+tree lookup → Value
```
Optimistic concurrency: Readers check version numbers before and after reading. If a concurrent write occurred, they retry. No locks acquired for reads.
Fine-grained writes: Writers lock only the target leaf node. CAS fast-path for simple inserts avoids locking entirely when possible.
B-link trees: Split operations use sibling pointers for lock-free traversal, allowing reads to proceed during structural modifications.
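As a rough sketch of the CAS fast path described above: a writer publishes a new permutation word with a single compare-exchange and falls back to the locked path only when it loses a race. The layout here is a simplified WIDTH=15-style `u64` permuter, not the crate's actual encoding:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Simplified permuter: low 4 bits hold the entry count, each
// subsequent 4-bit field holds a slot index (WIDTH=15 style).
fn try_insert_slot(perm: &AtomicU64, slot: u64) -> bool {
    let old = perm.load(Ordering::Acquire);
    let count = old & 0xF;
    if count >= 15 {
        return false; // node full: caller must split under the lock
    }
    // Append `slot` at position `count` and bump the count.
    let new = (old & !0xF) | (count + 1) | (slot << (4 + 4 * count));
    // Publish atomically; if a concurrent writer won the race, the
    // caller retries or takes the locked insert path.
    perm.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

fn main() {
    let perm = AtomicU64::new(0);
    assert!(try_insert_slot(&perm, 3));
    // count = 1 in the low nibble, slot 3 in the first field
    assert_eq!(perm.load(Ordering::Relaxed), 0x31);
}
```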
## Divergences from C++ Masstree
This is not a direct port. Key differences from the original C++ implementation:
| Aspect | C++ Masstree | Rust MassTree |
|---|---|---|
| Memory Reclamation | Epoch-Based RCU | Hyaline via seize |
| Node Width | 15 slots (u64 permuter) | 24 slots (u128 permuter) |
### Why Hyaline over Epoch-Based RCU?

The C++ implementation uses epoch-based reclamation: readers announce entry to and exit from critical sections, and memory is freed once all readers from an epoch have departed.

We use seize's Hyaline scheme, which:

- has a simpler API (just `Guard` and `retire`)
- provides better worst-case latency (no epoch-advancement delays)
- handles nested critical sections naturally
- is state-of-the-art for Rust concurrent data structures
### WIDTH=24 (Novel Optimization)

The original C++ uses WIDTH=15 because the permutation encoding fits in a `u64` (15 slots × 4 bits + a 4-bit size field = 64 bits).

This branch (`atomic_abstraction`) implements WIDTH=24 using `AtomicU128` via `portable-atomic`:
- 24 slots × 5 bits + 5 bits size = 125 bits (fits in u128)
- 60% more capacity per node = fewer splits under concurrent writes
- ~3x faster for high-contention concurrent writes (see benchmarks at top)
This optimization wasn't practical for the 2012 C++ implementation but is feasible now with modern 128-bit atomic support.
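The bit arithmetic works out as follows. This is a sketch of the packing only; the actual field order in the implementation may differ:

```rust
// WIDTH=24 permutation packing: low 5 bits hold the entry count,
// followed by 24 five-bit slot fields: 5 + 24*5 = 125 bits (< 128).
fn count(perm: u128) -> u32 {
    (perm & 0x1F) as u32
}

fn slot(perm: u128, i: u32) -> u32 {
    ((perm >> (5 + 5 * i)) & 0x1F) as u32
}

fn with_slot(perm: u128, i: u32, s: u32) -> u128 {
    let shift = 5 + 5 * i;
    (perm & !(0x1F_u128 << shift)) | ((s as u128) << shift)
}

fn main() {
    let mut perm: u128 = 0;
    perm = with_slot(perm, 23, 23); // highest field occupies bits 120..125
    perm = (perm & !0x1F) | 24;     // count = 24 (full node)
    assert_eq!(count(perm), 24);
    assert_eq!(slot(perm, 23), 23);
}
```

Slot indices up to 23 need 5 bits each, which is why WIDTH=24 no longer fits the 4-bit fields of the `u64` encoding.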
## Limitations
- No deletion — keys cannot be removed (planned)
- No range scans — ordered iteration not yet implemented (planned)
- Integer keys — use Congee instead, it's 2x faster
- Max key length — 256 bytes
- Memory — nodes freed only when tree drops (full reclamation planned)
## Running Benchmarks

```sh
# WIDTH=24 concurrent benchmarks
# Full comparison with other structures
# Just the throughput scaling benchmarks
# vs lock-wrapped BTreeMap
```
### WIDTH=24 Concurrent Writes (50k ops/thread)
| Threads | Fastest | Slowest | Median |
|---|---|---|---|
| 1 | 6.56ms | 10.18ms | 7.16ms |
| 2 | 8.12ms | 15.96ms | 12.14ms |
| 4 | 13.49ms | 16.9ms | 14.22ms |
| 8 | 21.47ms | 30.35ms | 24.31ms |
| 16 | 40.37ms | 954.8ms | 42.16ms |
| 32 | 78.54ms | 983ms | 82.64ms |
Median times scale well. Occasional outliers at 16+ threads due to contention.
## References
- Masstree Paper (EUROSYS 2012)
- C++ Reference Implementation
- Congee — Concurrent ART
- seize — Hyaline Memory Reclamation
## License
MIT
## Disclaimer
Independent Rust implementation. Not affiliated with or endorsed by the original Masstree authors.