Bloom_lfs — LLAMA Log-Structured Storage
A high-performance, latch-free log-structured storage layer built for modern flash storage and multi-core systems.
Overview
Bloom_lfs implements the storage foundation of LLAMA (Latch-free, Log-structured, Access-Method Aware) — a concurrent caching and storage subsystem designed to exploit the performance characteristics of append-only writes on flash media while avoiding the costs of random I/O and excessive write amplification.
The project currently implements the log-structured secondary storage component, which provides:
- High write throughput through batched, sequential writes
- Latch-free concurrency via atomic operations on packed state words
- Low write amplification with tail-localized append patterns
- Direct I/O (
O_DIRECT) to bypass kernel page cache overhead - Asynchronous I/O via
io_uringfor efficient kernel interactions
Architecture
Core Components
┌─────────────────────────────────────────────────────────────┐
│ FlushBufferRing │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Buffer 0 │ │ Buffer 1 │ │ Buffer 2 │ │ Buffer 3 │ │
│ │ 4 KiB │ │ 4 KiB │ │ 4 KiB │ │ 4 KiB │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ ▲ │
│ └─── current_buffer (atomic pointer) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ FlushBehavior │
│ ┌─────────────────┐ │
│ │ io_uring │ │
│ │ (async I/O) │ │
│ └─────────────────┘ │
└───────────────────────┘
│
▼
┌───────────────────────┐
│ LogStructuredStore │
│ (O_DIRECT backing │
│ file on disk) │
└───────────────────────┘
1. FlushBuffer & FlushBufferRing (flush_buffer.rs)
A fixed-size ring of 4 KiB-aligned buffers that amortizes individual writes into batched sequential I/O operations.
Key Features:
- Latch-free writes: Single packed
AtomicUsizestate word per buffer eliminates locks - Concurrent filling: Multiple threads write to one buffer simultaneously via atomic fetch-add
- Automatic rotation: Buffers seal when full and trigger asynchronous flush
O_DIRECTcompatibility: All allocations aligned to 4 KiB page boundaries
State Word Layout:
┌────────────────┬────────────────┬──────────────┬─────────────┬─────────┐
│ Bits 63..32 │ Bits 31..8 │ Bits 7..2 │ Bit 1 │ Bit 0 │
│ write offset │ writer count │ (reserved) │ flushing │ sealed │
└────────────────┴────────────────┴──────────────┴─────────────┴─────────┘
2. LogStructuredStore (log_structured_store.rs)
The durable backing file for page state. All writes flow through the FlushBufferRing and are dispatched via FlushBehavior.
Key Features:
- Stability tracking:
hi_stablepointer tracks highest contiguous durable slot - Out-of-order handling:
completed_islandsset manages non-contiguous completions - Crash recovery: Can rebuild consistent state from
hi_stablewatermark - Write amplification: Near-zero — buffers are written once at their assigned slots
Address Space:
byte_offset = local_lss_address_slot × FOUR_KB_PAGE
Slots are claimed atomically at buffer seal time, guaranteeing no two buffers ever map to the same file region.
3. FlushBehavior (flush_behaviour.rs)
Handles io_uring-backed write dispatch with two operating modes:
| Mode | Description | Use Case |
|---|---|---|
| TailLocalizedWrites | Parallel writes within bounded tail region | High-throughput ingestion |
| SerializedWrites | Strict append order (IO_LINK flag) |
WAL segments, recovery logs |
Tail-Localized Writes:
- Buffers write concurrently; flushes may complete out-of-order
- Maximum write distance from tail:
RING_SIZE × FOUR_KB_PAGE - Suitable for page-state storage where absolute ordering is not critical
Serialized Writes:
- Each write waits for the previous to complete (
IO_LINK) - Enforces submission-order on disk
- Reduced parallelism but guaranteed ordering
Write Path
Caller
│
│ write_payload(&[u8], Reservation)
▼
reserve_space(payload_size) ← atomic fetch-add on state word
│
▼
FlushBufferRing::put() ← copy payload into aligned buffer
│
├─ buffer not full ───────────────────► Ok(SuccessfulWrite)
│
└─ buffer full (sealed)
│
├─ rotate ring to next available buffer
│
└─ FlushBehavior::submit_buffer()
│
├─ NoWaitAppender (TailLocalizedWrites)
│ └─ io_uring write_at(offset) ← unordered
│
└─ WaitAppender (SerializedWrites)
└─ io_uring write_at(offset, IO_LINK) ← strict order
Design Rationale
Why Log-Structured?
Flash-Optimized:
- Sequential writes exploit flash erase-block characteristics
- Reduces write amplification (each page written once)
- Enables wear leveling across the device
Concurrency-Friendly:
- No in-place updates → no contention on individual slots
- Latch-free buffer management via atomic operations
- Caller threads participate in flush without blocking
Recovery-Efficient:
hi_stablewatermark provides instant crash-consistent snapshot- No need to scan entire file during recovery
- Out-of-order completions handled transparently via island set
Why O_DIRECT?
- Bypasses kernel page cache — DMA directly to device
- Predictable latency — no cache eviction surprises
- Lower memory pressure — userspace controls all buffering
Tradeoff: Requires strict alignment (4 KiB) for all buffers.
Why io_uring?
- Asynchronous I/O — submit returns immediately, no blocking
- Batched operations — submit multiple SQEs in one syscall
- Zero-copy — kernel reads directly from userspace buffers
- Polled completions — no context switch per completion
Planned Components
The full LLAMA design includes (not yet implemented):
- Mapping Tables: Lock-free in-memory index for locating page deltas
- Caching Layer: Physical storage for hot deltas in memory
- Recovery Protocol: Rebuild mapping table after crashes
This repository currently implements the log-structured secondary storage foundation.
Usage Example
use ;
// Open store with tail-localized writes (high throughput)
let store = open_with_behavior?;
// Reserve space, write payload, poll completions
let reservation = store.reserve_space?;
store.write_payload?;
store.check_async_cque; // Process completed writes
// For WAL or recovery log, use serialized mode
let wal = open_with_behavior?;
Performance Characteristics
| Metric | Value |
|---|---|
| Buffer size | 4 KiB (aligned to page boundary) |
| Ring size | 4 buffers (default, configurable) |
| Max tail write distance | RING_SIZE × 4 KiB = 16 KiB (tail-localized) |
| Concurrency model | Latch-free (atomic operations only) |
| Write amplification | ~1.0 (each page written once) |
| I/O pattern | Sequential append (flash-optimized) |
Requirements
- Linux kernel ≥ 5.1 (for
io_uringsupport) - Rust ≥ 1.70 (for atomic operations and
UnsafeCellpatterns)
Dependencies
[]
= "0.6"
= "0.12"
= "0.2"
Safety
All unsafe code is audited and justified:
FlushBuffer::write: Raw pointer dereference into aligned allocation- Invariant:
reserve_spacegrants exclusive, non-overlapping ranges
- Invariant:
FlushBufferRing::current_buffer: Raw atomic pointer load- Invariant: Ring guarantees pointer validity for store lifetime
io_uringSQE construction: Pointers stored in user_data- Invariant: Buffers pinned in memory, flush-in-progress bit prevents reuse
Future Work
- Implement mapping tables (lock-free hash table)
- Add caching layer (LRU eviction policy)
- Recovery protocol (replay log from
hi_stable) - Benchmarking suite (vs. baseline file I/O)
- Delta compression (reduce storage overhead)
- Remote storage support (networked backing file)
References
- LLAMA Paper: Lev-Ari et al., "LLAMA: A Cache/Storage Subsystem for Modern Hardware" (VLDB 2013)
- io_uring: Axboe, Jens. "Efficient IO with io_uring" (kernel.dk)
- Log-Structured Storage: Rosenblum & Ousterhout, "The Design and Implementation of a Log-Structured File System" (SOSP 1991)
License
[Specify your license here]
Authors
[Specify authors/contributors here]