Bloom_lfs — LLAMA Log-Structured Storage

A high-performance, latch-free log-structured storage layer built for modern flash storage and multi-core systems.

Overview

Bloom_lfs implements the storage foundation of LLAMA (Latch-free, Log-structured, Access-Method Aware) — a concurrent caching and storage subsystem designed to exploit the performance characteristics of append-only writes on flash media while avoiding the costs of random I/O and excessive write amplification.

The project currently implements the log-structured secondary storage component, which provides:

High write throughput through batched, sequential writes
Latch-free concurrency via atomic operations on packed state words
Low write amplification with tail-localized append patterns
Direct I/O (O_DIRECT) to bypass kernel page cache overhead
Asynchronous I/O via io_uring for efficient kernel interactions

Architecture

Core Components

┌─────────────────────────────────────────────────────────────┐
│                    FlushBufferRing                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │
│  │ Buffer 0 │  │ Buffer 1 │  │ Buffer 2 │  │ Buffer 3 │   │
│  │  4 KiB   │  │  4 KiB   │  │  4 KiB   │  │  4 KiB   │   │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │
│      ▲                                                       │
│      └─── current_buffer (atomic pointer)                   │
└─────────────────────────────────────────────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │   FlushBehavior       │
              │  ┌─────────────────┐  │
              │  │ io_uring        │  │
              │  │ (async I/O)     │  │
              │  └─────────────────┘  │
              └───────────────────────┘
                          │
                          ▼
              ┌───────────────────────┐
              │ LogStructuredStore    │
              │  (O_DIRECT backing    │
              │   file on disk)       │
              └───────────────────────┘

1. FlushBuffer & FlushBufferRing (`flush_buffer.rs`)

A fixed-size ring of 4 KiB-aligned buffers that amortizes individual writes into batched sequential I/O operations.

Key Features:

Latch-free writes: Single packed AtomicUsize state word per buffer eliminates locks
Concurrent filling: Multiple threads write to one buffer simultaneously via atomic fetch-add
Automatic rotation: Buffers seal when full and trigger asynchronous flush
O_DIRECT compatibility: All allocations aligned to 4 KiB page boundaries

State Word Layout:

┌────────────────┬────────────────┬──────────────┬─────────────┬─────────┐
│  Bits 63..32   │  Bits 31..8    │  Bits 7..2   │  Bit 1      │  Bit 0  │
│  write offset  │  writer count  │  (reserved)  │  flushing   │  sealed │
└────────────────┴────────────────┴──────────────┴─────────────┴─────────┘

2. LogStructuredStore (`log_structured_store.rs`)

The durable backing file for page state. All writes flow through the FlushBufferRing and are dispatched via FlushBehavior.

Key Features:

Stability tracking: hi_stable pointer tracks highest contiguous durable slot
Out-of-order handling: completed_islands set manages non-contiguous completions
Crash recovery: Can rebuild consistent state from hi_stable watermark
Write amplification: Near-zero — buffers are written once at their assigned slots

Address Space:

byte_offset = local_lss_address_slot × FOUR_KB_PAGE

Slots are claimed atomically at buffer seal time, guaranteeing no two buffers ever map to the same file region.

3. FlushBehavior (`flush_behaviour.rs`)

Handles io_uring-backed write dispatch with two operating modes:

Mode	Description	Use Case
TailLocalizedWrites	Parallel writes within bounded tail region	High-throughput ingestion
SerializedWrites	Strict append order (`IO_LINK` flag)	WAL segments, recovery logs

Tail-Localized Writes:

Buffers write concurrently; flushes may complete out-of-order
Maximum write distance from tail: RING_SIZE × FOUR_KB_PAGE
Suitable for page-state storage where absolute ordering is not critical

Serialized Writes:

Each write waits for the previous to complete (IO_LINK)
Enforces submission-order on disk
Reduced parallelism but guaranteed ordering

Write Path

  Caller
    │
    │  write_payload(&[u8], Reservation)
    ▼
  reserve_space(payload_size)      ← atomic fetch-add on state word
    │
    ▼
  FlushBufferRing::put()            ← copy payload into aligned buffer
    │
    ├─ buffer not full ───────────────────► Ok(SuccessfulWrite)
    │
    └─ buffer full (sealed)
         │
         ├─ rotate ring to next available buffer
         │
         └─ FlushBehavior::submit_buffer()
                │
                ├─ NoWaitAppender  (TailLocalizedWrites)
                │    └─ io_uring write_at(offset)           ← unordered
                │
                └─ WaitAppender   (SerializedWrites)
                     └─ io_uring write_at(offset, IO_LINK)  ← strict order

Design Rationale

Why Log-Structured?

Flash-Optimized:

Sequential writes exploit flash erase-block characteristics
Reduces write amplification (each page written once)
Enables wear leveling across the device

Concurrency-Friendly:

No in-place updates → no contention on individual slots
Latch-free buffer management via atomic operations
Caller threads participate in flush without blocking

Recovery-Efficient:

hi_stable watermark provides instant crash-consistent snapshot
No need to scan entire file during recovery
Out-of-order completions handled transparently via island set

Why `O_DIRECT`?

Bypasses kernel page cache — DMA directly to device
Predictable latency — no cache eviction surprises
Lower memory pressure — userspace controls all buffering

Tradeoff: Requires strict alignment (4 KiB) for all buffers.

Why `io_uring`?

Asynchronous I/O — submit returns immediately, no blocking
Batched operations — submit multiple SQEs in one syscall
Zero-copy — kernel reads directly from userspace buffers
Polled completions — no context switch per completion

Planned Components

The full LLAMA design includes (not yet implemented):

Mapping Tables: Lock-free in-memory index for locating page deltas
Caching Layer: Physical storage for hot deltas in memory
Recovery Protocol: Rebuild mapping table after crashes

This repository currently implements the log-structured secondary storage foundation.

Usage Example

use bloom_lfs::log_structured_store::{LogStructuredStore, WriteMode};

// Open store with tail-localized writes (high throughput)
let store = LogStructuredStore::open_with_behavior(
    "/var/lib/llama/data.lss",
    WriteMode::TailLocalizedWrites,
)?;

// Reserve space, write payload, poll completions
let reservation = store.reserve_space(64)?;
store.write_payload(b"hello, LLAMA", reservation)?;
store.check_async_cque(); // Process completed writes

// For WAL or recovery log, use serialized mode
let wal = LogStructuredStore::open_with_behavior(
    "/var/lib/llama/wal.lss",
    WriteMode::SerializedWrites,
)?;

Performance Characteristics

Metric	Value
Buffer size	4 KiB (aligned to page boundary)
Ring size	4 buffers (default, configurable)
Max tail write distance	`RING_SIZE × 4 KiB` = 16 KiB (tail-localized)
Concurrency model	Latch-free (atomic operations only)
Write amplification	~1.0 (each page written once)
I/O pattern	Sequential append (flash-optimized)

Requirements

Linux kernel ≥ 5.1 (for io_uring support)
Rust ≥ 1.70 (for atomic operations and UnsafeCell patterns)

Dependencies

[dependencies]
io-uring = "0.6"
parking_lot = "0.12"
libc = "0.2"

Safety

All unsafe code is audited and justified:

FlushBuffer::write: Raw pointer dereference into aligned allocation
- Invariant: reserve_space grants exclusive, non-overlapping ranges
FlushBufferRing::current_buffer: Raw atomic pointer load
- Invariant: Ring guarantees pointer validity for store lifetime
io_uring SQE construction: Pointers stored in user_data
- Invariant: Buffers pinned in memory, flush-in-progress bit prevents reuse

Future Work

Implement mapping tables (lock-free hash table)
Add caching layer (LRU eviction policy)
Recovery protocol (replay log from hi_stable)
Benchmarking suite (vs. baseline file I/O)
Delta compression (reduce storage overhead)
Remote storage support (networked backing file)

References

LLAMA Paper: Lev-Ari et al., "LLAMA: A Cache/Storage Subsystem for Modern Hardware" (VLDB 2013)
io_uring: Axboe, Jens. "Efficient IO with io_uring" (kernel.dk)
Log-Structured Storage: Rosenblum & Ousterhout, "The Design and Implementation of a Log-Structured File System" (SOSP 1991)

License

[Specify your license here]

Authors

[Specify authors/contributors here]

bloom_lfs 0.1.0