# Bloom_lfs — LLAMA Log-Structured Storage

A high-performance, latch-free log-structured storage layer built for modern flash storage and multi-core systems.

## Overview

Bloom_lfs implements the storage foundation of LLAMA (Latch-free, Log-structured, Access-Method Aware) — a concurrent caching and storage subsystem designed to exploit the performance characteristics of append-only writes on flash media while avoiding the costs of random I/O and excessive write amplification.

The project currently implements the **log-structured secondary storage** component, which provides:

- **High write throughput** through batched, sequential writes
- **Latch-free concurrency** via atomic operations on packed state words
- **Low write amplification** with tail-localized append patterns
- **Direct I/O (`O_DIRECT`)** to bypass kernel page cache overhead
- **Asynchronous I/O** via `io_uring` for efficient kernel interactions

## Architecture

### Core Components

```
┌─────────────────────────────────────────────────────────────┐
│                    FlushBufferRing                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ Buffer 0 │  │ Buffer 1 │  │ Buffer 2 │  │ Buffer 3 │     │
│  │  4 KiB   │  │  4 KiB   │  │  4 KiB   │  │  4 KiB   │     │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘     │
│      ▲                                                      │
│      └─── current_buffer (atomic pointer)                   │
└─────────────────────────────────────────────────────────────┘
              ┌───────────────────────┐
              │   FlushBehavior       │
              │  ┌─────────────────┐  │
              │  │ io_uring        │  │
              │  │ (async I/O)     │  │
              │  └─────────────────┘  │
              └───────────────────────┘
              ┌───────────────────────┐
              │ LogStructuredStore    │
              │  (O_DIRECT backing    │
              │   file on disk)       │
              └───────────────────────┘
```

### 1. **FlushBuffer & FlushBufferRing** (`flush_buffer.rs`)

A fixed-size ring of 4 KiB-aligned buffers that coalesces many small writes into batched, sequential I/O operations.

**Key Features:**
- **Latch-free writes**: Single packed `AtomicUsize` state word per buffer eliminates locks
- **Concurrent filling**: Multiple threads write to one buffer simultaneously via atomic fetch-add
- **Automatic rotation**: Buffers seal when full and trigger asynchronous flush
- **`O_DIRECT` compatibility**: All allocations aligned to 4 KiB page boundaries

**State Word Layout:**
```
┌────────────────┬────────────────┬──────────────┬─────────────┬─────────┐
│  Bits 63..32   │  Bits 31..8    │  Bits 7..2   │  Bit 1      │  Bit 0  │
│  write offset  │  writer count  │  (reserved)  │  flushing   │  sealed │
└────────────────┴────────────────┴──────────────┴─────────────┴─────────┘
```
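
To make the layout concrete, here is a minimal sketch of a `reserve_space`-style claim against such a packed word: a single `fetch_add` bumps the write offset and the writer count together. The bit positions follow the diagram above, but the constants and the `reserve` helper are illustrative, not the crate's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SEALED: usize = 1 << 0;   // bit 0
const WRITERS_SHIFT: u32 = 8;   // bits 31..8
const WRITERS_MASK: usize = 0xFF_FFFF << WRITERS_SHIFT;
const OFFSET_SHIFT: u32 = 32;   // bits 63..32

/// Claim `len` bytes and register as a writer in one atomic RMW step.
/// (A real buffer would also check the new offset against capacity and
/// seal on overrun -- see the write path section below.)
fn reserve(state: &AtomicUsize, len: usize) -> Option<usize> {
    let delta = (len << OFFSET_SHIFT) | (1 << WRITERS_SHIFT);
    let prev = state.fetch_add(delta, Ordering::AcqRel);
    if prev & SEALED != 0 {
        state.fetch_sub(delta, Ordering::AcqRel); // back out: buffer sealed
        return None;
    }
    Some(prev >> OFFSET_SHIFT) // start offset of this reservation
}

fn main() {
    let state = AtomicUsize::new(0);
    assert_eq!(reserve(&state, 64), Some(0));
    assert_eq!(reserve(&state, 128), Some(64));
    let s = state.load(Ordering::Acquire);
    assert_eq!(s >> OFFSET_SHIFT, 192);                 // write offset
    assert_eq!((s & WRITERS_MASK) >> WRITERS_SHIFT, 2); // writer count
}
```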

### 2. **LogStructuredStore** (`log_structured_store.rs`)

The durable backing file for page state. All writes flow through the `FlushBufferRing` and are dispatched via `FlushBehavior`.

**Key Features:**
- **Stability tracking**: `hi_stable` pointer tracks highest contiguous durable slot
- **Out-of-order handling**: `completed_islands` set manages non-contiguous completions
- **Crash recovery**: Can rebuild consistent state from `hi_stable` watermark
- **Write amplification**: ~1.0 — buffers are written exactly once at their assigned slots

**Address Space:**
```
byte_offset = local_lss_address_slot × FOUR_KB_PAGE
```

Slots are claimed atomically at buffer seal time, guaranteeing no two buffers ever map to the same file region.
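
A sketch of that claim under the assumption of a single atomic tail counter; `NEXT_SLOT` and `claim_slot` are hypothetical names, not the crate's API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const FOUR_KB_PAGE: u64 = 4096;

/// Hypothetical tail counter: the next free 4 KiB slot in the backing file.
static NEXT_SLOT: AtomicU64 = AtomicU64::new(0);

/// At seal time, fetch_add hands every caller a distinct slot, so no two
/// sealed buffers can ever target the same file region.
fn claim_slot() -> (u64, u64) {
    let slot = NEXT_SLOT.fetch_add(1, Ordering::Relaxed);
    (slot, slot * FOUR_KB_PAGE) // (slot, byte_offset) per the formula above
}

fn main() {
    assert_eq!(claim_slot(), (0, 0));
    assert_eq!(claim_slot(), (1, 4096));
}
```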

### 3. **FlushBehavior** (`flush_behaviour.rs`)

Handles `io_uring`-backed write dispatch with two operating modes:

| Mode                      | Description                                    | Use Case                    |
|---------------------------|------------------------------------------------|-----------------------------|
| **TailLocalizedWrites**   | Parallel writes within bounded tail region     | High-throughput ingestion   |
| **SerializedWrites**      | Strict append order (`IO_LINK` flag)           | WAL segments, recovery logs |

**Tail-Localized Writes:**
- Buffers flush concurrently; completions may arrive out of order
- Maximum write distance from tail: `RING_SIZE × FOUR_KB_PAGE`
- Suitable for page-state storage where absolute ordering is not critical

**Serialized Writes:**
- Each write waits for the previous to complete (`IO_LINK`)
- Enforces submission order on disk
- Reduced parallelism but guaranteed ordering
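
As a sketch of how the two modes map onto the `io-uring` crate from the dependency list: the loop below chains two writes with `IO_LINK` (the SerializedWrites shape); dropping the flag yields the unordered TailLocalizedWrites shape. The path is a placeholder, and for brevity the file is opened without `O_DIRECT`:

```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, squeue::Flags, types, IoUring};

fn main() -> std::io::Result<()> {
    let file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("/tmp/lss-demo.bin")?; // placeholder path
    let fd = types::Fd(file.as_raw_fd());

    let bufs = [[0u8; 4096], [1u8; 4096]];
    let mut ring = IoUring::new(8)?;

    for (i, buf) in bufs.iter().enumerate() {
        let mut sqe = opcode::Write::new(fd, buf.as_ptr(), buf.len() as u32)
            .offset((i * 4096) as _)
            .build()
            .user_data(i as u64);
        // IO_LINK chains this SQE to the next one: the kernel starts the
        // second write only after the first completes (SerializedWrites).
        if i + 1 < bufs.len() {
            sqe = sqe.flags(Flags::IO_LINK);
        }
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    ring.submit_and_wait(bufs.len())?; // one syscall submits the whole batch
    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "write failed: {}", cqe.result());
    }
    Ok(())
}
```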

## Write Path

```
  Caller
    │  reserve_space(payload_size)    ← atomic fetch-add on state word
    │  write_payload(&[u8], Reservation)
  FlushBufferRing::put()              ← copy payload into aligned buffer
    ├─ buffer not full ───────────────────► Ok(SuccessfulWrite)
    └─ buffer full (sealed)
         ├─ rotate ring to next available buffer
         └─ FlushBehavior::submit_buffer()
                ├─ NoWaitAppender  (TailLocalizedWrites)
                │    └─ io_uring write_at(offset)           ← unordered
                └─ WaitAppender   (SerializedWrites)
                     └─ io_uring write_at(offset, IO_LINK)  ← strict order
```
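
The seal branch at the bottom of this diagram can be sketched as follows, reusing the assumed state-word constants from the FlushBuffer section; `PutResult` and `try_put` are illustrative names, and a real implementation would rotate the ring and call `FlushBehavior::submit_buffer()` where the comments indicate:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 4096; // one 4 KiB buffer
const SEALED: usize = 1 << 0;
const OFFSET_SHIFT: u32 = 32;

enum PutResult {
    SuccessfulWrite { offset: usize },
    BufferSealed { first_sealer: bool }, // the first sealer owns the flush
}

fn try_put(state: &AtomicUsize, len: usize) -> PutResult {
    let prev = state.fetch_add(len << OFFSET_SHIFT, Ordering::AcqRel);
    let start = prev >> OFFSET_SHIFT;
    if prev & SEALED != 0 || start + len > CAPACITY {
        // Reservation overran the buffer (or someone sealed it first):
        // set the sealed bit. Exactly one caller observes the 0 -> 1
        // transition and would submit the buffer for flushing.
        let was = state.fetch_or(SEALED, Ordering::AcqRel);
        return PutResult::BufferSealed { first_sealer: was & SEALED == 0 };
    }
    PutResult::SuccessfulWrite { offset: start }
}

fn main() {
    let state = AtomicUsize::new(0);
    let (mut ok, mut submits) = (0, 0);
    for _ in 0..80 {
        match try_put(&state, 64) {
            PutResult::SuccessfulWrite { offset } => {
                debug_assert!(offset < CAPACITY);
                ok += 1;
            }
            PutResult::BufferSealed { first_sealer } => submits += first_sealer as usize,
        }
    }
    assert_eq!((ok, submits), (64, 1)); // 64 puts fit; the 65th seals exactly once
}
```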

## Design Rationale

### Why Log-Structured?

**Flash-Optimized:**
- Sequential writes exploit flash erase-block characteristics
- Reduces write amplification (each page written once)
- Enables wear leveling across the device

**Concurrency-Friendly:**
- No in-place updates → no contention on individual slots
- Latch-free buffer management via atomic operations
- Caller threads participate in flush without blocking

**Recovery-Efficient:**
- `hi_stable` watermark provides instant crash-consistent snapshot
- No need to scan entire file during recovery
- Out-of-order completions handled transparently via island set
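
A minimal sketch of that watermark logic; the struct and method are illustrative and ignore the synchronization a concurrent store would need:

```rust
use std::collections::BTreeSet;

/// hi_stable: highest slot such that every slot at or below it is durable.
/// Completions above the watermark park in the island set until the gap
/// below them closes.
struct Stability {
    hi_stable: Option<u64>,
    completed_islands: BTreeSet<u64>,
}

impl Stability {
    fn complete(&mut self, slot: u64) {
        self.completed_islands.insert(slot);
        // Absorb every island now contiguous with the watermark.
        loop {
            let next = self.hi_stable.map_or(0, |h| h + 1);
            if !self.completed_islands.remove(&next) {
                break;
            }
            self.hi_stable = Some(next);
        }
    }
}

fn main() {
    let mut s = Stability { hi_stable: None, completed_islands: BTreeSet::new() };
    s.complete(1); // out of order: slot 0 is not durable yet
    assert_eq!(s.hi_stable, None);
    s.complete(0); // gap closed: watermark jumps over both slots
    assert_eq!(s.hi_stable, Some(1));
    assert!(s.completed_islands.is_empty());
}
```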

### Why `O_DIRECT`?

- **Bypasses kernel page cache** — DMA directly to device
- **Predictable latency** — no cache eviction surprises
- **Lower memory pressure** — userspace controls all buffering

**Tradeoff:** Requires strict alignment (4 KiB) for all buffers.
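
A sketch of satisfying that contract with `std` plus the `libc` crate from the dependency list; the path is a placeholder, and `O_DIRECT` additionally requires a filesystem that supports it:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::OpenOptions;
use std::os::unix::fs::OpenOptionsExt;

const FOUR_KB: usize = 4096;

fn main() -> std::io::Result<()> {
    // With O_DIRECT the kernel DMAs straight between our buffer and the
    // device, so buffer address, length, and file offset must all be
    // 4 KiB-aligned.
    let _file = OpenOptions::new()
        .create(true)
        .write(true)
        .custom_flags(libc::O_DIRECT)
        .open("/tmp/lss-direct.bin")?; // placeholder path

    // A page-aligned allocation that satisfies the alignment contract.
    let layout = Layout::from_size_align(FOUR_KB, FOUR_KB).unwrap();
    let buf = unsafe { alloc_zeroed(layout) };
    assert!(!buf.is_null());
    assert_eq!(buf as usize % FOUR_KB, 0);

    // ... fill `buf` and hand it to io_uring at an aligned offset ...

    unsafe { dealloc(buf, layout) };
    Ok(())
}
```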

### Why `io_uring`?

- **Asynchronous I/O** — submit returns immediately, no blocking
- **Batched operations** — submit multiple SQEs in one syscall
- **Zero-copy** — kernel reads directly from userspace buffers
- **Polled completions** — no context switch per completion
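
A sketch of the fire-and-poll shape this enables, in the spirit of the `check_async_cque()` call in the usage example below; the path is a placeholder and error handling is elided:

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let file = File::create("/tmp/lss-poll.bin")?; // placeholder path
    let buf = [0u8; 4096];
    let mut ring = IoUring::new(4)?;

    let sqe = opcode::Write::new(types::Fd(file.as_raw_fd()), buf.as_ptr(), buf.len() as u32)
        .offset(0)
        .build()
        .user_data(7);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };

    // submit() hands the SQE to the kernel and returns immediately; the
    // write proceeds while the caller keeps doing useful work.
    ring.submit()?;

    // Later: drain whatever has completed, without blocking. An empty
    // queue just means nothing finished yet -- poll again later.
    while let Some(cqe) = ring.completion().next() {
        println!("user_data {} -> result {}", cqe.user_data(), cqe.result());
    }
    Ok(())
}
```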

## Planned Components

The full LLAMA design also includes the following components, which are not yet implemented:

- **Mapping Tables**: Lock-free in-memory index for locating page deltas
- **Caching Layer**: Physical storage for hot deltas in memory
- **Recovery Protocol**: Rebuild mapping table after crashes

This repository currently implements the **log-structured secondary storage** foundation.

## Usage Example

```rust
use bloom_lfs::log_structured_store::{LogStructuredStore, WriteMode};

// Open store with tail-localized writes (high throughput)
let store = LogStructuredStore::open_with_behavior(
    "/var/lib/llama/data.lss",
    WriteMode::TailLocalizedWrites,
)?;

// Reserve space, write payload, poll completions
let reservation = store.reserve_space(64)?;
store.write_payload(b"hello, LLAMA", reservation)?;
store.check_async_cque(); // Process completed writes

// For WAL or recovery log, use serialized mode
let wal = LogStructuredStore::open_with_behavior(
    "/var/lib/llama/wal.lss",
    WriteMode::SerializedWrites,
)?;
```

## Performance Characteristics

| Metric                    | Value                                          |
|---------------------------|------------------------------------------------|
| Buffer size               | 4 KiB (aligned to page boundary)               |
| Ring size                 | 4 buffers (default, configurable)              |
| Max tail write distance   | `RING_SIZE × 4 KiB` = 16 KiB (tail-localized)  |
| Concurrency model         | Latch-free (atomic operations only)            |
| Write amplification       | ~1.0 (each page written once)                  |
| I/O pattern               | Sequential append (flash-optimized)            |

## Requirements

- **Linux kernel ≥ 5.1** (for `io_uring` support)
- **Rust ≥ 1.70** (for atomic operations and `UnsafeCell` patterns)

## Dependencies

```toml
[dependencies]
io-uring = "0.6"
parking_lot = "0.12"
libc = "0.2"
```

## Safety

All `unsafe` code is audited and justified:

- **`FlushBuffer::write`**: Raw pointer dereference into aligned allocation
  - **Invariant**: `reserve_space` grants exclusive, non-overlapping ranges
- **`FlushBufferRing::current_buffer`**: Raw atomic pointer load
  - **Invariant**: Ring guarantees pointer validity for store lifetime
- **`io_uring` SQE construction**: Pointers stored in user_data
  - **Invariant**: Buffers pinned in memory, flush-in-progress bit prevents reuse
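
As an illustration of the first invariant, a compile-ready sketch of the shape of that unsafe write; the types are illustrative, not the crate's:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Illustrative reservation from reserve_space(): an exclusive byte range.
struct Reservation { start: usize, len: usize }

struct FlushBufferSketch { data: *mut u8 } // 4 KiB-aligned allocation

impl FlushBufferSketch {
    /// SAFETY: sound only because reserve_space() hands out exclusive,
    /// non-overlapping ranges that fit inside the allocation, so no two
    /// threads ever write the same bytes.
    unsafe fn write(&self, r: &Reservation, payload: &[u8]) {
        debug_assert_eq!(payload.len(), r.len);
        std::ptr::copy_nonoverlapping(payload.as_ptr(), self.data.add(r.start), r.len);
    }
}

fn main() {
    let layout = Layout::from_size_align(4096, 4096).unwrap();
    let data = unsafe { alloc_zeroed(layout) };
    let buf = FlushBufferSketch { data };
    unsafe { buf.write(&Reservation { start: 0, len: 5 }, b"hello") };
    unsafe { dealloc(data, layout) };
}
```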

## Future Work

- [ ] Add caching layer (LRU eviction policy)
- [ ] Recovery protocol (replay log from `hi_stable`)
- [ ] Remote storage support (networked backing file)

## References

- **LLAMA Paper**: J. Levandoski, D. Lomet, and S. Sengupta, "LLAMA: A Cache/Storage Subsystem for Modern Hardware," PVLDB 6(10), 2013