stoolap 0.4.0

High-performance embedded SQL database with MVCC, time-travel queries, and full ACID compliance
Documentation
---
layout: doc
title: MVCC Implementation
category: Architecture
order: 4
---

# MVCC Implementation

This document provides a detailed explanation of Stoolap's Multi-Version Concurrency Control (MVCC) implementation, which enables transaction isolation without locking.

## MVCC Overview

Multi-Version Concurrency Control (MVCC) is a concurrency control method used by Stoolap to provide transaction isolation. The key principles are:

1. Maintain full version chains for each row with unlimited history
2. Track deletion status with transaction IDs for proper visibility
3. Each transaction has a consistent view based on visibility rules
4. Reads never block writes, and writes never block reads
5. Implement optimistic concurrency control for conflict detection

## Design Philosophy

Stoolap implements a **true multi-version MVCC** design:

- **Full Version Chains**: Unlimited version history per row linked via `prev` pointers
- **In-Memory Chains**: Version chains built from WAL replay during recovery
- **Immutable Versions**: New versions always created, never modified in place
- **Efficient Persistence**: Latest version persisted to frozen volumes via checkpoint cycle
- **Automatic Cleanup**: Old versions garbage collected when no longer needed

## Core Components

### Transaction Registry

- Manages transaction lifecycle and state tracking
- Assigns unique transaction IDs using atomic counters
- Tracks active and committed transactions with monotonic sequences
- Supports per-transaction isolation levels without race conditions
- Implements visibility rules for both READ COMMITTED and SNAPSHOT isolation

### Version Store

- Maintains full version chains for each row
- Tracks both creation (`txn_id`) and deletion (`deleted_at_txn_id`) transaction IDs
- Implements efficient concurrent access using concurrent data structures
- Manages B-tree, Hash, Bitmap, and HNSW indexes
- Provides visibility-aware traversal of version chains

### Row Version Structure

```rust
struct RowVersion {
    txn_id: i64,           // Transaction that created this version
    deleted_at_txn_id: i64, // Transaction that deleted this version (0 if not deleted)
    data: Row,             // Complete row data
    create_time: i64,      // Timestamp when created
}

struct VersionChainEntry {
    version: RowVersion,
    prev: Option<Arc<VersionChainEntry>>, // Previous version in the chain
    arena_idx: Option<NonZeroU64>,        // Index into arena for zero-copy access
}
```

The `prev` pointer on `VersionChainEntry` creates a backward-linked chain from newest to oldest version. `Arc` allows cheap cloning of version chains across concurrent readers.

## Transaction IDs and Timestamps

Stoolap uses monotonic sequences instead of wall-clock timestamps to avoid platform-specific timing issues:

- **Transaction ID**: Unique identifier assigned atomically
- **Begin Sequence**: Monotonic sequence when transaction starts
- **Commit Sequence**: Monotonic sequence when transaction commits
- **Write Sequences**: Track when rows were last modified for conflict detection

This approach solves Windows' 15.6ms timer resolution issue and ensures consistent ordering.

## Isolation Levels

### READ COMMITTED (Default)

- Transactions see committed changes immediately
- No global locks for commits - high concurrency
- Each statement sees the latest committed data
- Suitable for most OLTP workloads

Implementation:
```rust
// In READ COMMITTED, only check if transaction is committed
fn is_directly_visible(&self, version_txn_id: i64) -> bool {
    self.committed_transactions.contains(version_txn_id)
}
```

### SNAPSHOT Isolation

- Transactions see a consistent snapshot from when they started
- Write-write conflict detection prevents lost updates
- Lock-free commits with optimistic concurrency control
- High throughput with strong consistency guarantees

Implementation:
```rust
// In SNAPSHOT, check if version was committed before viewer began
fn is_visible(&self, version_txn_id: i64, viewer_txn_id: i64) -> bool {
    if let Some(commit_ts) = self.committed_transactions.get(version_txn_id) {
        let viewer_begin_ts = self.get_transaction_begin_seq(viewer_txn_id);
        commit_ts <= viewer_begin_ts
    } else {
        false
    }
}
```

## Visibility Rules

### Version Chain Traversal

Visibility is determined by traversing the version chain:
```rust
// Traverse the version chain from newest to oldest
let mut current = version_ptr;
while let Some(version) = current {
    if registry.is_visible(version.txn_id, txn_id) {
        // Check deletion visibility
        if version.deleted_at_txn_id != 0
            && registry.is_visible(version.deleted_at_txn_id, txn_id) {
            return None; // Deleted
        }
        return Some(version); // Found visible version
    }
    current = version.prev.as_ref();
}
```

### Row Visibility

A row is visible to a transaction if:
1. A version exists in the chain that was created by a visible transaction, AND
2. That version was NOT deleted, OR the deletion is not visible

### Transaction-Specific Isolation

Each transaction maintains its own isolation level:
```rust
// Set isolation level for specific transaction
registry.set_transaction_isolation_level(txn_id, level);

// Get isolation level for visibility checks
let isolation_level = registry.get_isolation_level(txn_id);
```

## Concurrency Control

### SNAPSHOT Isolation Conflicts

Write-write conflict detection during commit:

```rust
// Check for conflicts before commit
if version_store.check_write_conflict(&written_rows, begin_seq) {
    return Err(StoolapError::Transaction(
        "transaction aborted due to write-write conflict".to_string()
    ));
}

// Set write sequences after successful commit
version_store.set_write_sequences(&written_rows, commit_seq);
```

### Lock-Free Commit Process

SNAPSHOT commits use optimistic concurrency control:
1. Check for write-write conflicts
2. Generate commit sequence
3. Apply changes to version stores (creating new versions)
4. Set write sequences atomically
5. Mark transaction as committed

No global mutex needed - conflicts detected through version checks.

## Version Chain Management

### Updates

When a row is updated:
- A new version is created with the updating transaction's ID
- The new version's `prev` pointer links to the current version
- The version chain grows backward in time
- All historical versions remain accessible

### Deletions

When a row is deleted:
- A new version is created with `DeletedAtTxnID` set
- The deletion version links to the previous version
- Data is preserved in the deletion version
- The row appears deleted to transactions that see this version

### Version Chain Example

```
[Newest] -> Version 3 (TxnID=300, DeletedAt=400) 
              |
              v
            Version 2 (TxnID=200)
              |
              v
            Version 1 (TxnID=100) -> [Oldest]
```

## Performance Optimizations

### Lock-Free Data Structures

- `I64Map`: Custom high-performance hash map for i64 keys with zero-collision sequential key hashing
- Atomic operations for counters and flags
- Minimal mutex usage in hot paths
- `parking_lot::RwLock` for fine-grained concurrent access

### Optimized Visibility Checks

- Fast path for own-transaction visibility
- Direct visibility check for READ COMMITTED
- Batch processing for bulk operations

### Memory Management

- Version chains built on-demand from WAL
- Automatic cleanup of old versions no longer needed
- Periodic cleanup of deleted rows
- Arena-based row storage for cache-efficient access

## Garbage Collection

### Version Chain Cleanup

Old versions in chains are cleaned up when:
1. No active transaction can see them
2. A newer version is visible to all active transactions

```rust
// Find oldest version still needed
let mut current = version;
let mut last_needed = None;
while let Some(v) = current {
    for txn_id in &active_transactions {
        if registry.is_visible(v.txn_id, *txn_id) {
            last_needed = Some(v);
            break;
        }
    }
    current = v.prev.as_ref();
}
// Disconnect older versions (Rust's ownership handles cleanup)
if let Some(last) = last_needed {
    last.prev = None; // Drop older versions
}
```

### Deleted Row Cleanup

Deleted rows are removed based on:
1. Retention period (age-based)
2. Transaction visibility (no active transaction can see them)
3. Safety checks to prevent removing visible data

## Key Implementation Files

- `src/storage/mvcc/engine.rs` - MVCC engine coordinating all components
- `src/storage/mvcc/table.rs` - Table operations with MVCC support
- `src/storage/mvcc/transaction.rs` - Transaction implementation and conflict detection
- `src/storage/mvcc/version_store.rs` - Row version storage and management

## Best Practices

1. **Choose Appropriate Isolation**: Use READ COMMITTED unless you need snapshot consistency
2. **Keep Transactions Short**: Long transactions delay garbage collection
3. **Handle Conflicts**: Implement retry logic for SNAPSHOT conflicts
4. **Monitor Deleted Rows**: Ensure garbage collection keeps up with deletions
5. **Batch Operations**: Group related changes in single transactions

## Implemented Features

The following advanced features are fully implemented:
1. **Time Travel Queries**: `AS OF TIMESTAMP` and `AS OF TRANSACTION` for querying historical data
2. **Isolation Level Aliases**: `REPEATABLE READ`, `SERIALIZABLE`, `SNAPSHOT` all map to Snapshot Isolation
3. **Savepoint Support**: `SAVEPOINT`, `ROLLBACK TO SAVEPOINT`, `RELEASE SAVEPOINT` for partial rollbacks
4. **Arena-Based Storage**: Cache-efficient row storage with zero-copy access via `NonZeroU64` indices

## Future Improvements

Potential enhancements to the current design:
1. **Read-Set Tracking**: Detect read-write conflicts for true SERIALIZABLE isolation
2. **Version Compression**: Delta encoding for version chains