prollytree 0.3.2

# ProllyTree Storage Backends Guide

ProllyTree supports multiple storage backends to meet different performance, persistence, and deployment requirements. This guide describes each available backend: its characteristics, use cases, and configuration options.

## Overview

ProllyTree uses a pluggable storage architecture through the `NodeStorage` trait, allowing you to choose the appropriate backend for your specific needs:

- **InMemoryNodeStorage**: Fast, volatile storage for development and testing
- **FileNodeStorage**: Simple file-based persistence for local applications
- **RocksDBNodeStorage**: High-performance LSM-tree storage for production workloads
- **GitNodeStorage**: Git object store integration for development (experimental)
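
To illustrate what a pluggable storage layer looks like, the sketch below defines a simplified, hypothetical trait (`SketchStorage`) with one implementation (`MapStorage`). It is not the crate's actual `NodeStorage` trait, whose methods and types may differ. It only shows how code can be written against a trait rather than a concrete backend.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a node-storage trait.
/// Nodes are opaque bytes addressed by a fixed-size content hash.
trait SketchStorage<const N: usize> {
    fn get_node(&self, hash: &[u8; N]) -> Option<Vec<u8>>;
    fn insert_node(&mut self, hash: [u8; N], node: Vec<u8>);
    fn delete_node(&mut self, hash: &[u8; N]);
}

/// One possible backend: a plain HashMap held in memory.
#[derive(Default)]
struct MapStorage<const N: usize> {
    nodes: HashMap<[u8; N], Vec<u8>>,
}

impl<const N: usize> SketchStorage<N> for MapStorage<N> {
    fn get_node(&self, hash: &[u8; N]) -> Option<Vec<u8>> {
        self.nodes.get(hash).cloned()
    }
    fn insert_node(&mut self, hash: [u8; N], node: Vec<u8>) {
        self.nodes.insert(hash, node);
    }
    fn delete_node(&mut self, hash: &[u8; N]) {
        self.nodes.remove(hash);
    }
}

/// Code written against the trait works with any backend that implements it.
fn store_and_reload<S, const N: usize>(
    storage: &mut S,
    hash: [u8; N],
    node: &[u8],
) -> Option<Vec<u8>>
where
    S: SketchStorage<N>,
{
    storage.insert_node(hash, node.to_vec());
    storage.get_node(&hash)
}
```

Swapping backends then means changing only the type passed as `S`; the calling code stays the same.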

## InMemoryNodeStorage

### Description
The in-memory storage backend keeps all ProllyTree nodes in a `HashMap` in RAM. This provides the fastest access times but offers no persistence across application restarts.

### Characteristics
- **Performance**: Fastest read/write operations
- **Persistence**: None - data is lost when the application terminates
- **Memory Usage**: Entire tree stored in RAM
- **Concurrency**: Thread-safe with internal locking
- **Storage Overhead**: Minimal (just HashMap overhead)
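
The combination of a `HashMap` and internal locking can be sketched as follows. `SharedMemStore` and `concurrent_fill` are hypothetical names for illustration, and the crate's actual locking strategy may differ. The sketch assumes an `RwLock` around the map so that cloned handles can be used from several threads.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical in-memory store: a HashMap behind an RwLock, shared via Arc
/// so that clones of the handle refer to the same underlying map.
#[derive(Clone, Default)]
struct SharedMemStore {
    nodes: Arc<RwLock<HashMap<[u8; 32], Vec<u8>>>>,
}

impl SharedMemStore {
    fn insert(&self, hash: [u8; 32], node: Vec<u8>) {
        // Writers take the exclusive lock.
        self.nodes.write().unwrap().insert(hash, node);
    }
    fn get(&self, hash: &[u8; 32]) -> Option<Vec<u8>> {
        // Readers share the lock and can proceed concurrently.
        self.nodes.read().unwrap().get(hash).cloned()
    }
    fn len(&self) -> usize {
        self.nodes.read().unwrap().len()
    }
}

/// Spawn `writers` threads that each insert one node, then wait for all of them.
fn concurrent_fill(store: &SharedMemStore, writers: u8) {
    let handles: Vec<_> = (0..writers)
        .map(|i| {
            let store = store.clone();
            thread::spawn(move || store.insert([i; 32], vec![i]))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```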

### Use Cases
- **Unit testing**: Fast test execution without I/O overhead
- **Development**: Quick prototyping and debugging
- **Caching layer**: Temporary storage for frequently accessed data
- **Small datasets**: When entire dataset fits comfortably in memory

### Usage Example
```rust
use prollytree::storage::InMemoryNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;

let storage = InMemoryNodeStorage::<32>::new();
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

// Data will be lost when `tree` goes out of scope
tree.insert(b"key".to_vec(), b"value".to_vec());
```

### Configuration
The in-memory storage is self-contained and requires no configuration. It automatically manages memory allocation and cleanup.

## FileNodeStorage

### Description
The file storage backend persists each ProllyTree node as a separate file on the filesystem using binary serialization. Configuration data is stored in separate files with a `config_` prefix.

### Characteristics
- **Performance**: Moderate - limited by filesystem I/O
- **Persistence**: Full persistence across application restarts
- **Storage Format**: Binary-serialized nodes (using bincode)
- **File Organization**: One file per node, named by hash
- **Platform Support**: Works on all platforms with filesystem access

### Use Cases
- **Local applications**: Desktop applications needing persistence
- **Development**: When you need persistence but don't want database setup
- **Small to medium datasets**: Up to thousands of nodes
- **Debugging**: Easy to inspect individual node files

### Usage Example
```rust
use prollytree::storage::FileNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;
use std::path::PathBuf;

let storage_dir = PathBuf::from("./prolly_data");
let storage = FileNodeStorage::<32>::new(storage_dir);
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

tree.insert(b"key".to_vec(), b"value".to_vec());
// Data persists in ./prolly_data/ directory
```

### File Structure
```
prolly_data/
├── a1b2c3d4e5f6... (node file - hex hash)
├── f6e5d4c3b2a1... (node file - hex hash)
├── config_tree_config (configuration file)
└── config_custom_key (custom configuration)
```
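
The layout above can be approximated with the standard library alone. The following is a minimal sketch, assuming raw bytes in place of bincode-serialized nodes. `SketchFileStore` and its methods are hypothetical and do not reflect the internals of `FileNodeStorage`.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hex-encode a hash so it can be used as a file name.
fn hex_name(hash: &[u8]) -> String {
    hash.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Hypothetical one-file-per-node store rooted at `dir`.
struct SketchFileStore {
    dir: PathBuf,
}

impl SketchFileStore {
    fn new(dir: &Path) -> io::Result<Self> {
        fs::create_dir_all(dir)?;
        Ok(Self { dir: dir.to_path_buf() })
    }

    fn node_path(&self, hash: &[u8]) -> PathBuf {
        self.dir.join(hex_name(hash))
    }

    /// Write the node bytes to a file named after the hash.
    fn insert_node(&self, hash: &[u8], bytes: &[u8]) -> io::Result<()> {
        fs::write(self.node_path(hash), bytes)
    }

    /// Read a node back; a missing file is reported as `None`, not as an error.
    fn get_node(&self, hash: &[u8]) -> io::Result<Option<Vec<u8>>> {
        match fs::read(self.node_path(hash)) {
            Ok(bytes) => Ok(Some(bytes)),
            Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
            Err(e) => Err(e),
        }
    }

    /// Configuration entries live next to the nodes under a `config_` prefix.
    fn save_config(&self, key: &str, bytes: &[u8]) -> io::Result<()> {
        fs::write(self.dir.join(format!("config_{}", key)), bytes)
    }
}
```

Because every node is a separate file, each lookup costs at least one filesystem open, which is one reason this layout slows down as the node count grows.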

### Limitations
- **Scalability**: Performance degrades with a large number of nodes
- **Atomicity**: No atomic updates across multiple nodes
- **Concurrent Access**: Not safe for concurrent writers

## RocksDBNodeStorage

### Description
RocksDB storage provides a production-ready, high-performance backend using Facebook's RocksDB LSM-tree implementation. It's optimized for ProllyTree's content-addressed, write-heavy workload patterns.

### Characteristics
- **Performance**: High throughput for both reads and writes
- **Persistence**: Durable storage with WAL (Write-Ahead Log)
- **Scalability**: Handles millions of nodes efficiently
- **Compression**: LZ4 for hot data, Zstd for cold data
- **Caching**: Multi-level caching (LRU cache + RocksDB block cache)
- **Compaction**: Background cleanup of obsolete data

### Architecture
```
Application
    ↓
LRU Cache (1000 nodes default)
    ↓
RocksDB
├── Write Buffer (128MB)
├── Block Cache (512MB)
├── Bloom Filters (10 bits/key)
└── SST Files (compressed)
```
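
The read path in the diagram (cache first, database second) can be sketched as a read-through cache. The types below (`LruCache`, `CachedStore`) are hypothetical stand-ins written with the standard library only; a `HashMap` plays the role of RocksDB. The crate's real cache implementation is likely different and more efficient.

```rust
use std::collections::{HashMap, VecDeque};

/// Tiny LRU cache; `order` holds keys from least to most recently used.
/// Assumes `capacity` is greater than zero.
struct LruCache {
    capacity: usize,
    map: HashMap<u64, Vec<u8>>,
    order: VecDeque<u64>,
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Mark `key` as the most recently used entry.
    fn touch(&mut self, key: u64) {
        self.order.retain(|k| *k != key);
        self.order.push_back(key);
    }

    fn get(&mut self, key: u64) -> Option<Vec<u8>> {
        let value = self.map.get(&key).cloned()?;
        self.touch(key);
        Some(value)
    }

    fn put(&mut self, key: u64, value: Vec<u8>) {
        // Evict the least recently used entry when inserting a new key into a full cache.
        if !self.map.contains_key(&key) && self.map.len() == self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key, value);
        self.touch(key);
    }
}

/// Read-through wrapper: check the cache first, then the slower backing store.
struct CachedStore {
    cache: LruCache,
    backing: HashMap<u64, Vec<u8>>, // stand-in for the on-disk database
    backing_reads: usize,           // counts how often the cache missed
}

impl CachedStore {
    fn get(&mut self, key: u64) -> Option<Vec<u8>> {
        if let Some(hit) = self.cache.get(key) {
            return Some(hit);
        }
        self.backing_reads += 1;
        let value = self.backing.get(&key).cloned()?;
        self.cache.put(key, value.clone());
        Some(value)
    }
}
```

Repeated reads of a hot node are served from the cache, so only the first read and reads after an eviction reach the backing store.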

### Use Cases
- **Production applications**: High-performance persistent storage
- **Large datasets**: Millions of nodes and frequent updates
- **Write-heavy workloads**: Frequent tree modifications
- **Distributed systems**: Building block for distributed storage

### Usage Example
```rust
use prollytree::storage::RocksDBNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;
use std::path::PathBuf;

// Basic usage
let db_path = PathBuf::from("./rocksdb_data");
let storage = RocksDBNodeStorage::<32>::new(db_path.clone())?;
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);

// The constructors below are alternatives to `new`, not additions to it:
// RocksDB locks the database directory, so a path can be open only once at a time.

// Custom cache size
let storage = RocksDBNodeStorage::<32>::with_cache_size(db_path.clone(), 5000)?;

// Custom RocksDB options
let mut opts = RocksDBNodeStorage::<32>::default_options();
opts.set_write_buffer_size(256 * 1024 * 1024); // 256MB
let storage = RocksDBNodeStorage::<32>::with_options(db_path, opts)?;
```

### Configuration Options

#### Default Optimizations
- **Write Buffer**: 128MB for batching writes
- **Memory Tables**: Up to 4 concurrent memtables
- **Compression**: LZ4 for L0-L2, Zstd for bottom levels
- **Block Cache**: 512MB for frequently accessed data
- **Bloom Filters**: 10 bits per key for faster lookups
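
To get a feel for what 10 bits per key buys, the textbook Bloom filter estimate can be computed directly. This is a theoretical approximation that assumes an ideal filter with the optimal number of hash functions. RocksDB's actual filter implementation differs in detail, so treat the result as a ballpark figure. The function name is made up for this sketch.

```rust
/// Approximate false-positive rate of an ideal Bloom filter with
/// `bits_per_key` bits per key, assuming the optimal number of hash
/// functions k = bits_per_key * ln 2. The rate is then 0.5^k.
fn bloom_false_positive_rate(bits_per_key: f64) -> f64 {
    let k = bits_per_key * std::f64::consts::LN_2;
    0.5f64.powf(k)
}
```

With `bits_per_key = 10.0` the estimate is a little under 1%, meaning that roughly 99 out of 100 lookups for absent keys can skip reading an SST file.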

#### Performance Tuning
```rust
use rocksdb::{Options, DBCompressionType, BlockBasedOptions, Cache};

let mut opts = Options::default();

// Increase write buffer for high write throughput
opts.set_write_buffer_size(256 * 1024 * 1024);

// More aggressive compression for storage efficiency
opts.set_compression_type(DBCompressionType::Zstd);

// Larger block cache for read-heavy workloads
let cache = Cache::new_lru_cache(1024 * 1024 * 1024); // 1GB
let mut block_opts = BlockBasedOptions::default();
block_opts.set_block_cache(&cache);
opts.set_block_based_table_factory(&block_opts);

// Pass the tuned options to the backend via `RocksDBNodeStorage::with_options`.
```

### Batch Operations
RocksDB storage supports efficient batch operations:

```rust
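// `hash1`..`hash3` and `node1`..`node3` stand for hashes and nodes built elsewhere.
// The hashes are cloned so they can be reused for the delete call below.
let nodes = vec![
    (hash1.clone(), node1),
    (hash2.clone(), node2),
    (hash3.clone(), node3),
];

// Atomic batch insert
storage.batch_insert_nodes(nodes)?;

// Atomic batch delete
storage.batch_delete_nodes(&[hash1, hash2])?;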
```

### Monitoring and Maintenance
- **Statistics**: RocksDB provides detailed performance metrics
- **Compaction**: Automatic background compaction
- **Backup**: Use RocksDB backup utilities for data safety
- **Tuning**: Monitor write amplification and adjust settings

## GitNodeStorage

### Description
The Git storage backend stores ProllyTree nodes as Git blob objects in a Git repository. This experimental backend is designed for development workflows where you want to leverage Git's content-addressable storage.

### ⚠️ Important Limitations

**Development Use Only**: GitNodeStorage should only be used for local development and experimentation. It is not suitable for production use due to several important limitations:

1. **Dangling Objects**: ProllyTree nodes are stored as Git blob objects but are **not committed** to any branch or tag. These objects exist as "dangling" or "unreachable" objects in Git's object database.

2. **Garbage Collection Risk**: Git's garbage collector (`git gc`) will **delete these dangling objects** once they are older than the prune grace period (`gc.pruneExpire`, two weeks by default). This can happen:
   - When running `git gc` or `git prune` manually
   - When automatic garbage collection (`git gc --auto`) is triggered by routine commands such as commit, merge, rebase, or fetch

3. **Data Loss**: Since the objects are not referenced by any commit, branch, or tag, they will be permanently lost when garbage collected. There is no recovery mechanism.

### Characteristics
- **Storage Format**: Git blob objects (binary serialized nodes)
- **Content Addressing**: Leverages Git's SHA-1 content addressing
- **Persistence**: Temporary - objects can be garbage collected
- **Integration**: Works with existing Git repositories
- **Caching**: LRU cache for performance
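
Git's content addressing is simple enough to reproduce by hand: a blob's object ID is the SHA-1 of the header `blob <length>\0` followed by the content. The sketch below assumes a repository using the default SHA-1 object format and includes a small SHA-1 implementation only to stay dependency-free. `git_blob_id` is a hypothetical helper, not part of the crate, and says nothing about how `GitNodeStorage` serializes nodes before writing them.

```rust
/// Minimal SHA-1 (FIPS 180-4), included only to keep the sketch dependency-free.
fn sha1(data: &[u8]) -> [u8; 20] {
    let mut h: [u32; 5] = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0];

    // Padding: 0x80, zeros up to 56 mod 64, then the bit length as a big-endian u64.
    let mut msg = data.to_vec();
    msg.push(0x80);
    while msg.len() % 64 != 56 {
        msg.push(0);
    }
    msg.extend_from_slice(&((data.len() as u64) * 8).to_be_bytes());

    for chunk in msg.chunks(64) {
        // Expand the 16 input words into the 80-word message schedule.
        let mut w = [0u32; 80];
        for i in 0..16 {
            w[i] = u32::from_be_bytes([
                chunk[4 * i],
                chunk[4 * i + 1],
                chunk[4 * i + 2],
                chunk[4 * i + 3],
            ]);
        }
        for i in 16..80 {
            w[i] = (w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]).rotate_left(1);
        }

        let (mut a, mut b, mut c, mut d, mut e) = (h[0], h[1], h[2], h[3], h[4]);
        for (i, wi) in w.iter().enumerate() {
            let (f, k): (u32, u32) = match i {
                0..=19 => ((b & c) | (!b & d), 0x5A827999),
                20..=39 => (b ^ c ^ d, 0x6ED9EBA1),
                40..=59 => ((b & c) | (b & d) | (c & d), 0x8F1BBCDC),
                _ => (b ^ c ^ d, 0xCA62C1D6),
            };
            let t = a
                .rotate_left(5)
                .wrapping_add(f)
                .wrapping_add(e)
                .wrapping_add(k)
                .wrapping_add(*wi);
            e = d;
            d = c;
            c = b.rotate_left(30);
            b = a;
            a = t;
        }
        h[0] = h[0].wrapping_add(a);
        h[1] = h[1].wrapping_add(b);
        h[2] = h[2].wrapping_add(c);
        h[3] = h[3].wrapping_add(d);
        h[4] = h[4].wrapping_add(e);
    }

    let mut out = [0u8; 20];
    for (i, word) in h.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&word.to_be_bytes());
    }
    out
}

/// Object ID Git assigns to a blob: SHA-1 over "blob <len>\0" followed by the content.
fn git_blob_id(content: &[u8]) -> String {
    let mut object = format!("blob {}\0", content.len()).into_bytes();
    object.extend_from_slice(content);
    sha1(&object).iter().map(|b| format!("{:02x}", b)).collect()
}
```

The result matches `git hash-object`: empty content yields `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`. Identical bytes always map to the same object ID, which is why storing the same node twice costs no extra space.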

### Use Cases (Development Only)
- **Git Integration Experiments**: Testing Git-based storage concepts
- **Development Workflows**: Temporary storage during development
- **Learning**: Understanding content-addressable storage
- **Prototyping**: Rapid prototyping with Git infrastructure

### Usage Example
```rust
// Only available with the "git" feature enabled
#[cfg(feature = "git")]
use prollytree::git::GitNodeStorage;
use prollytree::tree::{ProllyTree, Tree};
use prollytree::config::TreeConfig;

let repo = gix::open(".")?;
let dataset_dir = std::path::PathBuf::from("./git_data");
let storage = GitNodeStorage::<32>::new(repo, dataset_dir)?;

// ⚠️ WARNING: Data may be lost during git gc!
let config = TreeConfig::<32>::default();
let mut tree = ProllyTree::new(storage, config);
tree.insert(b"key".to_vec(), b"value".to_vec());
```

### Data Safety Measures

If you must use GitNodeStorage for development, consider these safety measures:

1. **Disable Automatic GC**:
   ```bash
   git config gc.auto 0
   git config gc.autopacklimit 0
   ```

2. **Create Temporary Commits** (advanced):
   ```bash
   # Periodically commit to preserve objects.
   # Note: a commit only protects blobs that are reachable from its tree.
   git add -A
   git commit -m "temp: preserve prolly objects"
   ```

3. **Use Separate Repository**:
   Create a dedicated Git repository just for ProllyTree storage to avoid conflicts.

### Architecture
```
ProllyTree Node
    ↓
Bincode Serialization
    ↓
Git Blob Object (dangling)
    ↓
Git Object Database
    ↓
⚠️ git gc → Deletion
```

## Storage Backend Comparison

| Feature | InMemory | File | RocksDB | Git |
|---------|----------|------|---------|-----|
| **Persistence** | None | Full | Full | Temporary⚠️ |
| **Performance** | Fastest | Moderate | High | Moderate |
| **Scalability** | RAM-limited | Poor | Excellent | Poor |
| **Setup Complexity** | None | None | Low | Medium |
| **Production Ready** | No | Limited | Yes | No⚠️ |
| **Concurrent Access** | Limited | No | Yes | Limited |
| **Storage Overhead** | Minimal | High | Low | Medium |
| **Backup/Recovery** | N/A | File copy | RocksDB tools | Git tools |

## Choosing the Right Backend

### Development & Testing
- **Unit Tests**: InMemoryNodeStorage
- **Integration Tests**: FileNodeStorage or InMemoryNodeStorage
- **Local Development**: FileNodeStorage or RocksDBNodeStorage

### Production Deployments
- **Small Applications**: FileNodeStorage (with careful consideration)
- **High-Performance Applications**: RocksDBNodeStorage
- **Distributed Systems**: RocksDBNodeStorage as foundation

### Experimental
- **Git Integration Research**: GitNodeStorage (development only)

## Performance Benchmarks

Run the storage comparison benchmarks to understand performance characteristics:

```bash
# Compare all available backends
cargo bench --bench storage_bench --features rocksdb_storage

# Run specific benchmark
cargo bench --bench storage_bench storage_insert
```

## Migration Between Backends

Currently, there's no built-in migration tool between storage backends. To migrate:

1. **Export Data**: Iterate through the old storage and collect all key-value pairs
2. **Create New Storage**: Initialize the target storage backend
3. **Import Data**: Insert all data into the new storage
4. **Validate**: Verify data integrity after migration

Example migration pattern:
```rust
// Export from old storage
let old_tree = ProllyTree::load_from_storage(old_storage, config.clone())?;
let mut data = Vec::new();
// ... collect all key-value pairs

// Import to new storage
let mut new_tree = ProllyTree::new(new_storage, config);
for (key, value) in data {
    new_tree.insert(key, value);
}
```
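
The four steps, including the validation pass that the pattern above leaves out, can be sketched against a small hypothetical interface. `KvTree`, `MapTree`, and `migrate` are illustrative names, and the real `ProllyTree` API may expose iteration and lookup differently. The point is the export, import, and verify sequence.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal interface covering what a migration needs.
trait KvTree {
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn find(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn entries(&self) -> Vec<(Vec<u8>, Vec<u8>)>;
}

/// Stand-in tree used only to make the sketch runnable.
#[derive(Default)]
struct MapTree(BTreeMap<Vec<u8>, Vec<u8>>);

impl KvTree for MapTree {
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) {
        self.0.insert(key, value);
    }
    fn find(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
    fn entries(&self) -> Vec<(Vec<u8>, Vec<u8>)> {
        self.0.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
    }
}

/// Copy every pair from `old` into `new`, then verify each one.
/// Returns the number of migrated pairs, or the first key that failed validation.
fn migrate<A: KvTree, B: KvTree>(old: &A, new: &mut B) -> Result<usize, Vec<u8>> {
    // Step 1: export all key-value pairs from the old tree.
    let data = old.entries();

    // Step 3: import into the new tree (step 2, creating it, is done by the caller).
    for (key, value) in &data {
        new.insert(key.clone(), value.clone());
    }

    // Step 4: validate that every pair reads back unchanged.
    for (key, value) in &data {
        if new.find(key).as_ref() != Some(value) {
            return Err(key.clone());
        }
    }
    Ok(data.len())
}
```

For large datasets, collecting every pair into memory may not be practical. Streaming the export in chunks would follow the same shape.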

## Best Practices

### General
- Choose the simplest backend that meets your requirements
- Always benchmark with your specific data patterns
- Consider backup and recovery procedures
- Plan for data growth and scaling needs

### InMemoryNodeStorage
- Monitor memory usage to prevent OOM conditions
- Use for temporary data only
- Consider data loss implications

### FileNodeStorage
- Ensure adequate disk space and I/O performance
- Implement application-level locking for concurrent access
- Perform regular filesystem maintenance and monitoring
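
One simple way to approach application-level locking is a lock file created with `create_new`, which fails if the file already exists. The `DirLock` type and the `.lock` file name below are assumptions made for this sketch and are not provided by the crate. The lock is advisory: it only protects against writers that follow the same convention.

```rust
use std::fs::{self, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical advisory lock: holds `<dir>/.lock` for as long as the guard lives.
struct DirLock {
    path: PathBuf,
}

impl DirLock {
    /// Fails with `ErrorKind::AlreadyExists` if another writer holds the lock.
    fn acquire(dir: &Path) -> io::Result<Self> {
        let path = dir.join(".lock");
        // `create_new` is atomic: it succeeds only if the file does not exist yet.
        OpenOptions::new().write(true).create_new(true).open(&path)?;
        Ok(Self { path })
    }
}

impl Drop for DirLock {
    fn drop(&mut self) {
        // Best effort: a crashed process leaves a stale lock file behind,
        // which then has to be removed manually.
        let _ = fs::remove_file(&self.path);
    }
}
```

A writer would acquire the guard before opening the storage directory and keep it alive until all writes are finished.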

### RocksDBNodeStorage
- Monitor RocksDB metrics for performance tuning
- Configure appropriate cache sizes for your workload
- Plan for disk space and compaction overhead
- Use batch operations for bulk updates

### GitNodeStorage
- **Never use in production**
- Disable automatic garbage collection during development
- Use dedicated Git repositories
- Regularly back up important data to commits
- Understand that data can be lost without warning

## Troubleshooting

### Common Issues

#### OutOfMemory with InMemoryNodeStorage
- Reduce dataset size or switch to persistent storage
- Monitor the process's heap usage and reduce the amount of data held in memory

#### Poor Performance with FileNodeStorage
- Check filesystem performance and available disk space
- Consider switching to RocksDBNodeStorage for better performance
- Reduce concurrent access patterns

#### RocksDB Compilation Issues
- Ensure proper build tools (cmake, C++ compiler)
- Check RocksDB system dependencies
- Use pre-built binaries if available

#### Git Storage Data Loss
- This is expected behavior - objects are not committed
- Disable garbage collection or switch to persistent storage
- Create periodic commits to preserve important data

For additional help, consult the project documentation or open an issue on the GitHub repository.