vectorlite 0.1.5

A high-performance, in-memory vector database optimized for AI agent workloads
## 📋 Persistence API Implementation Plan

Based on my analysis of your codebase, here's a comprehensive plan for implementing data persistence:

### **Current State Analysis**
- ✅ Both `FlatIndex` and `HNSWIndex` already have `Serialize`/`Deserialize` implementations
- ✅ The `VectorIndexWrapper` enum has serialization support
- ✅ The `Vector` and `SearchResult` structs are serializable
- ❌ No persistence API exists; everything is currently in-memory
- ❌ `VectorLiteClient` and `Collection` lack persistence methods

### **Key Design Decisions**

#### **1. What to Save: Collections vs Complete Database**
I recommend implementing **both levels** of persistence:

**Collection-Level Persistence** (Individual Collections):
- Save/load individual collections independently
- Useful for selective backup/restore
- Enables sharing specific collections
- Lower memory footprint for partial operations

**Database-Level Persistence** (Entire VectorLiteClient):
- Save/load all collections at once
- Maintain consistency across collections
- Simpler for full backup/restore scenarios
- Preserve global state (e.g., embedding function configuration)

#### **2. Loading Strategy: Merge vs Override**
Implement **both strategies** with clear API distinction:

**Override Mode (Default)**:
- Clear existing data before loading
- Simpler mental model
- Prevents ID conflicts
- Use case: Restoring from backup

**Merge Mode (Optional)**:
- Append loaded data to existing collections
- Handle ID conflicts by remapping
- Preserve existing data
- Use case: Combining datasets
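The merge path hinges on detecting conflicting IDs and remapping them. A std-only sketch of that step, where `remap_conflicting_ids` and the plain ID sets are hypothetical stand-ins for the real index structures:

```rust
use std::collections::{HashMap, HashSet};

// Merge incoming vector IDs into an existing set, remapping conflicts to
// fresh IDs. Returns the old_id -> new_id table for IDs that changed.
fn remap_conflicting_ids(existing: &mut HashSet<u64>, incoming: &[u64]) -> HashMap<u64, u64> {
    let mut next_free = existing.iter().copied().max().map_or(0, |m| m + 1);
    let mut remappings = HashMap::new();
    for &id in incoming {
        if existing.contains(&id) {
            // Skip over any IDs that are already taken.
            while existing.contains(&next_free) {
                next_free += 1;
            }
            remappings.insert(id, next_free);
            existing.insert(next_free);
            next_free += 1;
        } else {
            existing.insert(id);
        }
    }
    remappings
}
```

In the real implementation, the returned table would also drive updating every reference inside the index before it is surfaced to the caller as part of `MergeStats`.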

### **Proposed API Design**

```rust
// Collection-level persistence
impl Collection {
    pub fn save_to_file(&self, path: &Path) -> Result<(), PersistenceError>;
    pub fn load_from_file(path: &Path) -> Result<Self, PersistenceError>;
    pub fn merge_from_file(&mut self, path: &Path) -> Result<MergeStats, PersistenceError>;
}

// Client-level persistence  
impl VectorLiteClient {
    pub fn save_to_directory(&self, dir: &Path) -> Result<(), PersistenceError>;
    pub fn load_from_directory(dir: &Path) -> Result<Self, PersistenceError>;
    pub fn merge_from_directory(&mut self, dir: &Path) -> Result<MergeStats, PersistenceError>;
    
    // Collection-specific operations
    pub fn save_collection(&self, name: &str, path: &Path) -> Result<(), PersistenceError>;
    pub fn load_collection(&mut self, path: &Path, mode: LoadMode) -> Result<String, PersistenceError>;
}

#[derive(Debug)]
pub enum LoadMode {
    Override,    // Replace existing collection
    Merge,       // Merge with existing collection
    CreateNew(String), // Load with new name
}

pub struct MergeStats {
    pub vectors_added: usize,
    pub vectors_skipped: usize,
    pub id_remappings: HashMap<u64, u64>,
}
```

### **File Format Design**

#### Directory Structure (for full database):
```
vectorlite_db/
├── manifest.json          # Database metadata & version
├── collections/
│   ├── documents.vlc      # Collection file
│   ├── embeddings.vlc     # Collection file
│   └── ...
└── config.json            # Optional: embedding model config
```
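Loading a full database from this layout starts by enumerating the collection files. A small sketch using only `std`; the `list_collection_files` helper is illustrative, not part of the proposed API:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Enumerate the collection files (*.vlc) under `<dir>/collections/`,
// mirroring the proposed on-disk layout.
fn list_collection_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir.join("collections"))? {
        let path = entry?.path();
        if path.extension().map_or(false, |ext| ext == "vlc") {
            files.push(path);
        }
    }
    files.sort(); // deterministic load order
    Ok(files)
}
```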

#### Collection File Format (.vlc):
```json
{
  "version": "1.0.0",
  "metadata": {
    "name": "documents",
    "created_at": "2025-10-21T10:00:00Z",
    "vector_count": 1000,
    "dimension": 768,
    "index_type": "HNSW"
  },
  "index": {
    // Serialized VectorIndexWrapper
  }
}
```

**Note**: `next_id` is recovered on load as the maximum vector ID in the index plus one, so it doesn't need to be stored in the file format. This keeps the persisted data self-consistent and prevents a stored counter from drifting out of sync with the index.
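That recovery step is a one-liner over the index's IDs; a sketch, where `recover_next_id` is a hypothetical helper:

```rust
// Recover `next_id` from the IDs already present in a loaded index:
// one past the maximum, or 0 for an empty index.
fn recover_next_id(ids: impl Iterator<Item = u64>) -> u64 {
    ids.max().map_or(0, |max| max + 1)
}
```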

### **Implementation Steps**

#### **Phase 1: Core Persistence Module**
```rust
// src/persistence.rs
pub mod persistence {
    use serde::{Serialize, Deserialize};
    use serde::de::DeserializeOwned;
    use chrono::{DateTime, Utc};
    use std::path::Path;
    use std::fs;
    
    #[derive(Debug, Serialize, Deserialize)]
    pub struct FileHeader {
        pub version: String,
        pub format: String,
        pub created_at: DateTime<Utc>,
    }
    
    #[derive(Debug, Serialize, Deserialize)]
    pub struct CollectionData {
        pub header: FileHeader,
        pub metadata: CollectionMetadata,
        pub index: VectorIndexWrapper,
    }
    
    pub fn save_with_compression<T: Serialize>(
        data: &T, 
        path: &Path, 
        compress: bool
    ) -> Result<(), PersistenceError>;
    
    pub fn load_with_validation<T: DeserializeOwned>(
        path: &Path
    ) -> Result<T, PersistenceError>;
}
```

#### **Phase 2: Collection Persistence**
- Implement serialization for Collection struct
- Handle thread-safe access during save/load
- Implement ID remapping for merge operations

#### **Phase 3: Client Persistence**
- Save all collections to directory structure
- Implement atomic operations (temp files → rename)
- Add progress callbacks for large operations
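The temp-file-then-rename pattern from Phase 3 can be sketched with `std` alone; `atomic_write` is illustrative, and cross-filesystem renames plus directory fsync are left out:

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

// Write `bytes` to `path` atomically: write a sibling temp file first, then
// rename over the target, so readers never observe a half-written file.
fn atomic_write(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    {
        let mut file = fs::File::create(&tmp)?;
        file.write_all(bytes)?;
        file.sync_all()?; // flush to disk before the rename makes it visible
    }
    fs::rename(&tmp, path) // atomic on POSIX when on the same filesystem
}
```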

#### **Phase 4: Advanced Features**
- Compression support (gzip/zstd)
- Incremental saves (only modified collections)
- Format migration utilities
- Concurrent collection loading

### **Handling Edge Cases**

1. **ID Conflicts During Merge**:
   - Generate mapping table: old_id → new_id
   - Update all references in the index
   - Return mapping to caller

2. **Version Compatibility**:
   - Embed version in file format
   - Implement migration functions
   - Fail gracefully on incompatible versions

3. **Corrupted Files**:
   - Validate checksums
   - Atomic writes with temp files
   - Backup before overwrite option

4. **Large Collections**:
   - Stream processing for memory efficiency
   - Optional compression
   - Progress reporting
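For checksum validation, even a simple hash stored next to the payload catches truncation and bit rot. A std-only FNV-1a sketch; a production format would more likely use CRC32 or xxHash via a crate:

```rust
// FNV-1a over the payload bytes; stored alongside the payload and re-checked
// on load to detect corruption.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

fn verify_payload(payload: &[u8], stored_hash: u64) -> bool {
    fnv1a(payload) == stored_hash
}
```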

### **Testing Strategy**

```rust
#[cfg(test)]
mod persistence_tests {
    #[test]
    fn test_collection_roundtrip() { /* ... */ }
    
    #[test]
    fn test_merge_with_id_conflicts() { /* ... */ }
    
    #[test]
    fn test_version_migration() { /* ... */ }
    
    #[test]
    fn test_concurrent_save_load() { /* ... */ }
    
    #[test]
    fn test_corruption_recovery() { /* ... */ }
}
```
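The roundtrip test reduces to serialize, write, read back, compare. A std-only stand-in for the encode/decode halves (the real test would go through serde and the `.vlc` format):

```rust
// Encode/decode a vector as raw little-endian f32 bytes; a minimal stand-in
// for the serialization half of the roundtrip test.
fn encode(values: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(values.len() * 4);
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

fn decode(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```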

### **Usage Examples**

```rust
// Save entire database
client.save_to_directory(Path::new("./my_vectors"))?;

// Load entire database
let client = VectorLiteClient::load_from_directory(Path::new("./my_vectors"))?;

// Save specific collection
client.save_collection("documents", Path::new("./docs.vlc"))?;

// Load and merge collection
client.load_collection(
    Path::new("./external_docs.vlc"), 
    LoadMode::Merge
)?;

// Override existing collection
client.load_collection(
    Path::new("./backup.vlc"),
    LoadMode::Override  
)?;
```

### **Recommended Implementation Order**

1. **Start with Collection-level persistence** (simpler, self-contained)
2. **Add basic save/load with override mode**
3. **Implement merge functionality with ID remapping**
4. **Add Client-level persistence** 
5. **Implement versioning and migration**
6. **Add compression and optimization features**

This plan provides flexibility for both simple use cases (full backup/restore) and advanced scenarios (selective collection management, dataset merging). The design is extensible for future features like encryption, cloud storage backends, or streaming operations.
