shardex 0.1.0

A high-performance memory-mapped vector search engine with ACID transactions and incremental updates
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
# Shardex Implementation Plan

## Project Overview
Shardex is a memory mapped vector search engine implemented in Rust. It supports incremental updating of index entries and document postings which are kept in parallel to the vectors. Indexes are arranged in shards, each of which consists of two memory mapped data structures: embedding vectors and postings. An index will have a dynamic number of shards. When an index starts, it has one shard. The main interface to use a Shardex is async. Each index can only be used by a single process.

## Architecture

### Core Components

#### Shards
- Each shard consists of two memory mapped data structures: embedding vectors and postings
- Each shard has a 128-bit identifier (ULID) for direct identification
- Configurable maximum number of entries (vectors paired with postings)
- Embedding vector size configured at creation time
- Represented by a centroid of the vectors of non-deleted postings
- Smart enough to only consider postings that have been made (may not be full)

#### Postings
- 128-bit user selected identifier (UUID or ULID typically)
- 32-bit starting location
- 32-bit length  
- Fixed size vector
- Deleted bit in metadata
- Can be Added or Removed
- Multiple postings per document ID supported

#### Write-Ahead Log (WAL)
- Transactional system with memory mapped log
- WAL is memory mapped, so it will need to be in fixed size segments
- Add/Remove postings stored in WAL first
- Entries copied from WAL to shard with nearest centroid
- May trigger shard splits
- WAL pointer advanced after successful flush
- Bloom filters included in WAL transactions
- Batch writing every n milliseconds for performance

#### Bloom Filter Index
- Memory mapped bloom filter index integrated into centroid index
- Stored as parallel vector to centroids and shard metadata
- Allows finding candidate shards given a document identifier
- Enables efficient deletion of all postings for a document

#### Shardex (The In-Memory Index)
- This in-memory index IS the Shardex - it keeps track of the shards
- Memory mapped so it can load directly into memory on startup
- Maintained as copy-on-write memory map for efficient updates
- Allows chunk-based resizing
- Parallel memory mapped structure with centroid vector parallel indexed to shard metadata vector
- Each shard identified by 128-bit ULID for direct shard access
- Contains memory mapped bloom filter based on document identifier kept in parallel vector to centroids and metadata
- When the index grows large enough to require expanding (from shard splits), a new segment is created

#### Search System
- K nearest neighbors search
- Uses the Shardex (in-memory index) to determine nearest shards
- Finding of near shard centroids can be done in parallel across segments
- Configurable 'slop' factor for number of near shards to scan
- Zero-copy operations
- Duplicate elimination in search results

### Data Flow

#### Write Operations
1. Postings added to WAL
2. Batch processing every n milliseconds
3. Copy from WAL to appropriate shard (nearest centroid)
4. Shard split if full (clustering into two half-full shards)
5. During split, check if each posting should remain in its new sub-shard or be reassigned to another existing shard with closer centroid
6. Handle cascading splits if reassigned postings cause other shards to overflow
7. Flush shards to disk
8. Advance WAL pointer
9. Update bloom filters

#### Read Operations
- Non-blocking (readers don't block batch writer)
- Search via vector similarity using shard centroids
- Return postings for K nearest neighbors
- Eliminate duplicates from results

#### Recovery
- On startup, recover from WAL
- Prone to duplicate postings
- Duplicate elimination during search and shard splits



## Technical Requirements

### Performance
- Zero-copy search operations
- Non-blocking reads during writes
- Batch processing for write optimization
- Memory mapped file access throughout

### Reliability
- ACID transactions via WAL
- Recovery from unexpected shutdowns
- Duplicate handling and elimination
- Data integrity verification

### Scalability
- Dynamic shard management
- Configurable shard sizes
- Efficient bloom filter lookups
- Optimized clustering algorithms

## Testing Strategy

### Unit Tests
- Individual component testing using TestEnvironment RAII blocks
- Each test uses isolated temporary directory for parallel execution
- #[serial] is strictly forbidden - all tests must run in parallel
- Memory mapping validation
- Vector operation correctness
- WAL transaction integrity

### Integration Tests
- End-to-end write/read cycles using TestEnvironment RAII blocks
- Shard splitting scenarios in isolated temporary directories
- Recovery testing with per-test environments
- Concurrent access patterns within single process

### Performance Tests
- Large dataset handling with TestEnvironment isolation
- Search performance benchmarks in temporary directories
- Memory usage optimization testing
- Load testing within single process constraints

## API Design & Interface

### Core Trait
```rust
#[async_trait]
pub trait Shardex {
    type Error;
    
    /// Create a new index with the given configuration
    async fn create(config: ShardexConfig) -> Result<Self, Self::Error>
    where
        Self: Sized;
    
    /// Open an existing index (configuration is read from metadata)
    async fn open<P: AsRef<Path>>(directory_path: P) -> Result<Self, Self::Error>
    where
        Self: Sized;
    
    /// Add postings to the index (batch operation only)
    async fn add_postings(&mut self, postings: Vec<Posting>) -> Result<(), Self::Error>;
    
    /// Remove all postings for documents (batch operation only)
    async fn remove_documents(&mut self, document_ids: Vec<u128>) -> Result<(), Self::Error>;
    
    /// Search for K nearest neighbors
    async fn search(
        &self,
        query_vector: &[f32],
        k: usize,
        slop_factor: Option<usize>
    ) -> Result<Vec<SearchResult>, Self::Error>;
    
    /// Flush pending operations
    async fn flush(&mut self) -> Result<(), Self::Error>;
    
    /// Get index statistics
    async fn stats(&self) -> Result<IndexStats, Self::Error>;
}

pub struct Posting {
    pub document_id: u128,
    pub start: u32,
    pub length: u32,
    pub vector: Vec<f32>,
}

pub struct SearchResult {
    pub document_id: u128,
    pub start: u32,
    pub length: u32,
    pub vector: Vec<f32>,
    pub similarity_score: f32,
}

pub struct IndexStats {
    pub total_shards: usize,
    pub total_postings: usize,
    pub pending_operations: usize,
    pub memory_usage: usize,
}
```

### Error Types
```rust
#[derive(Debug, Error)]
pub enum ShardexError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid vector dimension: expected {expected}, got {actual}")]
    InvalidDimension { expected: usize, actual: usize },
    #[error("Index corruption detected: {0}")]
    Corruption(String),
    #[error("Configuration error: {0}")]
    Config(String),
}
```

### Configuration API
```rust
pub struct ShardexConfig {
    pub directory_path: PathBuf,
    pub vector_size: usize,
    pub shard_size: usize,
    pub shardex_segment_size: usize,
    pub wal_segment_size: usize,
    pub batch_write_interval_ms: u64,
    pub default_slop_factor: usize,
    pub bloom_filter_size: usize,
}

impl Default for ShardexConfig {
    fn default() -> Self {
        Self {
            directory_path: PathBuf::from("./shardex_index"),
            vector_size: 384,
            shard_size: 10000,
            shardex_segment_size: 1000,
            wal_segment_size: 1024 * 1024, // 1MB
            batch_write_interval_ms: 100,
            default_slop_factor: 3,
            bloom_filter_size: 1024,
        }
    }
}

impl ShardexConfig {
    pub fn new() -> Self {
        Self::default()
    }
    
    pub fn directory_path<P: Into<PathBuf>>(mut self, path: P) -> Self {
        self.directory_path = path.into();
        self
    }
    
    pub fn vector_size(mut self, size: usize) -> Self {
        self.vector_size = size;
        self
    }
    
    pub fn shard_size(mut self, size: usize) -> Self {
        self.shard_size = size;
        self
    }
    
    pub fn shardex_segment_size(mut self, size: usize) -> Self {
        self.shardex_segment_size = size;
        self
    }
    
    pub fn wal_segment_size(mut self, size: usize) -> Self {
        self.wal_segment_size = size;
        self
    }
    
    pub fn batch_write_interval_ms(mut self, ms: u64) -> Self {
        self.batch_write_interval_ms = ms;
        self
    }
    
    pub fn default_slop_factor(mut self, factor: usize) -> Self {
        self.default_slop_factor = factor;
        self
    }
    
    pub fn bloom_filter_size(mut self, size: usize) -> Self {
        self.bloom_filter_size = size;
        self
    }
}
```

### Create vs Open Behavior
- `Shardex::create(config)`: Creates new index with specified configuration
  - If index already exists with matching configuration, effectively acts as `open`
  - If index exists with different configuration, returns error
- `Shardex::open(path)`: Opens existing index, reads configuration from metadata
  - Returns error if index doesn't exist

## Persistence & Storage

### File Layout
```
shardex_index/
├── shardex.meta          # Index metadata and configuration
├── centroids/            # Shardex segments (centroids + metadata + bloom filters)
│   ├── segment_000001.shx
│   ├── segment_000002.shx
│   └── ...
├── shards/              # Individual shard data
│   ├── {shard_ulid}.vectors
│   ├── {shard_ulid}.postings
│   └── ...
└── wal/                 # Write-ahead log segments
    ├── wal_000001.log
    ├── wal_000002.log
    └── ...
```

### Storage Format
- All files are fixed size given that they are memory mapped (shards, Shardex segments, and WAL)
- Direct representation memory maps - no serializing or deserializing
- Structures mapped directly from file bytes to memory
- Fixed-size layouts for direct pointer arithmetic
- Little-endian byte ordering for cross-platform compatibility
- Minimal headers with magic bytes and version information
- Data integrity via checksums but no complex serialization overhead

### Backup Procedures
- Copy-on-write enables consistent snapshots
- Incremental backup via WAL segment tracking
- Metadata backup for recovery coordination

## Monitoring & Observability

### Metrics Collection
- Search latency percentiles (p50, p95, p99)
- Throughput (operations/second)
- Shard distribution and balance
- Memory usage per component
- WAL segment utilization
- Bloom filter hit rates

### Logging Strategy
- Structured logging with trace/debug/info/warn/error levels
- Performance-sensitive paths use conditional compilation
- Async logging to prevent blocking operations
- Log rotation and retention policies

### Health Checks
- Index consistency verification
- Memory usage thresholds
- Disk space monitoring
- WAL replay capability

### Diagnostics
- Shard statistics and distribution analysis
- Vector quality metrics (centroid stability)
- Performance bottleneck identification

## Resource Management

### Memory Usage Bounds
- Configurable limits on in-memory structures
- Memory-mapped file size limits
- WAL buffer size constraints
- Search result set size limits

### File Descriptor Management
- Pool management for shard files
- Lazy loading and unloading strategies
- Operating system limits consideration
- Resource cleanup on shutdown

### Cleanup Procedures
- Automated garbage collection of deleted segments
- WAL segment archival and deletion
- Temporary file cleanup on crash recovery
- Memory deallocation strategies

## Edge Cases & Error Scenarios

### Corruption Detection
- Checksums on all data structures
- Magic number validation
- Structural integrity checks
- Cross-reference validation between components

### Recovery Procedures
- WAL replay with idempotency
- Partial corruption recovery strategies
- Fallback to previous consistent state
- Manual recovery tools and procedures

### Resource Exhaustion
- Out of disk space handling
- Memory pressure response
- File descriptor limit mitigation
- Network timeout handling

### Input Validation
- Vector dimension consistency checks
- Document ID format validation
- Parameter range validation
- Malformed data rejection

## Performance Tuning

### Benchmarking Methodology
- Synthetic workload generation
- Real-world usage pattern simulation
- Performance regression testing
- Memory usage profiling

### Optimization Strategies
- Memory layout optimization (cache-friendly access)
- SIMD vector operations where applicable
- Prefetching strategies for memory-mapped files
- Batch operation optimization

### Tuning Parameters
- Shard size vs search performance trade-offs
- Bloom filter false positive rate tuning
- WAL batch size optimization
- Centroid update frequency

## Deployment Considerations

### Single Process Design
- Each index can only be used by a single process
- File locking to prevent multiple process access
- Process crash recovery for orphaned locks
- No shared memory coordination needed

### Hot Backup Procedures
- Online backup without service interruption
- Consistent snapshot creation
- Backup validation and testing
- Restore time optimization

### Version Migration
- Backward compatibility strategies
- Data format migration tools
- Rolling upgrade procedures
- Rollback capabilities

### Production Readiness
- Load testing procedures
- Capacity planning guidelines
- Monitoring and alerting setup
- Operational runbooks

## Configuration Parameters

### Shard Configuration
- Maximum entries per shard
- Embedding vector dimensions
- Splitting threshold

### Performance Tuning
- Batch write interval (milliseconds)
- Search slop factor
- Bloom filter size parameters

### System Parameters
- WAL size limits
- Recovery timeout settings
- Memory mapping options

---
*Last updated: 2025-08-22*