shardex 0.1.0

A high-performance memory-mapped vector search engine with ACID transactions and incremental updates
# Migrating to Document Text Storage

This guide helps existing Shardex users migrate to document text storage functionality introduced in recent versions.

## Compatibility Overview

Document text storage is designed with **full backward compatibility**:

- ✅ Existing indexes work unchanged without text storage
- ✅ No breaking changes to existing APIs
- ✅ Text storage is opt-in via configuration
- ✅ All existing code continues to function exactly as before

## Quick Start: Enabling Text Storage

### For New Projects

Simply add `max_document_text_size` to your configuration:

```rust
let config = ShardexConfig::new()
    .directory_path("./my_index")
    .vector_size(128)
    .max_document_text_size(10 * 1024 * 1024); // Enable with 10MB limit

let mut index = ShardexImpl::create(config).await?;
```

### For Existing Projects

Update your existing configuration; apart from this one builder call, no other code changes are required:

```rust
// Your existing code works unchanged
let config = ShardexConfig::new()
    .directory_path("./existing_index")
    .vector_size(384)
    .shard_size(50000)
    // Add this line to enable text storage
    .max_document_text_size(5 * 1024 * 1024); // 5MB per document

// All your existing operations continue to work
index.add_postings(postings).await?;
let results = index.search(&query, 10, None).await?;
```

## Migration Strategies

### Strategy 1: Immediate Full Migration

**Best for**: New features, clean rewrites, or when you can update all code at once.

Replace existing document operations with atomic replacement:

```rust
// OLD approach (still works, but not atomic)
index.remove_documents(vec![doc_id]).await?;
index.add_postings(new_postings).await?;

// NEW approach (atomic operation with text storage)
index.replace_document_with_postings(doc_id, text, new_postings).await?;
```

**Benefits:**
- Atomic operations ensure data consistency
- Enables rich text extraction features
- Cleaner code with fewer operations

**Migration steps:**
1. Update your configuration to enable text storage
2. Replace `remove_documents` + `add_postings` with `replace_document_with_postings`
3. Add text extraction to your search results processing

### Strategy 2: Gradual Migration

**Best for**: Large codebases, production systems, or when you want to test incrementally.

Keep existing code unchanged, add text storage only for new documents:

```rust
// Existing documents continue to work (no text storage)
index.add_postings(legacy_postings).await?;
let legacy_results = index.search(&query, 10, None).await?;

// New documents can include text storage
index.replace_document_with_postings(new_doc_id, text, postings).await?;

// Handle mixed results
for result in search_results {
    match index.get_document_text(result.document_id).await {
        Ok(text) => println!("Document with text: {}", text),
        Err(ShardexError::DocumentTextNotFound { .. }) => {
            println!("Legacy document without text");
        }
        Err(e) => eprintln!("Error: {}", e),
    }
}
```

**Benefits:**
- Zero risk to existing functionality
- Incremental rollout and testing
- Easy rollback if needed

### Strategy 3: Feature-Flag Approach

**Best for**: Applications with feature toggles or A/B testing infrastructure.

Use runtime configuration to control text storage usage:

```rust
struct AppConfig {
    enable_text_storage: bool,
    max_text_size: usize,
}

async fn store_document(
    index: &mut ShardexImpl,
    doc_id: DocumentId,
    text: Option<String>,
    postings: Vec<Posting>,
    config: &AppConfig,
) -> Result<(), ShardexError> {
    match (config.enable_text_storage, text) {
        (true, Some(document_text)) => {
            // New path: atomic replacement with text
            index.replace_document_with_postings(doc_id, document_text, postings).await
        }
        _ => {
            // Legacy path: postings only
            index.add_postings(postings).await
        }
    }
}
```

## Detailed Migration Guide

### Step 1: Update Dependencies

Ensure you're using a version of Shardex that supports document text storage:

```toml
[dependencies]
shardex = "0.2" # Or later version with text storage support
```

### Step 2: Update Configuration

Add text storage configuration to your setup:

```rust
let config = ShardexConfig::new()
    .directory_path("./my_index")
    .vector_size(384)
    // Add these configuration options:
    .max_document_text_size(10 * 1024 * 1024) // 10MB per document
    .shard_size(50000) // Existing config continues to work
    .batch_write_interval_ms(100); // Existing config continues to work
```

**Configuration recommendations by use case:**

```rust
// Chat/Messaging (small, frequent updates)
.max_document_text_size(16 * 1024) // 16KB per message
.batch_write_interval_ms(25) // Low latency

// Document Search (medium documents)
.max_document_text_size(1024 * 1024) // 1MB per document  
.batch_write_interval_ms(50) // Balanced

// Academic Papers (large documents)
.max_document_text_size(50 * 1024 * 1024) // 50MB per document
.batch_write_interval_ms(100) // Throughput optimized
```
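The profiles above can also be captured in code. The `Workload` enum and helper below are hypothetical conveniences for your own application, not part of the Shardex API:

```rust
/// Hypothetical helper mapping the use cases above to
/// (max_document_text_size, batch_write_interval_ms). Not part of Shardex.
#[derive(Clone, Copy)]
enum Workload {
    Chat,
    DocumentSearch,
    AcademicPapers,
}

fn text_storage_profile(workload: Workload) -> (usize, u64) {
    match workload {
        Workload::Chat => (16 * 1024, 25),                   // small, frequent updates
        Workload::DocumentSearch => (1024 * 1024, 50),       // balanced
        Workload::AcademicPapers => (50 * 1024 * 1024, 100), // throughput optimized
    }
}
```

Centralizing these numbers in one place makes it easy to tune them per deployment without touching the index setup code.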

### Step 3: Update Document Storage Logic

Choose your migration approach:

#### Option A: Full Migration
```rust
// Before
async fn update_document(
    index: &mut ShardexImpl, 
    doc_id: DocumentId, 
    new_postings: Vec<Posting>
) -> Result<(), ShardexError> {
    index.remove_documents(vec![doc_id]).await?;
    index.add_postings(new_postings).await?;
    index.flush().await
}

// After  
async fn update_document(
    index: &mut ShardexImpl,
    doc_id: DocumentId, 
    text: String,
    new_postings: Vec<Posting>
) -> Result<(), ShardexError> {
    index.replace_document_with_postings(doc_id, text, new_postings).await
    // Note: No separate flush() needed - operation is already atomic
}
```

#### Option B: Gradual Migration
```rust
async fn store_document(
    index: &mut ShardexImpl,
    doc_id: DocumentId,
    text: Option<String>,
    postings: Vec<Posting>
) -> Result<(), ShardexError> {
    match text {
        Some(document_text) => {
            // Use new atomic operation for documents with text
            index.replace_document_with_postings(doc_id, document_text, postings).await
        }
        None => {
            // Use legacy approach for documents without text
            index.add_postings(postings).await
        }
    }
}
```

### Step 4: Update Search Result Processing

Add text extraction to your search result handling:

```rust
// Before
async fn process_search_results(
    index: &ShardexImpl,
    results: Vec<SearchResult>
) -> Result<(), ShardexError> {
    for result in results {
        println!("Document {} (score: {:.3})", 
                 result.document_id.to_u128(), 
                 result.similarity_score);
    }
    Ok(())
}

// After
async fn process_search_results(
    index: &ShardexImpl,
    results: Vec<SearchResult>
) -> Result<(), ShardexError> {
    for result in results {
        // Create posting for text extraction
        let posting = Posting {
            document_id: result.document_id,
            start: result.start,
            length: result.length,
            vector: result.vector,
        };
        
        // Try to extract text snippet
        match index.extract_text(&posting).await {
            Ok(text_snippet) => {
                println!("Document {} (score: {:.3}): '{}'",
                         result.document_id.to_u128(),
                         result.similarity_score,
                         text_snippet);
            }
            Err(ShardexError::DocumentTextNotFound { .. }) => {
                // Handle legacy documents without text gracefully
                println!("Document {} (score: {:.3}) [no text available]",
                         result.document_id.to_u128(),
                         result.similarity_score);
            }
            Err(e) => {
                eprintln!("Text extraction error: {}", e);
            }
        }
    }
    Ok(())
}
```

### Step 5: Add Error Handling

Update your error handling to cover text-specific errors:

```rust
use shardex::ShardexError;

async fn safe_text_extraction(
    index: &ShardexImpl,
    posting: &Posting
) -> Option<String> {
    match index.extract_text(posting).await {
        Ok(text) => Some(text),
        Err(ShardexError::DocumentTextNotFound { document_id }) => {
            eprintln!("No text stored for document {}", document_id);
            None
        }
        Err(ShardexError::InvalidRange { start, length, document_length }) => {
            eprintln!("Invalid coordinates {}..{} for document length {}", 
                     start, start + length, document_length);
            None
        }
        Err(ShardexError::DocumentTooLarge { size, max_size }) => {
            eprintln!("Document {} bytes exceeds limit {} bytes", size, max_size);
            None  
        }
        Err(e) => {
            eprintln!("Unexpected error: {}", e);
            None
        }
    }
}
```

## Migration Checklist

### Pre-Migration
- [ ] Update Shardex to a version with text storage support
- [ ] Choose migration strategy (immediate, gradual, or feature-flag)
- [ ] Determine appropriate `max_document_text_size` for your use case
- [ ] Back up existing indexes (if they contain valuable data)
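For a memory-mapped index, a backup can be as simple as copying the index directory while no process has it open. The paths below are illustrative:

```shell
# Illustrative backup of a Shardex index directory (paths are examples).
# Run while no process has the index open, so the copy is consistent.
if [ -d ./my_index ]; then
    cp -r ./my_index "./my_index.backup-$(date +%Y%m%d)"
fi
```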

### During Migration
- [ ] Update configuration to enable text storage
- [ ] Update document storage logic (based on chosen strategy)
- [ ] Update search result processing to handle text extraction
- [ ] Add proper error handling for text-specific errors
- [ ] Test with a small subset of data first

### Post-Migration
- [ ] Verify all existing functionality still works
- [ ] Test new text storage features
- [ ] Monitor performance impact (should be minimal for vector operations)
- [ ] Update monitoring/logging to track text storage usage
- [ ] Update documentation for your team

## Performance Considerations

### Memory Usage
- **Text storage overhead**: ~32 bytes per document + actual text size
- **Memory mapping**: OS manages paging, no significant memory impact
- **Vector operations**: No performance change to existing vector search

### Disk Usage
- **Text files**: Additional `text_index.dat` and `text_data.dat` files
- **Estimate**: roughly 1.1x the total size of your document texts
- **Growth**: Files grow as documents are added/updated (old versions retained until compaction)
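These figures give a quick back-of-the-envelope formula. The helper below is a hypothetical sketch using the ~32-byte per-document index overhead and ~1.1x factor quoted above:

```rust
/// Rough text-storage disk estimate from the figures above:
/// ~32 bytes of index overhead per document plus ~1.1x the raw text bytes.
/// Integer math (x + x / 10) approximates the 1.1x factor.
fn estimate_text_storage_bytes(doc_count: u64, total_text_bytes: u64) -> u64 {
    doc_count * 32 + total_text_bytes + total_text_bytes / 10
}
```

For example, 1,000 documents totaling 1 MB of text would need roughly 1.13 MB of additional disk space.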

### Search Performance
- **Vector search**: No change - text storage doesn't affect vector operations
- **Text extraction**: Very fast - O(1) memory-mapped access after index lookup
- **Mixed queries**: Handle text-enabled and text-disabled documents seamlessly

## Testing Your Migration

### Unit Tests
```rust
#[tokio::test]
async fn test_migration_compatibility() -> Result<(), Box<dyn std::error::Error>> {
    let config = ShardexConfig::new()
        .directory_path("./test_migration")
        .vector_size(128)
        .max_document_text_size(1024);
    
    let mut index = ShardexImpl::create(config).await?;
    
    // Test 1: Legacy posting operations still work
    let legacy_posting = Posting {
        document_id: DocumentId::from_u128(1),
        start: 0,
        length: 100,
        vector: vec![0.1; 128],
    };
    
    index.add_postings(vec![legacy_posting.clone()]).await?;
    let results = index.search(&vec![0.1; 128], 1, None).await?;
    assert!(!results.is_empty());
    
    // Test 2: New text storage operations work
    let doc_text = "Test document with text";
    let text_posting = Posting {
        document_id: DocumentId::from_u128(2),
        start: 0,
        length: doc_text.len() as u32,
        vector: vec![0.2; 128],
    };
    
    index.replace_document_with_postings(
        text_posting.document_id,
        doc_text.to_string(),
        vec![text_posting]
    ).await?;
    
    let retrieved_text = index.get_document_text(DocumentId::from_u128(2)).await?;
    assert_eq!(retrieved_text, doc_text);
    
    // Test 3: Mixed results handling
    let all_results = index.search(&vec![0.15; 128], 5, None).await?;
    
    for result in all_results {
        match index.get_document_text(result.document_id).await {
            Ok(text) => println!("Document with text: {}", text),
            Err(ShardexError::DocumentTextNotFound { .. }) => {
                println!("Legacy document: {}", result.document_id.to_u128());
            }
            Err(e) => panic!("Unexpected error: {}", e),
        }
    }
    
    Ok(())
}
```

### Integration Tests
```rust
#[tokio::test]
async fn test_large_scale_migration() -> Result<(), Box<dyn std::error::Error>> {
    let config = ShardexConfig::new()
        .directory_path("./test_large_migration")
        .vector_size(256)
        .max_document_text_size(10 * 1024); // 10KB per doc
    
    let mut index = ShardexImpl::create(config).await?;
    
    // Add 1000 legacy documents
    let mut legacy_postings = Vec::new();
    for i in 0..1000 {
        legacy_postings.push(Posting {
            document_id: DocumentId::from_u128(i),
            start: 0,
            length: 100,
            vector: generate_test_vector(i as f32, 256),
        });
    }
    
    index.add_postings(legacy_postings).await?;
    
    // Add 1000 text-enabled documents
    for i in 1000..2000 {
        let doc_text = format!("Document {} with text content for testing", i);
        let posting = Posting {
            document_id: DocumentId::from_u128(i),
            start: 0,
            length: doc_text.len() as u32,
            vector: generate_test_vector(i as f32, 256),
        };
        
        index.replace_document_with_postings(
            posting.document_id,
            doc_text,
            vec![posting]
        ).await?;
    }
    
    // Test search across both types
    let query = generate_test_vector(1500.0, 256);
    let results = index.search(&query, 50, None).await?;
    
    let mut text_enabled_count = 0;
    let mut legacy_count = 0;
    
    for result in results {
        match index.get_document_text(result.document_id).await {
            Ok(_) => text_enabled_count += 1,
            Err(ShardexError::DocumentTextNotFound { .. }) => legacy_count += 1,
            Err(e) => panic!("Unexpected error: {}", e),
        }
    }
    
    println!("Mixed results: {} with text, {} legacy", text_enabled_count, legacy_count);
    assert!(text_enabled_count > 0);
    assert!(legacy_count > 0);
    
    Ok(())
}

fn generate_test_vector(seed: f32, dimension: usize) -> Vec<f32> {
    (0..dimension).map(|i| (seed + i as f32) / dimension as f32).collect()
}
```

## Rollback Procedures

If you need to rollback your migration:

### Option 1: Configuration Rollback
Simply set `max_document_text_size = 0` to disable text storage:

```rust
let config = ShardexConfig::new()
    .directory_path("./my_index")
    .vector_size(384)
    .max_document_text_size(0); // Disable text storage
```

All text storage operations will return `DocumentTextNotFound` errors, but vector operations continue normally.

### Option 2: Code Rollback
Revert to previous version of your code:

```rust
// Remove text-related operations
// index.replace_document_with_postings(...) 
// becomes:
index.add_postings(postings).await?;

// Remove text extraction from search results
// Keep only the similarity scores and document IDs
```

### Option 3: Full Index Rebuild
If you want to completely remove text storage:

1. Export all vector data from existing index
2. Create new index with `max_document_text_size = 0`
3. Re-import only the vector data
4. Text storage files can be deleted
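Step 4 might look like the following; the file names come from the "Disk Usage" section above, and `./old_index` is an illustrative path:

```shell
# Remove the text storage files from the OLD index directory once the
# rebuilt index has been verified. File names as listed under "Disk Usage".
rm -f ./old_index/text_index.dat ./old_index/text_data.dat
```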

## Common Issues and Solutions

### Issue: "Document text not found" errors after migration
**Cause**: Trying to access text for documents added before text storage was enabled.

**Solution**: Handle the error gracefully:
```rust
match index.get_document_text(doc_id).await {
    Ok(text) => println!("Text: {}", text),
    Err(ShardexError::DocumentTextNotFound { .. }) => {
        println!("Legacy document - no text available");
    }
    Err(e) => eprintln!("Error: {}", e),
}
```

### Issue: Performance degradation after enabling text storage
**Cause**: Usually not caused by text storage itself, but by changes in usage patterns.

**Solutions:**
1. Monitor actual memory usage - text storage uses memory-mapped files
2. Check if you're accidentally reading large documents frequently
3. Verify your `max_document_text_size` is appropriate
4. Use text extraction selectively, not for every search result

### Issue: "Document too large" errors
**Cause**: Documents exceed the `max_document_text_size` limit.

**Solutions:**
```rust
// Increase limit if appropriate
.max_document_text_size(50 * 1024 * 1024) // 50MB

// Or split large documents into smaller chunks. The limit is in bytes,
// so split on char boundaries by byte length, not by char count
// (a chunk of N chars can exceed N bytes for multi-byte UTF-8 text).
fn split_large_document(text: &str, max_bytes: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for ch in text.chars() {
        if !current.is_empty() && current.len() + ch.len_utf8() > max_bytes {
            chunks.push(std::mem::take(&mut current));
        }
        current.push(ch);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

### Issue: WAL files growing large after migration
**Cause**: Document replacement operations can generate more WAL entries.

**Solution**: Ensure regular flushing:
```rust
// Flush periodically for long-running operations
if batch_count % 100 == 0 {
    index.flush().await?;
}
```

## Best Practices

1. **Start Small**: Test with a small subset of your data first
2. **Monitor Performance**: Watch memory usage and search latency during migration
3. **Graceful Degradation**: Always handle `DocumentTextNotFound` errors gracefully
4. **Atomic Operations**: Use `replace_document_with_postings` for consistency
5. **Size Limits**: Set appropriate document size limits for your use case
6. **Error Handling**: Implement comprehensive error handling for all text operations
7. **Testing**: Test both text-enabled and legacy document scenarios
8. **Documentation**: Update your team's documentation about the new capabilities

## Getting Help

If you encounter issues during migration:

1. **Check Error Messages**: Text storage errors are descriptive and include suggestions
2. **Review Examples**: Look at `examples/document_text_*.rs` for working code
3. **Test Incrementally**: Migrate small portions at a time to isolate issues
4. **Community Support**: File issues on GitHub with minimal reproduction cases
5. **Rollback Plan**: Always have a rollback strategy ready

## Summary

Document text storage migration is designed to be:
- **Safe**: Full backward compatibility with existing code
- **Gradual**: Multiple migration strategies to fit your needs  
- **Robust**: Comprehensive error handling and recovery procedures
- **Flexible**: Enable/disable at configuration level

Choose the migration strategy that best fits your project's constraints and requirements. The gradual migration approach is recommended for production systems, while immediate migration works well for new features or development environments.