shardex 0.1.0 - Docs.rs

# Shardex API Reference

This document provides detailed API reference for all public types, traits, and functions in Shardex.

## Core Traits and Types

### `Shardex` Trait

The main interface for vector search operations.

```rust
#[async_trait]
pub trait Shardex {
    type Error;

    async fn create(config: ShardexConfig) -> Result<Self, Self::Error>;
    async fn open<P: AsRef<Path> + Send>(directory_path: P) -> Result<Self, Self::Error>;
    async fn add_postings(&mut self, postings: Vec<Posting>) -> Result<(), Self::Error>;
    async fn remove_documents(&mut self, document_ids: Vec<u128>) -> Result<(), Self::Error>;
    async fn search(
        &self,
        query_vector: &[f32],
        k: usize,
        slop_factor: Option<usize>,
    ) -> Result<Vec<SearchResult>, Self::Error>;
    async fn search_with_metric(
        &self,
        query_vector: &[f32],
        k: usize,
        metric: DistanceMetric,
        slop_factor: Option<usize>,
    ) -> Result<Vec<SearchResult>, Self::Error>;
    async fn flush(&mut self) -> Result<(), Self::Error>;
    async fn flush_with_stats(&mut self) -> Result<FlushStats, Self::Error>;
    async fn stats(&self) -> Result<IndexStats, Self::Error>;
    async fn detailed_stats(&self) -> Result<DetailedIndexStats, Self::Error>;

    // Document Text Storage Methods
    async fn get_document_text(&self, document_id: DocumentId) -> Result<String, Self::Error>;
    async fn extract_text(&self, posting: &Posting) -> Result<String, Self::Error>;
    async fn replace_document_with_postings(
        &mut self,
        document_id: DocumentId,
        text: String,
        postings: Vec<Posting>,
    ) -> Result<(), Self::Error>;
}
```

#### Methods

##### `create(config: ShardexConfig) -> Result<Self, Self::Error>`

Creates a new index with the specified configuration.

**Parameters:**
- `config`: Configuration for the new index

**Returns:**
- `Ok(Self)`: Successfully created index
- `Err(ShardexError)`: Configuration error, I/O error, or other failure

**Example:**
```rust
let config = ShardexConfig::new()
    .directory_path("./my_index")
    .vector_size(384);
let index = ShardexImpl::create(config).await?;
```

##### `open<P: AsRef<Path> + Send>(directory_path: P) -> Result<Self, Self::Error>`

Opens an existing index from the specified directory.

**Parameters:**
- `directory_path`: Path to the index directory

**Returns:**
- `Ok(Self)`: Successfully opened index
- `Err(ShardexError)`: Index not found, corruption, or I/O error

**Example:**
```rust
let index = ShardexImpl::open("./existing_index").await?;
```

##### `add_postings(&mut self, postings: Vec<Posting>) -> Result<(), Self::Error>`

Adds multiple postings to the index in a batch operation.

**Parameters:**
- `postings`: Vector of postings to add

**Returns:**
- `Ok(())`: Successfully queued postings for addition
- `Err(ShardexError)`: Invalid dimensions, I/O error, or other failure

**Example:**
```rust
let postings = vec![
    Posting {
        document_id: DocumentId::from_u128(1),
        start: 0,
        length: 100,
        vector: vec![0.1; 384],
    }
];
index.add_postings(postings).await?;
```

##### `remove_documents(&mut self, document_ids: Vec<u128>) -> Result<(), Self::Error>`

Removes all postings for the specified documents.

**Parameters:**
- `document_ids`: List of document IDs to remove

**Returns:**
- `Ok(())`: Successfully queued documents for removal
- `Err(ShardexError)`: I/O error or other failure

**Example:**
```rust
index.remove_documents(vec![1, 2, 3]).await?;
```

##### `search(&self, query_vector: &[f32], k: usize, slop_factor: Option<usize>) -> Result<Vec<SearchResult>, Self::Error>`

Searches for the k most similar vectors using cosine similarity.

**Parameters:**
- `query_vector`: The query vector to search for
- `k`: Number of results to return
- `slop_factor`: Optional search breadth factor (default uses config value)

**Returns:**
- `Ok(Vec<SearchResult>)`: Search results sorted by similarity (highest first)
- `Err(ShardexError)`: Invalid dimensions, I/O error, or other failure

**Example:**
```rust
let query = vec![0.1; 384];
let results = index.search(&query, 10, None).await?;
```

##### `flush(&mut self) -> Result<(), Self::Error>`

Flushes all pending operations to disk.

**Returns:**
- `Ok(())`: Successfully flushed all operations
- `Err(ShardexError)`: I/O error or other failure

**Example:**
```rust
index.add_postings(postings).await?;
index.flush().await?; // Ensure data is written
```

##### `stats(&self) -> Result<IndexStats, Self::Error>`

Returns basic index statistics.

**Returns:**
- `Ok(IndexStats)`: Current index statistics
- `Err(ShardexError)`: I/O error or other failure

**Example:**
```rust
let stats = index.stats().await?;
println!("Index has {} documents", stats.total_postings);
```

##### `get_document_text(&self, document_id: DocumentId) -> Result<String, Self::Error>`

Retrieves the complete current text for a document.

**Parameters:**
- `document_id`: The document identifier

**Returns:**
- `Ok(String)`: The complete document text
- `Err(ShardexError)`: Error if document not found or text storage disabled

**Example:**
```rust
let document_text = index.get_document_text(document_id).await?;
println!("Document content: {}", document_text);
```

##### `extract_text(&self, posting: &Posting) -> Result<String, Self::Error>`

Extracts text substring using posting coordinates (document_id, start, length).

**Parameters:**
- `posting`: Posting containing document coordinates

**Returns:**
- `Ok(String)`: Extracted text substring
- `Err(ShardexError)`: Error if coordinates invalid or document not found

**Example:**
```rust
// From search results
let search_results = index.search(&query_vector, 10, None).await?;
for result in search_results {
    let posting = Posting {
        document_id: result.document_id,
        start: result.start,
        length: result.length,
        vector: result.vector,
    };
    
    let text_snippet = index.extract_text(&posting).await?;
    println!("Found: '{}'", text_snippet);
}
```

##### `replace_document_with_postings(&mut self, document_id: DocumentId, text: String, postings: Vec<Posting>) -> Result<(), Self::Error>`

Atomically replaces document text and all associated postings in a single transaction.

**Parameters:**
- `document_id`: Document identifier
- `text`: New document text content
- `postings`: New postings for the document

**Returns:**
- `Ok(())`: Operation completed successfully
- `Err(ShardexError)`: Error during atomic replacement

**Example:**
```rust
let document_text = "The quick brown fox jumps over the lazy dog.";
let postings = vec![
    Posting {
        document_id: doc_id,
        start: 0,
        length: 9, // "The quick"
        vector: embedding1,
    },
    Posting {
        document_id: doc_id,
        start: 10,
        length: 9, // "brown fox"
        vector: embedding2,
    },
    Posting {
        document_id: doc_id,
        start: 20,
        length: 5, // "jumps"
        vector: embedding3,
    },
];

index.replace_document_with_postings(doc_id, document_text.to_string(), postings).await?;
```

## Data Structures

### `Posting`

Represents a document posting with its vector embedding.

```rust
pub struct Posting {
    pub document_id: DocumentId,
    pub start: u32,
    pub length: u32,
    pub vector: Vec<f32>,
}
```

**Fields:**
- `document_id`: Unique identifier for the document
- `start`: Starting byte position within the document
- `length`: Length of the text segment in bytes
- `vector`: Vector embedding for this posting segment

### `SearchResult`

A search result containing a posting with similarity score.

```rust
pub struct SearchResult {
    pub document_id: DocumentId,
    pub start: u32,
    pub length: u32,
    pub vector: Vec<f32>,
    pub similarity_score: f32,
}
```

**Fields:**
- `document_id`: Unique identifier for the document
- `start`: Starting byte position within the document
- `length`: Length of the text segment in bytes
- `vector`: Vector embedding for this result segment
- `similarity_score`: Similarity score (higher means more similar)

### `IndexStats`

Basic index statistics for monitoring.

```rust
pub struct IndexStats {
    pub total_shards: usize,
    pub total_postings: usize,
    pub pending_operations: usize,
    pub memory_usage: usize,
    pub active_postings: usize,
    pub deleted_postings: usize,
    pub average_shard_utilization: f32,
    pub vector_dimension: usize,
    pub disk_usage: usize,
    pub search_latency_p50: Duration,
    pub search_latency_p95: Duration,
    pub search_latency_p99: Duration,
    pub write_throughput: f32,
    pub bloom_filter_hit_rate: f32,
}
```

### `DetailedIndexStats`

Comprehensive index statistics with performance metrics.

```rust
pub struct DetailedIndexStats {
    pub total_shards: usize,
    pub total_postings: usize,
    pub active_postings: usize,
    pub deleted_postings: usize,
    pub vector_dimension: usize,
    pub memory_usage: usize,
    pub disk_usage: usize,
    pub average_shard_utilization: f32,
    pub bloom_filter_stats: Option<BloomFilterStats>,
    // Additional performance metrics...
}
```

## Configuration

### `ShardexConfig`

Configuration for creating and tuning Shardex indexes.

```rust
pub struct ShardexConfig {
    pub directory_path: PathBuf,
    pub vector_size: usize,
    pub shard_size: usize,
    pub shardex_segment_size: usize,
    pub wal_segment_size: usize,
    pub batch_write_interval_ms: u64,
    pub default_slop_factor: usize,
    pub bloom_filter_size: usize,
    pub max_document_text_size: usize,
}
```

#### Constructor and Builder Methods

```rust
impl ShardexConfig {
    pub fn new() -> Self;
    pub fn directory_path<P: Into<PathBuf>>(mut self, path: P) -> Self;
    pub fn vector_size(mut self, size: usize) -> Self;
    pub fn shard_size(mut self, size: usize) -> Self;
    pub fn shardex_segment_size(mut self, size: usize) -> Self;
    pub fn wal_segment_size(mut self, size: usize) -> Self;
    pub fn batch_write_interval_ms(mut self, ms: u64) -> Self;
    pub fn default_slop_factor(mut self, factor: usize) -> Self;
    pub fn bloom_filter_size(mut self, size: usize) -> Self;
    pub fn max_document_text_size(mut self, size: usize) -> Self;
}
```

#### Configuration Parameters

##### `directory_path: PathBuf`

Directory where index files will be stored.

**Default:** `"./shardex_index"`

**Example:**
```rust
.directory_path("/var/lib/myapp/search_index")
```

##### `vector_size: usize`

Number of dimensions in each vector. Must be consistent across all vectors.

**Default:** `384`

**Example:**
```rust
.vector_size(768) // For BERT-large embeddings
```

##### `shard_size: usize`

Maximum number of vectors per shard before splitting.

**Default:** `10000`

**Trade-offs:**
- Smaller values: Lower memory per search, more frequent splits
- Larger values: Higher memory per search, fewer splits

**Example:**
```rust
.shard_size(50000) // For high-throughput scenarios
```

##### `batch_write_interval_ms: u64`

Interval between WAL batch processing operations.

**Default:** `100ms`

**Trade-offs:**
- Smaller values: Lower latency, higher CPU overhead
- Larger values: Higher latency, better throughput

**Example:**
```rust
.batch_write_interval_ms(50) // For low-latency applications
```

##### `default_slop_factor: usize`

Default number of shards to search when no slop factor is specified.

**Default:** `3`

**Trade-offs:**
- Smaller values: Faster search, potentially lower accuracy
- Larger values: Slower search, higher accuracy

**Example:**
```rust
.default_slop_factor(5) // For high-accuracy applications
```

##### `max_document_text_size: usize`

Maximum size in bytes for document text storage per document. Set to 0 to disable text storage.

**Default:** `0` (disabled)

**Trade-offs:**
- 0: Text storage disabled, minimal overhead
- Small values: Memory efficient, may limit document sizes
- Large values: Supports large documents, higher memory usage

**Example:**
```rust
.max_document_text_size(10 * 1024 * 1024) // 10MB per document
.max_document_text_size(0) // Disable text storage
```

## Error Handling

### `ShardexError`

Comprehensive error type for all Shardex operations.

```rust
#[derive(Debug, Error)]
pub enum ShardexError {
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    
    #[error("Invalid vector dimension: expected {expected}, got {actual}")]
    InvalidDimension { expected: usize, actual: usize },
    
    #[error("Index corruption detected: {0}")]
    Corruption(String),
    
    #[error("Configuration error: {0}")]
    Config(String),
    
    #[error("Document text not found for document ID: {document_id}")]
    DocumentTextNotFound { document_id: String },
    
    #[error("Invalid range: start={start}, length={length} for document length {document_length}")]
    InvalidRange { start: u32, length: u32, document_length: u64 },
    
    #[error("Document text size {size} exceeds maximum allowed {max_size}")]
    DocumentTooLarge { size: usize, max_size: usize },
    
    #[error("Text storage corruption detected: {details}")]
    TextCorruption { details: String },
}
```

#### Error Variants

##### `Io(std::io::Error)`

File system I/O errors (permissions, disk space, etc.).

**Common Causes:**
- Insufficient disk space
- Permission denied
- Directory not found
- File system errors

##### `InvalidDimension { expected: usize, actual: usize }`

Vector dimension mismatch.

**Common Causes:**
- Adding vectors with wrong dimension
- Query vector has wrong dimension
- Configuration mismatch

##### `Corruption(String)`

Index corruption detected.

**Common Causes:**
- Unexpected shutdown during write
- File system corruption
- Manual file modification

##### `Config(String)`

Invalid configuration parameters.

**Common Causes:**
- Zero or negative values for required parameters
- Invalid file paths
- Incompatible parameter combinations

##### `DocumentTextNotFound { document_id: String }`

Document text not available for the requested document ID.

**Common Causes:**
- Document ID does not exist
- Text storage is disabled (max_document_text_size = 0)
- Document was added without text storage

##### `InvalidRange { start: u32, length: u32, document_length: u64 }`

Text extraction coordinates are invalid for the document.

**Common Causes:**
- Start position beyond document end
- Length extends beyond document end
- Negative or invalid coordinates

##### `DocumentTooLarge { size: usize, max_size: usize }`

Document text exceeds the configured size limit.

**Common Causes:**
- Document larger than max_document_text_size
- Configuration needs adjustment for large documents

##### `TextCorruption { details: String }`

Text storage file corruption detected during read/write operations.

**Common Causes:**
- Unexpected shutdown during write operations
- File system corruption
- Manual modification of text storage files

## Identifiers

### `DocumentId`

128-bit document identifier type.

```rust
pub struct DocumentId(ulid::Ulid);

impl DocumentId {
    pub fn new() -> Self;
    pub fn from_u128(value: u128) -> Self;
    pub fn to_u128(&self) -> u128;
    pub fn from_ulid(ulid: ulid::Ulid) -> Self;
    pub fn to_ulid(&self) -> ulid::Ulid;
}
```

**Example:**
```rust
let doc_id = DocumentId::new(); // Generate new ID
let doc_id = DocumentId::from_u128(12345); // From integer
println!("Document ID: {}", doc_id.to_u128());
```

### `ShardId`

128-bit shard identifier type.

```rust
pub struct ShardId(ulid::Ulid);
```

Similar interface to `DocumentId` but used internally for shard management.

## Distance Metrics

### `DistanceMetric`

Supported distance/similarity metrics.

```rust
pub enum DistanceMetric {
    Cosine,
    Euclidean,
    DotProduct,
}
```

#### Metrics

##### `Cosine`

Cosine similarity (default). Good for normalized vectors.

Range: [-1, 1] (higher is more similar)

##### `Euclidean`

Euclidean distance. Good for spatial data.

Range: [0, ∞] (lower is more similar)

##### `DotProduct`

Dot product similarity. Good for magnitude-sensitive comparisons.

Range: [-∞, ∞] (higher is more similar)

**Example:**
```rust
use shardex::DistanceMetric;

let results = index.search_with_metric(
    &query_vector,
    10,
    DistanceMetric::Euclidean,
    Some(3)
).await?;
```

## Document Text Storage

Shardex supports storing and retrieving document source text alongside vector embeddings, enabling rich search experiences with text extraction.

### Overview

Document text storage is an optional feature that allows you to:

- Store complete document text with vector postings
- Extract text snippets from search results
- Atomically update documents and their postings
- Maintain text-to-vector coordinate mapping

### Configuration

Enable document text storage by setting `max_document_text_size` in your configuration:

```rust
let config = ShardexConfig {
    directory_path: "./my_index".into(),
    vector_size: 128,
    max_document_text_size: 10 * 1024 * 1024, // 10MB per document
    ..Default::default()
};
```

**Important:** Text storage is disabled by default (`max_document_text_size = 0`).

### File Layout

When text storage is enabled, additional files are created:

```
index_directory/
├── text_index.dat     # Document text index entries
├── text_data.dat      # Raw document text data
├── shard_0/           # Vector postings (unchanged)
├── shard_1/
└── wal.log           # Write-ahead log (includes text operations)
```

### Usage Patterns

#### Basic Document Storage

```rust
let document_text = "The quick brown fox jumps over the lazy dog.";
let doc_id = DocumentId::new();

let postings = vec![
    Posting {
        document_id: doc_id,
        start: 0,
        length: 9, // "The quick"
        vector: embedding1,
    },
    Posting {
        document_id: doc_id,
        start: 10,
        length: 9, // "brown fox"
        vector: embedding2,
    },
];

// Store document with text and postings atomically
index.replace_document_with_postings(doc_id, document_text.to_string(), postings).await?;
```

#### Text Retrieval

```rust
// Get complete document text
let full_text = index.get_document_text(doc_id).await?;

// Extract text from posting coordinates
let posting = Posting {
    document_id: doc_id,
    start: 0,
    length: 9,
    vector: query_vector,
};
let text_snippet = index.extract_text(&posting).await?; // "The quick"
```

#### Search with Text Extraction

```rust
let search_results = index.search(&query_vector, 10, None).await?;

for result in search_results {
    // Convert search result to posting for text extraction
    let posting = Posting {
        document_id: result.document_id,
        start: result.start,
        length: result.length,
        vector: result.vector,
    };
    
    let text_snippet = index.extract_text(&posting).await?;
    println!("Found: '{}' (score: {:.4})", text_snippet, result.similarity_score);
}
```

### Performance Characteristics

#### Memory Usage
- **Index overhead**: ~32 bytes per document
- **Text storage**: Actual text size + alignment padding
- **Memory mapping**: OS manages paging automatically

#### Lookup Performance
- **Text retrieval**: O(d) where d is number of document versions (typically small)
- **Text extraction**: O(1) after index lookup (memory-mapped access)
- **Search**: No performance impact on vector operations

#### Storage Efficiency
- **Append-only**: All document versions stored for ACID properties
- **Compaction**: Manual compaction removes old versions
- **Compression**: Not implemented (relies on file system compression)

### Best Practices

#### Document Size Management
```rust
// Configure appropriate size limits
let config = ShardexConfig::new()
    .max_document_text_size(50 * 1024 * 1024) // 50MB for large documents
    .directory_path("./large_doc_index");

// Handle oversized documents
match index.replace_document_with_postings(doc_id, large_text, postings).await {
    Err(ShardexError::DocumentTooLarge { size, max_size }) => {
        eprintln!("Document {} bytes exceeds limit {} bytes", size, max_size);
        // Consider splitting document or increasing limit
    }
    Ok(()) => println!("Document stored successfully"),
    Err(e) => eprintln!("Other error: {}", e),
}
```

#### Error Handling
```rust
// Comprehensive error handling for text operations
match index.extract_text(&posting).await {
    Ok(text) => println!("Extracted: {}", text),
    Err(ShardexError::DocumentTextNotFound { document_id }) => {
        eprintln!("No text stored for document {}", document_id);
    }
    Err(ShardexError::InvalidRange { start, length, document_length }) => {
        eprintln!("Invalid coordinates {}..{} for document length {}", 
                  start, start + length, document_length);
    }
    Err(ShardexError::TextCorruption { details }) => {
        eprintln!("Text corruption detected: {}", details);
        // Consider index recovery procedures
    }
    Err(e) => eprintln!("Unexpected error: {}", e),
}
```

#### Atomic Operations
```rust
// Always use atomic replacement for consistency
// DON'T do this (can lead to inconsistency):
// index.remove_documents(vec![doc_id]).await?;
// index.add_postings(new_postings).await?;

// DO this instead (atomic):
index.replace_document_with_postings(doc_id, updated_text, new_postings).await?;
```

### Migration and Compatibility

#### Backward Compatibility
- Indexes without text storage continue to work unchanged
- Text storage is opt-in via configuration
- No breaking changes to existing APIs

#### Adding Text Storage to Existing Indexes
```rust
// Existing index without text storage
let mut index = ShardexImpl::open("./existing_index").await?;

// Text storage methods will return appropriate errors
match index.get_document_text(doc_id).await {
    Err(ShardexError::DocumentTextNotFound { .. }) => {
        println!("Text storage not enabled for this document");
    }
    Ok(text) => println!("Found text: {}", text),
    Err(e) => eprintln!("Other error: {}", e),
}

// Enable text storage by creating new index with updated config
let new_config = ShardexConfig::new()
    .directory_path("./text_enabled_index")
    .max_document_text_size(10 * 1024 * 1024);
let mut new_index = ShardexImpl::create(new_config).await?;
```

## Advanced Types

### `FlushStats`

Statistics returned by flush operations.

```rust
pub struct FlushStats {
    pub operations_flushed: usize,
    pub shards_updated: usize,
    pub wal_segments_processed: usize,
    pub flush_duration: Duration,
}
```

### `BloomFilterStats`

Bloom filter performance statistics.

```rust
pub struct BloomFilterStats {
    pub total_queries: u64,
    pub hits: u64,
    pub misses: u64,
    pub hit_rate: f32,
    pub false_positive_rate: f32,
}
```

## Utility Functions

### Result Type Alias

```rust
pub type Result<T> = std::result::Result<T, ShardexError>;
```

Use this for cleaner error handling in your application code.

## Memory Management

### Memory-Mapped Types

Several types are designed for memory-mapped access:

#### `PostingHeader`

Memory-mapped compatible posting header.

```rust
#[repr(C)]
pub struct PostingHeader {
    pub document_id: DocumentId,
    pub start: u32,
    pub length: u32,
    pub vector_offset: u64,
    pub vector_len: u32,
}
```

#### `SearchResultHeader`

Memory-mapped compatible search result header.

```rust
#[repr(C)]
pub struct SearchResultHeader {
    pub document_id: DocumentId,
    pub start: u32,
    pub length: u32,
    pub vector_offset: u64,
    pub vector_len: u32,
    pub similarity_score: f32,
}
```

## Thread Safety

### Concurrent Access

Shardex types have specific thread safety guarantees:

- **`ShardexImpl`**: Not `Send` or `Sync` - single-threaded access only
- **`ShardexConfig`**: `Send + Sync` - safe to share across threads
- **`Posting`**: `Send + Sync` - safe to share across threads
- **`SearchResult`**: `Send + Sync` - safe to share across threads
- **All error types**: `Send + Sync` - safe to share across threads

For concurrent access patterns, see the `ConcurrentShardex` wrapper type.

## Performance Considerations

### Vector Operations

- All vector operations are optimized for f32 arrays
- SIMD instructions are used when available
- Memory layout is cache-friendly

### Memory Usage

Approximate memory usage calculation:
```
Memory ≈ (num_vectors × vector_dimensions × 4 bytes) × 1.5
```

The 1.5 multiplier accounts for:
- Posting metadata
- Index structures
- Bloom filters
- Operating overhead

### Disk Usage

Approximate disk usage calculation:
```
Disk ≈ Memory × 1.2
```

The multiplier accounts for:
- WAL files
- Metadata files
- Alignment padding

## Version Compatibility

Shardex follows semantic versioning:

- **Major version**: Breaking API changes
- **Minor version**: New features, backward compatible
- **Patch version**: Bug fixes, backward compatible

Index file formats are versioned separately and include migration support for minor version upgrades.