shardex 0.1.0 - Docs.rs

# Getting Started with Shardex

This guide will walk you through setting up and using Shardex, a high-performance memory-mapped vector search engine.

## Prerequisites

- Rust 1.70 or later
- Tokio runtime for async operations
- Sufficient disk space for index files (typically 2-3x the size of your vector data)
- Memory mapping support (Linux, macOS, Windows)

## Installation

Add Shardex to your `Cargo.toml`:

```toml
[dependencies]
shardex = "0.1"
tokio = { version = "1.0", features = ["rt", "macros"] }
```

## Your First Index

Let's create a simple vector search index:

```rust
use shardex::{Shardex, ShardexConfig, ShardexImpl, Posting, DocumentId};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create configuration
    let config = ShardexConfig::new()
        .directory_path("./my_first_index")
        .vector_size(128); // 128-dimensional vectors
    
    // Create the index
    let mut index = ShardexImpl::create(config).await?;
    
    println!("Index created successfully!");
    Ok(())
}
```

## Adding Your First Document

```rust
// Create a posting (document with vector)
let document_id = DocumentId::from_u128(1);
let vector = vec![0.1, 0.2, 0.3, /* ... 128 total values */];

let posting = Posting {
    document_id,
    start: 0,      // Start position in document
    length: 100,   // Length of text segment
    vector,        // The embedding vector
};

// Add to index
index.add_postings(vec![posting]).await?;
index.flush().await?; // Ensure data is written
```

## Performing Your First Search

```rust
// Search for similar vectors
let query_vector = vec![0.15, 0.25, 0.35, /* ... 128 total values */];
let results = index.search(&query_vector, 5, None).await?; // Get top 5 results

for (i, result) in results.iter().enumerate() {
    println!("Result {}: Document {} (similarity: {:.3})", 
        i + 1, 
        result.document_id.to_u128(),
        result.similarity_score
    );
}
```

## Opening an Existing Index

```rust
// Open existing index (configuration is read from metadata)
let index = ShardexImpl::open("./my_first_index").await?;

// Index is ready to use for searching
let results = index.search(&query_vector, 10, None).await?;
```

## Configuration Basics

Shardex provides many configuration options. Here are the most important ones:

### Vector Size
The dimensionality of your vectors. Must be consistent across all vectors in the index.

```rust
.vector_size(384) // For BERT-base embeddings
.vector_size(768) // For BERT-large embeddings  
.vector_size(1536) // For OpenAI text-embedding-ada-002
```

### Shard Size
How many vectors to store in each shard. Larger shards mean fewer splits but more memory usage.

```rust
.shard_size(10000)  // Default: good for most use cases
.shard_size(50000)  // For very large datasets
.shard_size(5000)   // For memory-constrained environments
```

### Directory Path
Where to store the index files. The directory will be created if it doesn't exist.

```rust
.directory_path("./search_index")
.directory_path("/var/lib/myapp/index")
```

## Working with Batches

For best performance, add documents in batches:

```rust
let mut postings = Vec::new();

for i in 0..1000 {
    let document_id = DocumentId::from_u128(i + 1);
    let vector = generate_embedding(i); // Your embedding function
    
    postings.push(Posting {
        document_id,
        start: 0,
        length: 100,
        vector,
    });
}

// Add all at once
index.add_postings(postings).await?;
index.flush().await?;
```

## Error Handling

Always handle errors appropriately:

```rust
match index.add_postings(postings).await {
    Ok(_) => println!("Successfully added postings"),
    Err(ShardexError::InvalidDimension { expected, actual }) => {
        eprintln!("Wrong vector size: expected {}, got {}", expected, actual);
    }
    Err(ShardexError::Io(io_err)) => {
        eprintln!("I/O error: {}", io_err);
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
```

## Monitoring Your Index

Check index statistics to monitor health and performance:

```rust
let stats = index.stats().await?;
println!("Index has {} documents in {} shards", 
    stats.total_postings, 
    stats.total_shards
);
println!("Memory usage: {:.1}MB", 
    stats.memory_usage as f64 / 1024.0 / 1024.0
);
```

## Document Removal

Remove documents by their IDs:

```rust
let document_ids = vec![1, 2, 3, 4, 5]; // Documents to remove
index.remove_documents(document_ids).await?;
index.flush().await?;
```

## Document Text Storage

Shardex can store document source text alongside vector embeddings, enabling rich search experiences:

### Enabling Text Storage

Add text storage to your configuration:

```rust
let config = ShardexConfig::new()
    .directory_path("./text_enabled_index")
    .vector_size(128)
    .max_document_text_size(1024 * 1024); // 1MB per document
```

**Important:** Text storage is disabled by default (`max_document_text_size = 0`).

### Storing Documents with Text

Use `replace_document_with_postings` for atomic document operations:

```rust
let document_text = "The quick brown fox jumps over the lazy dog.";
let doc_id = DocumentId::from_u128(1);

let postings = vec![
    Posting {
        document_id: doc_id,
        start: 0,
        length: 9, // "The quick"
        vector: embedding1,
    },
    Posting {
        document_id: doc_id,
        start: 10,
        length: 9, // "brown fox"
        vector: embedding2,
    },
];

// Store document with text and postings atomically
index.replace_document_with_postings(
    doc_id, 
    document_text.to_string(), 
    postings
).await?;
```

### Retrieving Document Text

Get the complete document text:

```rust
let full_text = index.get_document_text(doc_id).await?;
println!("Document: {}", full_text);
```

### Extracting Text from Search Results

Extract text snippets from search results:

```rust
let results = index.search(&query_vector, 10, None).await?;

for result in results {
    // Convert search result to posting for text extraction
    let posting = Posting {
        document_id: result.document_id,
        start: result.start,
        length: result.length,
        vector: result.vector,
    };
    
    let text_snippet = index.extract_text(&posting).await?;
    println!("Found: '{}' (score: {:.4})", 
             text_snippet, 
             result.similarity_score);
}
```

### Text Storage Error Handling

Handle text-specific errors:

```rust
match index.get_document_text(doc_id).await {
    Ok(text) => println!("Text: {}", text),
    Err(ShardexError::DocumentTextNotFound { document_id }) => {
        println!("No text stored for document {}", document_id);
    }
    Err(ShardexError::InvalidRange { start, length, document_length }) => {
        println!("Invalid coordinates {}..{} for document length {}", 
                 start, start + length, document_length);
    }
    Err(e) => println!("Other error: {}", e),
}
```

### Text Storage Best Practices

1. **Size Limits**: Set appropriate `max_document_text_size` for your use case
2. **Atomic Operations**: Always use `replace_document_with_postings` for consistency
3. **Error Handling**: Handle text-specific error types
4. **Backward Compatibility**: Indexes without text storage continue to work

### Complete Text Storage Example

```rust
use shardex::{Shardex, ShardexConfig, ShardexImpl, Posting, DocumentId};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create index with text storage
    let config = ShardexConfig::new()
        .directory_path("./text_example")
        .vector_size(128)
        .max_document_text_size(1024 * 1024);
    
    let mut index = ShardexImpl::create(config).await?;
    
    // Store document with text
    let doc_text = "Artificial intelligence is transforming technology.";
    let doc_id = DocumentId::from_u128(1);
    
    let postings = vec![
        Posting {
            document_id: doc_id,
            start: 0,
            length: 21, // "Artificial intelligence"
            vector: generate_embedding("artificial intelligence"),
        },
        Posting {
            document_id: doc_id,
            start: 25,
            length: 12, // "transforming"
            vector: generate_embedding("transforming"),
        }
    ];
    
    index.replace_document_with_postings(
        doc_id, 
        doc_text.to_string(), 
        postings
    ).await?;
    
    // Search and extract text
    let query = generate_embedding("AI technology");
    let results = index.search(&query, 5, None).await?;
    
    for result in results {
        let posting = Posting {
            document_id: result.document_id,
            start: result.start,
            length: result.length,
            vector: result.vector,
        };
        
        if let Ok(text) = index.extract_text(&posting).await {
            println!("Match: '{}' (score: {:.3})", text, result.similarity_score);
        }
    }
    
    Ok(())
}

// Placeholder function - use your actual embedding model
fn generate_embedding(text: &str) -> Vec<f32> {
    vec![0.1; 128] // Replace with real embeddings
}
```

## Search Parameters

### K (Number of Results)
How many similar vectors to return:

```rust
let results = index.search(&query, 10, None).await?; // Get top 10
```

### Slop Factor (Search Breadth)
Controls the trade-off between speed and accuracy:

```rust
let results = index.search(&query, 10, Some(1)).await?; // Fastest, least accurate
let results = index.search(&query, 10, Some(5)).await?; // Balanced (default)
let results = index.search(&query, 10, Some(10)).await?; // Slowest, most accurate
```

## Complete Example

Here's a complete working example:

```rust
use shardex::{Shardex, ShardexConfig, ShardexImpl, Posting, DocumentId};
use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    // Create index
    let config = ShardexConfig::new()
        .directory_path("./tutorial_index")
        .vector_size(3); // Small example
    
    let mut index = ShardexImpl::create(config).await?;
    
    // Add some documents
    let documents = vec![
        (1, vec![1.0, 0.0, 0.0]), // Document 1
        (2, vec![0.0, 1.0, 0.0]), // Document 2  
        (3, vec![0.0, 0.0, 1.0]), // Document 3
        (4, vec![0.5, 0.5, 0.0]), // Document 4
    ];
    
    let mut postings = Vec::new();
    for (id, vector) in documents {
        postings.push(Posting {
            document_id: DocumentId::from_u128(id),
            start: 0,
            length: 100,
            vector,
        });
    }
    
    index.add_postings(postings).await?;
    index.flush().await?;
    
    // Search for vector similar to [1.0, 0.1, 0.0]
    let query = vec![1.0, 0.1, 0.0];
    let results = index.search(&query, 2, None).await?;
    
    println!("Search results:");
    for result in results {
        println!("Document {}: similarity {:.3}", 
            result.document_id.to_u128(),
            result.similarity_score
        );
    }
    
    // Cleanup
    std::fs::remove_dir_all("./tutorial_index")?;
    
    Ok(())
}
```

## Next Steps

- Read the [Architecture Overview](architecture.md) to understand how Shardex works
- Check the [Performance Tuning](performance.md) guide for optimization tips
- See the [examples/](../examples/) directory for more complex use cases
- Refer to the [API Reference](https://docs.rs/shardex) for complete documentation

## Common Gotchas

1. **Vector Dimension Mismatch**: All vectors must have the same dimension as specified in the config
2. **Forgetting to Flush**: Call `flush()` to ensure data is written to disk
3. **Empty Vectors**: Vectors cannot be empty
4. **Directory Permissions**: Ensure the process has write permissions to the index directory
5. **Concurrent Access**: Each index can only be used by one process at a time

## Getting Help

If you encounter issues:

1. Check the [Troubleshooting Guide](troubleshooting.md)
2. Look at the examples in the repository
3. File an issue on GitHub with a minimal reproduction case
4. Check that your configuration is valid for your use case