# Future Enhancements
This document tracks features that are planned but deferred to future iterations. Each section includes:
- Current status (stub, not started, etc.)
- Rationale for deferral
- Implementation considerations
- Links to relevant code stubs
---
## Table of Contents
1. [GPU Acceleration for Embeddings](#gpu-acceleration-for-embeddings)
2. [Embedding Cache](#embedding-cache)
3. [AI-Native Protocols](#ai-native-protocols)
4. [Vector Store Migration Utility](#vector-store-migration-utility)
5. [Advanced Search Features](#advanced-search-features)
6. [LLM Client Connection Pooling (DIR-44)](#llm-client-connection-pooling-dir-44)
---
## GPU Acceleration for Embeddings
### Status: Deferred (Stubs Only)
### Rationale
GPU acceleration for embedding models would significantly improve throughput for batch document ingestion. However, it requires:
1. Platform-specific setup (CUDA drivers, Metal framework, Vulkan SDK)
2. Additional feature flags and conditional compilation
3. Testing infrastructure across multiple GPU types
4. Build complexity for CI/CD
### Implementation Considerations
#### ONNX Runtime Execution Providers
FastEmbed uses `ort` (ONNX Runtime) internally. GPU acceleration can be added via execution providers:
```rust
// Potential implementation in src/rag/embeddings.rs
// NOTE: exact import paths and provider types vary between `ort` releases;
// this sketch assumes they are available from the crate root.
use ort::{CUDAExecutionProvider, ExecutionProvider, SessionBuilder};

pub enum AccelerationBackend {
    Cpu,
    Cuda { device_id: usize },
    TensorRT { device_id: usize },
    CoreML,   // macOS
    DirectML, // Windows
    OpenVINO,
}

impl AccelerationBackend {
    fn to_execution_providers(&self) -> Vec<ExecutionProvider> {
        match self {
            Self::Cpu => vec![ExecutionProvider::CPU(Default::default())],
            Self::Cuda { device_id } => vec![
                ExecutionProvider::CUDA(
                    CUDAExecutionProvider::default().with_device_id(*device_id),
                ),
                ExecutionProvider::CPU(Default::default()), // CPU fallback
            ],
            // ... other providers; fall back to CPU until they are wired up
            _ => vec![ExecutionProvider::CPU(Default::default())],
        }
    }
}
```
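How the selected backend would feed into session creation is sketched below. The `environment` handle, the model path, and the exact builder method names are assumptions; they differ between `ort` releases.
```rust
// Hypothetical wiring only: builder method names and the `environment`
// handle vary by `ort` version.
let backend = AccelerationBackend::Cuda { device_id: 0 };
let session = SessionBuilder::new(&environment)?
    .with_execution_providers(backend.to_execution_providers())?
    .with_model_from_file("model.onnx")?;
```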
#### Candle GPU Support (for Qwen3)
Qwen3 embeddings use Candle instead of ONNX. GPU support is available via features:
```toml
# Future Cargo.toml additions (pick one per platform)
candle-core = { version = "0.9", features = ["cuda"] } # NVIDIA
candle-core = { version = "0.9", features = ["metal"] } # Apple Silicon
```
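On the Candle side, runtime device selection could look roughly like the sketch below. It assumes the matching Cargo feature is enabled at build time; the helper name is illustrative, not an existing function in the codebase.
```rust
use candle_core::Device;

// Illustrative device selection for the Qwen3 (Candle) embedding path.
fn select_device() -> candle_core::Result<Device> {
    if candle_core::utils::cuda_is_available() {
        Device::new_cuda(0)
    } else if candle_core::utils::metal_is_available() {
        Device::new_metal(0)
    } else {
        Ok(Device::Cpu)
    }
}
```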
#### Feature Flags (Proposed)
```toml
[features]
# GPU acceleration (mutually exclusive per platform)
cuda = ["ort/cuda", "candle-core/cuda"]
metal = ["ort/coreml", "candle-core/metal"]
vulkan = ["ort/vulkan"]
directml = ["ort/directml"]
```
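A hypothetical compile-time default keyed off those proposed flags (feature names taken from the table above):
```rust
// Hypothetical sketch: choose a default backend from the proposed features.
impl Default for AccelerationBackend {
    fn default() -> Self {
        if cfg!(feature = "cuda") {
            Self::Cuda { device_id: 0 }
        } else if cfg!(feature = "metal") {
            Self::CoreML
        } else {
            Self::Cpu
        }
    }
}
```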
### Stub Location
- `src/rag/embeddings.rs` - `AccelerationBackend` enum (placeholder)
### References
- [ONNX Runtime Execution Providers](https://onnxruntime.ai/docs/execution-providers/)
- [Candle GPU Support](https://github.com/huggingface/candle#with-cuda-support)
---
## Embedding Cache
### Status: Implemented (In-Memory LRU)
The in-memory LRU embedding cache is now fully implemented. See `src/rag/cache.rs`.
### Current Implementation
- **Backend**: In-memory LRU cache using `parking_lot::RwLock<HashMap>`
- **Key Strategy**: SHA-256 hash of `text + model_name` (see the sketch after this list)
- **Eviction Policy**: LRU (Least Recently Used) with configurable max size
- **TTL Support**: Optional per-entry TTL with automatic expiry
- **Thread-safe**: Uses atomic counters and `RwLock` for concurrent access
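A minimal sketch of that key derivation, assuming the `sha2` and `hex` crates; the actual helper in `src/rag/cache.rs` may differ in name and encoding:
```rust
use sha2::{Digest, Sha256};

// Illustrative only: hash the text together with the model name so the same
// text embedded by different models gets distinct cache entries.
fn cache_key(text: &str, model_name: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(text.as_bytes());
    hasher.update(model_name.as_bytes());
    hex::encode(hasher.finalize())
}
```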
### Usage
```rust
use ares::rag::embeddings::{CachedEmbeddingService, EmbeddingConfig};
use ares::rag::cache::CacheConfig;
// Create a cached embedding service
let service = CachedEmbeddingService::new(
    EmbeddingConfig::default(),
    CacheConfig {
        max_size_bytes: 512 * 1024 * 1024, // 512 MB
        default_ttl: None,                 // No expiry
        enabled: true,
    },
)?;
// Embeddings are automatically cached
let embedding = service.embed_text("hello world").await?;
// Check cache stats
let stats = service.cache_stats();
println!("Hit rate: {:.1}%", stats.hit_rate());
```
### Future Backend Options
Additional backends can be added by implementing the `EmbeddingCache` trait:
| Backend | Pros | Cons |
|---------|------|------|
| In-memory (LRU) | Fast, no dependencies | Limited capacity, lost on restart |
| Redis | Distributed, persistent | External dependency |
| Disk (sled/rocksdb) | Large capacity, persistent | Slower than memory |
| SQLite | Simple, persistent | May conflict with main DB |
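For illustration, a persistent disk backend could take roughly the shape below. The trait shown is a hypothetical stand-in so the snippet is self-contained; the real `EmbeddingCache` trait in `src/rag/cache.rs` is authoritative and its methods may differ.
```rust
use async_trait::async_trait;

// Hypothetical trait shape, for illustration only.
#[async_trait]
pub trait EmbeddingCacheBackend: Send + Sync {
    async fn get(&self, key: &str) -> Option<Vec<f32>>;
    async fn put(&self, key: &str, embedding: &[f32]);
}

// Sketch of a persistent disk backend built on `sled`.
pub struct SledEmbeddingCache {
    db: sled::Db,
}

#[async_trait]
impl EmbeddingCacheBackend for SledEmbeddingCache {
    async fn get(&self, key: &str) -> Option<Vec<f32>> {
        let bytes = self.db.get(key).ok().flatten()?;
        // Embeddings are stored as little-endian f32s, 4 bytes each.
        Some(
            bytes
                .chunks_exact(4)
                .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }

    async fn put(&self, key: &str, embedding: &[f32]) {
        let bytes: Vec<u8> = embedding.iter().flat_map(|f| f.to_le_bytes()).collect();
        let _ = self.db.insert(key, bytes);
    }
}
```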
### Configuration
```rust
use ares::rag::cache::CacheConfig;
let config = CacheConfig {
    max_size_bytes: 256 * 1024 * 1024, // 256 MB (default)
    default_ttl: None,                 // No expiry (default)
    enabled: true,                     // Enabled by default
};
```
### Implementation Location
- `src/rag/cache.rs` - `EmbeddingCache` trait, `LruEmbeddingCache`, and `NoOpCache` implementations
- `src/rag/embeddings.rs` - `CachedEmbeddingService` wrapper
### Future Configuration (Proposed TOML)
```toml
[rag.cache]
enabled = true
backend = "memory" # "memory", "redis", "disk"
max_size_mb = 512
ttl_hours = 24
# Redis-specific
[rag.cache.redis]
url = "redis://localhost:6379"
prefix = "ares:embeddings:"
# Disk-specific
[rag.cache.disk]
path = "./data/embedding_cache"
```
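One possible serde-side shape for that proposed table (struct and field names here are illustrative, not part of the current configuration types):
```rust
use std::path::PathBuf;

use serde::Deserialize;

// Hypothetical config structs mirroring the proposed [rag.cache] TOML.
#[derive(Debug, Deserialize)]
pub struct RagCacheConfig {
    pub enabled: bool,
    pub backend: CacheBackendKind,
    pub max_size_mb: u64,
    pub ttl_hours: Option<u64>,
    pub redis: Option<RedisCacheConfig>,
    pub disk: Option<DiskCacheConfig>,
}

#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum CacheBackendKind {
    Memory,
    Redis,
    Disk,
}

#[derive(Debug, Deserialize)]
pub struct RedisCacheConfig {
    pub url: String,
    pub prefix: String,
}

#[derive(Debug, Deserialize)]
pub struct DiskCacheConfig {
    pub path: PathBuf,
}
```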
---
## AI-Native Protocols
### Status: Not Started (Research Required)
### Rationale
These are emerging protocols for agent-to-agent communication and UI integration. They are not yet standardized or widely adopted.
### Protocols to Research
| Protocol | Full Name | Purpose | Status |
|----------|-----------|---------|--------|
| MCP | Model Context Protocol | Tool/context sharing | ✅ Implemented |
| ACP | Agent Communication Protocol | Agent messaging | Research needed |
| AG-UI | Agent UI Protocol | UI components for agents | Research needed |
| ANP | Agent Network Protocol | Multi-agent networking | Research needed |
| A2A | Agent-to-Agent Protocol | Direct agent communication | Research needed |
### Research Questions
1. Are there official specifications for these protocols?
2. Do Rust implementations exist?
3. Which protocols are production-ready vs experimental?
4. How do they integrate with existing MCP implementation?
### Next Steps
1. Search for official documentation/specs
2. Check for Rust crates on crates.io
3. Evaluate stability and adoption
4. Prioritize based on user demand
---
## Vector Store Migration Utility
### Status: Not Started
### Rationale
As users may switch between vector store providers (e.g., from Qdrant to LanceDB), a migration utility would help preserve indexed data.
### Proposed Features
```rust
// Future: src/rag/migration.rs
use std::time::Duration;

pub struct VectorStoreMigration {
    source: Box<dyn VectorStore>,
    destination: Box<dyn VectorStore>,
}

impl VectorStoreMigration {
    pub async fn migrate_collection(
        &self,
        collection: &str,
        batch_size: usize,
        progress: impl Fn(MigrationProgress),
    ) -> Result<MigrationReport> {
        // 1. List all documents in source
        // 2. Batch read documents with embeddings
        // 3. Batch upsert to destination
        // 4. Verify counts match
        todo!("proposed API, not yet implemented")
    }
}

pub struct MigrationProgress {
    pub total: usize,
    pub completed: usize,
    pub current_batch: usize,
}

pub struct MigrationReport {
    pub documents_migrated: usize,
    pub duration: Duration,
    pub errors: Vec<MigrationError>,
}
```
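Hypothetical usage of that proposed API; `new`, `qdrant_store`, and `lancedb_store` are placeholders, not existing items:
```rust
// All names below are illustrative; the migration type does not exist yet.
let migration = VectorStoreMigration::new(Box::new(qdrant_store), Box::new(lancedb_store));

let report = migration
    .migrate_collection("documents", 100, |p| {
        println!("migrated {}/{} documents", p.completed, p.total);
    })
    .await?;

println!(
    "done: {} documents in {:?} ({} errors)",
    report.documents_migrated,
    report.duration,
    report.errors.len()
);
```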
### CLI Integration (Proposed)
```bash
# Future CLI command
ares-server migrate \
--source qdrant \
--source-url http://localhost:6334 \
--destination lancedb \
--destination-path ./data/lancedb \
--collection documents \
--batch-size 100
```
---
## Advanced Search Features
### Status: Partially Planned
### Features for Future Iterations
#### 1. Full-Text Search with Tantivy
```toml
# Future dependency
tantivy = { version = "0.25.0", optional = true }
```
Tantivy is a Lucene-like full-text search engine in Rust. It would provide:
- Advanced query parsing
- Faceted search
- Phrase queries
- Highlighting
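A minimal, self-contained illustration of Tantivy indexing and a phrase query (not wired into the RAG pipeline; the API may shift between tantivy versions):
```rust
use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::{Schema, STORED, TEXT};
use tantivy::{doc, Index};

// Illustrative only: build an in-memory index, add one document, run a phrase query.
fn tantivy_demo() -> tantivy::Result<()> {
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT | STORED);
    let body = schema_builder.add_text_field("body", TEXT);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema);
    let mut writer: tantivy::IndexWriter = index.writer(50_000_000)?;
    writer.add_document(doc!(
        title => "Embedding cache design",
        body => "LRU eviction with optional per-entry TTL",
    ))?;
    writer.commit()?;

    let reader = index.reader()?;
    let searcher = reader.searcher();
    let query = QueryParser::for_index(&index, vec![title, body]).parse_query("\"per-entry TTL\"")?;
    let hits = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} matching documents", hits.len());
    Ok(())
}
```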
#### 2. Approximate Nearest Neighbor (ANN) Tuning
Most vector stores support ANN algorithm tuning:
```toml
[rag.vectorstore.qdrant]
# HNSW index parameters
hnsw_m = 16 # Number of connections per layer
hnsw_ef_construct = 100 # Construction-time search width
```
#### 3. Multi-Vector Search
Support for multiple embedding types per document:
```rust
pub struct MultiVectorDocument {
    pub id: String,
    pub content: String,
    pub dense_embedding: Vec<f32>,       // Semantic
    pub sparse_embedding: SparseVec,     // BM25/SPLADE
    pub late_interaction: Vec<Vec<f32>>, // ColBERT-style
}
```
#### 4. Query Expansion
Automatic query expansion using:
- Synonyms
- LLM-generated alternatives
- Historical successful queries
```rust
pub struct QueryExpander {
    llm: Box<dyn LLMClient>,
    synonym_db: Option<SynonymDatabase>,
}

impl QueryExpander {
    pub async fn expand(&self, query: &str) -> Vec<String> {
        // Generate alternative phrasings via the LLM, synonyms, and history
        todo!("proposed API, not yet implemented")
    }
}
```
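Hypothetical downstream use: retrieve for the original query plus each expansion, then merge before reranking. `vector_store.search` and the result fields are illustrative, not the project's actual `VectorStore` API.
```rust
// Illustrative only: every name below stands in for the eventual API.
let query = "rotate api keys";
let mut queries = vec![query.to_string()];
queries.extend(expander.expand(query).await);

let mut results = Vec::new();
for q in &queries {
    results.extend(vector_store.search(q, 10).await?);
}

// Deduplicate by document id before reranking.
results.sort_by(|a, b| a.id.cmp(&b.id));
results.dedup_by(|a, b| a.id == b.id);
```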
---
## LLM Client Connection Pooling (DIR-44)
### Status: ✅ Implemented
Connection pooling for LLM clients has been implemented to enable efficient connection reuse across requests, reducing latency and resource consumption.
### Implementation Details
**Location**: `src/llm/pool.rs`
**Features**:
- ✅ Pool management with configurable max connections per provider
- ✅ Connection reuse across requests via `PooledClientGuard` RAII pattern
- ✅ Health checking for stale connections (idle timeout + max lifetime)
- ✅ Graceful cleanup on shutdown
- ✅ Background cleanup task for removing stale connections
- ✅ Thread-safe with `parking_lot::RwLock` and `tokio::sync::Semaphore`
- ✅ Builder pattern for pool configuration
### Usage
```rust
use ares::llm::pool::{ClientPool, ClientPoolBuilder, PoolConfig};
use ares::llm::Provider;
// Create a pool with custom configuration
let pool = ClientPoolBuilder::new()
    .config(
        PoolConfig::default()
            .with_max_connections(10)
            .with_idle_timeout(Duration::from_secs(300)),
    )
    .provider("openai", openai_provider)
    .provider("anthropic", anthropic_provider)
    .build_arc();
// Start background cleanup task (optional)
let cleanup_handle = pool.start_cleanup_task();
// Get a pooled client - automatically returned when guard is dropped
let guard = pool.get("openai").await?;
let response = guard.generate("Hello!").await?;
// Check pool statistics
let stats = pool.stats();
println!("Available: {}, In Use: {}", stats.total_available, stats.total_in_use);
// Graceful shutdown
pool.shutdown();
```
### Configuration
```rust
use ares::llm::pool::PoolConfig;
let config = PoolConfig {
    max_connections_per_provider: 10,               // Max clients per provider
    min_idle_connections: 2,                        // Minimum idle clients to maintain
    idle_timeout: Duration::from_secs(300),         // 5 min idle timeout
    max_lifetime: Duration::from_secs(1800),        // 30 min max lifetime
    health_check_interval: Duration::from_secs(60), // Cleanup interval
    acquire_timeout: Duration::from_secs(30),       // Timeout for acquiring a client
    enable_health_check: true,
};
```
### Implementation Location
- `src/llm/pool.rs` - Core pooling implementation
- `src/llm/mod.rs` - Public exports
### References
- [DIR-44 Implementation](../src/llm/pool.rs)
---
## Implementation Priority
When resources become available, implement in this order:
1. ~~**Embedding Cache** (High impact, moderate effort)~~ - **DONE**
2. ~~**LLM Client Pooling (DIR-44)** (High impact, moderate effort)~~ - **DONE**
3. **GPU Acceleration** (High impact, high effort)
4. **Advanced Search** (Medium impact, varies)
5. **Migration Utility** (Low priority unless requested)
6. **AI-Native Protocols** (Pending standardization)
---
## Contributing
If you want to work on any of these features:
1. Check the stub code locations mentioned above
2. Review the implementation considerations
3. Open an issue to discuss approach before starting
4. Follow the existing patterns in the codebase
See [CONTRIBUTING.md](../CONTRIBUTING.md) for development guidelines.
---
**Last Updated**: 2026-02-03
**Related**: [DIR-24 Implementation Plan](./DIR-24_RAG_IMPLEMENTATION_PLAN.md)