<div align="center">
<h1>NextPlaid</h1>
<p>A Rust library for multi-vector search using the PLAID algorithm.<br/>
Memory-mapped, quantized, incremental. Powers <a href="../colgrep">ColGREP</a> and the <a href="../next-plaid-api">NextPlaid API</a>.</p>
<p>
<a href="#quick-start"><b>Quick Start</b></a>
·
<a href="#architecture"><b>Architecture</b></a>
·
<a href="#api-reference"><b>API Reference</b></a>
·
<a href="#filtering"><b>Filtering</b></a>
·
<a href="#cuda-acceleration"><b>CUDA</b></a>
</p>
</div>
---
## Quick Start
Add to `Cargo.toml`:
```toml
[dependencies]
next-plaid = "1.0"
```
Create an index, add documents, search:
```rust
use next_plaid::{MmapIndex, IndexConfig, UpdateConfig, SearchParameters};
use ndarray::Array2;
// Each document is a 2D array [num_tokens, embedding_dim]
let embeddings: Vec<Array2<f32>> = encode_documents(&docs);
// Create index (or update if it already exists)
let index_config = IndexConfig { nbits: 4, ..Default::default() };
let update_config = UpdateConfig::default();
let (index, doc_ids) = MmapIndex::update_or_create(
&embeddings, "/path/to/index", &index_config, &update_config,
)?;
// Search
let query: Array2<f32> = encode_query("search text");
let params = SearchParameters { top_k: 10, ..Default::default() };
let results = index.search(&query, ¶ms, None)?;
for (id, score) in results.passage_ids.iter().zip(results.scores.iter()) {
println!("Doc {}: {:.4}", id, score);
}
```
---
## Why Multi-Vector?
Standard vector search collapses a document into **one** embedding. That's lossy. Multi-vector search (ColBERT) keeps **one embedding per token** (~300 vectors per document, dim 128). At query time, each query token finds its best match across all document tokens (**MaxSim**). This preserves fine-grained information that single-vector models lose.
The trade-off is storage. NextPlaid solves this with product quantization (2-bit or 4-bit) and memory-mapped indices, making million-document collections practical on a single machine.
---
## Architecture
```mermaid
flowchart TD
A["Document Embeddings\n[num_tokens, dim] per doc"] --> B["K-Means Clustering"]
B --> C["Centroid Assignment\n+ Residual Computation"]
C --> D["Product Quantization\n2-bit or 4-bit"]
D --> E["Memory-Mapped Index\nIVF + Codes + Residuals"]
Q["Query Embedding"] --> F["IVF Probing\nTop centroids per token"]
F --> G["Candidate Retrieval"]
G --> H["Approximate Scoring\nCentroid MaxSim"]
H --> I["Exact Re-ranking\nDecompress + Full MaxSim"]
I --> J["Top-K Results"]
E --> G
style A fill:#4a90d9,stroke:#357abd,color:#fff
style B fill:#50b86c,stroke:#3d9956,color:#fff
style C fill:#50b86c,stroke:#3d9956,color:#fff
style D fill:#50b86c,stroke:#3d9956,color:#fff
style E fill:#e8913a,stroke:#d07a2e,color:#fff
style Q fill:#4a90d9,stroke:#357abd,color:#fff
style F fill:#9b59b6,stroke:#8445a0,color:#fff
style G fill:#9b59b6,stroke:#8445a0,color:#fff
style H fill:#9b59b6,stroke:#8445a0,color:#fff
style I fill:#9b59b6,stroke:#8445a0,color:#fff
style J fill:#9b59b6,stroke:#8445a0,color:#fff
```
### Indexing Pipeline
1. **K-Means clustering** on all token embeddings to find centroids (IVF codebook)
2. **Assign** each token to its nearest centroid, compute **residual** (difference)
3. **Quantize** residuals with product quantization (2-bit: 4 buckets, 4-bit: 16 buckets)
4. **Write** IVF posting lists, codes, and residuals as memory-mapped NPY files
5. Optionally store document **metadata** in a co-located SQLite database
### Search Pipeline
1. **IVF probing** — Score query tokens against centroids, select top `n_ivf_probe` centroids per token
2. **Candidate retrieval** — Collect document IDs from selected posting lists
3. **Approximate scoring** — MaxSim using centroid vectors (fast, coarse)
4. **Re-ranking** — Decompress top `n_full_scores` candidates, compute exact ColBERT MaxSim
5. **Return** top `top_k` results with scores
MaxSim scoring is SIMD-accelerated (AVX2 on x86_64, NEON on ARM) and optionally BLAS-accelerated via Apple Accelerate or OpenBLAS.
### Update Modes
Incremental updates use three strategies depending on index size:
| **Rebuild** | `num_docs <= start_from_scratch` (default: 999) | Load existing embeddings + new, full K-means rebuild |
| **Buffer** | `new_docs < buffer_size` (default: 100) | Assign to existing centroids, buffer for later |
| **Expand** | `new_docs >= buffer_size` | Find outlier embeddings, expand centroids via K-means, re-index buffer + new |
---
## Installation
### Feature Flags
| *(default)* | All | Pure Rust, no external BLAS |
| `accelerate` | macOS | Apple Accelerate for BLAS (recommended on M-series) |
| `openblas` | Linux | OpenBLAS for BLAS |
| `cuda` | Linux/Windows | CUDA acceleration for K-means and MaxSim scoring |
```toml
# macOS (recommended)
[dependencies]
next-plaid = { version = "1.0", features = ["accelerate"] }
# Linux with OpenBLAS
[dependencies]
next-plaid = { version = "1.0", features = ["openblas"] }
# Linux with CUDA + OpenBLAS
[dependencies]
next-plaid = { version = "1.0", features = ["cuda", "openblas"] }
```
<details>
<summary><b>OpenBLAS system setup</b></summary>
```bash
# Debian/Ubuntu
sudo apt install libopenblas-dev
# Fedora/RHEL
sudo dnf install openblas-devel
# Arch
sudo pacman -S openblas
```
</details>
---
## API Reference
### `MmapIndex`
The primary interface. Loads index files as memory-mapped arrays for low RAM usage.
```rust
// Load existing index
let index = MmapIndex::load("/path/to/index")?;
// Create or update (creates if missing, appends if exists)
let (index, doc_ids) = MmapIndex::update_or_create(
&embeddings, "/path/to/index", &index_config, &update_config,
)?;
// Search (single query)
let results = index.search(&query, ¶ms, None)?;
// Search (batch)
let results = index.search_batch(&queries, ¶ms, true, None)?;
// Search within a subset of documents
let results = index.search(&query, ¶ms, Some(&subset_ids))?;
// Add documents to existing index
let new_ids = index.update(&new_embeddings, &update_config)?;
// Add documents with metadata
let new_ids = index.update_with_metadata(
&new_embeddings, &update_config, Some(&metadata),
)?;
// Delete documents
let deleted = index.delete(&[5, 10, 15])?;
// Reconstruct embeddings from compressed storage
let embeddings = index.reconstruct(&[0, 1, 2])?;
let single = index.reconstruct_single(0)?;
// Accessors
index.num_documents();
index.num_embeddings();
index.num_partitions();
index.avg_doclen();
index.embedding_dim();
```
### `IndexConfig`
Controls index creation.
```rust
IndexConfig {
nbits: 4, // Quantization bits (2 or 4)
batch_size: 50_000, // Documents per indexing chunk
seed: Some(42), // Random seed for K-means
kmeans_niters: 4, // K-means iterations
max_points_per_centroid: 256,// K-means parameter
n_samples_kmeans: None, // Auto: min(1 + 16*sqrt(120*N), N)
start_from_scratch: 999, // Rebuild threshold
force_cpu: false, // Force CPU K-means (skip CUDA)
}
```
### `SearchParameters`
Controls search behavior.
```rust
SearchParameters {
top_k: 10, // Results to return
n_full_scores: 4096, // Candidates for exact re-ranking
n_ivf_probe: 8, // IVF cells to probe per query token
centroid_batch_size: 100_000, // Batch size for centroid scoring
centroid_score_threshold: Some(0.4), // Prune low-scoring centroids
batch_size: 2000, // Query batch size
}
```
### `UpdateConfig`
Controls incremental updates.
```rust
UpdateConfig {
batch_size: 50_000, // Documents per chunk
kmeans_niters: 4, // K-means iterations for centroid expansion
max_points_per_centroid: 256,
n_samples_kmeans: None,
seed: 42,
start_from_scratch: 999, // Below this, rebuild from scratch
buffer_size: 100, // Buffer before centroid expansion
force_cpu: false,
}
```
### `QueryResult`
Search results container.
```rust
pub struct QueryResult {
pub query_id: usize,
pub passage_ids: Vec<i64>, // Document IDs (ranked by score)
pub scores: Vec<f32>, // Corresponding scores
}
pub type SearchResult = QueryResult;
```
### `Metadata`
Index metadata, persisted as `metadata.json`.
```rust
pub struct Metadata {
pub num_chunks: usize,
pub nbits: usize,
pub num_partitions: usize,
pub num_embeddings: usize,
pub avg_doclen: f64,
pub num_documents: usize,
pub embedding_dim: usize,
pub next_plaid_compatible: bool,
}
```
### `ResidualCodec`
Quantization codec for compression and decompression.
```rust
// Load codec from index directory
let codec = ResidualCodec::load_from_dir(&index_path)?;
let codec = ResidualCodec::load_mmap_from_dir(&index_path)?; // memory-mapped centroids
// Compress embeddings to codes (nearest centroid assignments)
let codes = codec.compress_into_codes(&embeddings);
// Quantize residuals
let quantized = codec.quantize_residuals(&residuals)?;
// Decompress back to approximate embeddings
let reconstructed = codec.decompress(&packed, &codes)?;
// Accessors
codec.embedding_dim();
codec.num_centroids();
codec.centroids_view();
```
---
## Filtering
SQLite-based metadata filtering via the `filtering` module. Store arbitrary JSON metadata alongside your index, query with SQL WHERE clauses.
```rust
use next_plaid::filtering;
use serde_json::json;
// Create metadata database alongside index
let metadata = vec![
json!({"title": "Doc A", "category": "tech", "score": 95}),
json!({"title": "Doc B", "category": "science", "score": 87}),
];
filtering::create("/path/to/index", &metadata, &[0, 1])?;
// Append more metadata
filtering::update("/path/to/index", &new_metadata, &new_doc_ids)?;
// Query by SQL condition (parameterized, injection-safe)
let ids = filtering::where_condition(
"/path/to/index",
"category = ? AND score > ?",
&[json!("tech"), json!(90)],
)?;
// Query with REGEXP support
let ids = filtering::where_condition_regexp(
"/path/to/index",
"title REGEXP ?",
&[json!("^Doc")],
)?;
// Get full metadata rows
let rows = filtering::get("/path/to/index", None, &[], Some(&[0, 1]))?;
// Count documents
let count = filtering::count("/path/to/index")?;
// Delete and re-index
filtering::delete("/path/to/index", &[5, 10])?;
// Use in search: filter first, then search within subset
let subset = filtering::where_condition("/path/to/index", "category = ?", &[json!("tech")])?;
let results = index.search(&query, ¶ms, Some(&subset))?;
```
SQL conditions are validated with a recursive descent parser that whitelists safe operators and prevents injection.
---
## CUDA Acceleration
When built with the `cuda` feature, NextPlaid automatically uses GPU acceleration for:
- **K-means clustering** during index creation and centroid expansion
- **MaxSim scoring** during search (for large enough matrices)
No code changes needed. CUDA falls back gracefully to CPU on failure.
```toml
[dependencies]
next-plaid = { version = "1.0", features = ["cuda"] }
```
The CUDA module uses cuBLAS for matrix multiplication and custom PTX kernels for argmax operations. A global `CudaContext` is lazily initialized on first use.
> **Tip:** First CUDA context creation can take 10-30s. Enable GPU persistence mode to reduce this: `sudo nvidia-smi -pm 1`
---
## Index File Structure
```
index_directory/
metadata.json # Index metadata (num_docs, nbits, partitions, etc.)
centroids.npy # Centroid embeddings [K, dim]
avg_residual.npy # Average residual per dimension
bucket_cutoffs.npy # Quantization boundaries
bucket_weights.npy # Reconstruction values
cluster_threshold.npy # Outlier detection threshold
ivf.npy # Inverted file (doc IDs per centroid)
ivf_lengths.npy # Length of each IVF posting list
plan.json # Indexing plan
merged_codes.npy # Memory-mapped centroid codes (auto-merged)
merged_residuals.npy # Memory-mapped quantized residuals (auto-merged)
metadata.db # SQLite metadata database (optional)
# Per-chunk files (merged into merged_*.npy on load):
0.codes.npy # Centroid assignments for chunk 0
0.residuals.npy # Quantized residuals for chunk 0
0.metadata.json # Chunk metadata
doclens.0.json # Document lengths for chunk 0
```
---
## Modules
| `filtering` | 1,896 | SQLite metadata storage, SQL condition validation, REGEXP support |
| `mmap` | 1,779 | Memory-mapped NPY/raw arrays, merge-on-load, file locking |
| `index` | 1,389 | Index creation, `MmapIndex`, `IndexConfig`, `Metadata` |
| `update` | 977 | Incremental updates, buffer/expand strategies |
| `cuda` | 769 | CUDA context, cuBLAS MatMul, PTX argmax kernel |
| `codec` | 701 | Residual quantization, compress/decompress, lookup tables |
| `search` | 714 | IVF probing, candidate retrieval, approximate + exact scoring |
| `delete` | 540 | Document deletion, IVF rebuild |
| `kmeans` | 475 | K-means clustering, centroid computation, partition estimation |
| `maxsim` | 443 | SIMD MaxSim (AVX2/NEON), BLAS matrix multiply |
| `utils` | 237 | Quantile computation, array utilities |
| `embeddings` | 137 | Embedding reconstruction from compressed storage |
| `error` | 66 | Error types |
---
## Dependencies
| `ndarray` | N-dimensional arrays |
| `rayon` | Parallelism |
| `memmap2` | Memory-mapped files |
| `ndarray-npy` | NPY file I/O |
| `fastkmeans-rs` | K-means clustering |
| `rusqlite` | SQLite (bundled) |
| `half` | Float16 support |
| `regex` | REGEXP filtering |
| `cudarc` | CUDA bindings (optional) |
| `serde` / `serde_json` | Serialization |
| `thiserror` | Error handling |
---
## License
Apache-2.0