## Quick Start

Add to `Cargo.toml`:

```toml
[dependencies]
nextplaid = "1.0"
```
Create an index, add documents, search:
```rust
// NOTE: argument lists below are sketched; see the API reference for details.
use nextplaid::{IndexConfig, MmapIndex, SearchParameters, UpdateConfig};
use ndarray::Array2;

// Each document is a 2D array [num_tokens, embedding_dim];
// `encode_documents` / `encode_query` stand in for your embedding model.
let embeddings: Vec<Array2<f32>> = encode_documents(&docs);

// Create index (or update if it already exists)
let index_config = IndexConfig { nbits: 4, ..Default::default() }; // 2-bit or 4-bit quantization
let update_config = UpdateConfig::default();
let (index, _ids) =
    MmapIndex::update_or_create("index_dir", &embeddings, &index_config, &update_config)?;

// Search
let query: Array2<f32> = encode_query("my question");
let params = SearchParameters::default();
let results = index.search(&query, &params, None)?;
for (id, score) in results.passage_ids.iter().zip(results.scores.iter()) {
    println!("doc {id}: score {score:.3}");
}
```
## Why Multi-Vector?
Standard vector search collapses a document into one embedding. That's lossy. Multi-vector search (ColBERT) keeps one embedding per token (~300 vectors per document, dim 128). At query time, each query token finds its best match across all document tokens (MaxSim). This preserves fine-grained information that single-vector models lose.
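To make MaxSim concrete, here is a minimal sketch of the scoring rule in plain Rust with `ndarray` (the idea, not NextPlaid's internal implementation):

```rust
use ndarray::Array2;

/// ColBERT MaxSim: for every query token, take the best dot-product
/// against all document tokens, then sum over query tokens.
fn maxsim(query: &Array2<f32>, doc: &Array2<f32>) -> f32 {
    query
        .rows()
        .into_iter()
        .map(|q| {
            doc.rows()
                .into_iter()
                .map(|d| q.dot(&d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```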
The trade-off is storage. NextPlaid solves this with product quantization (2-bit or 4-bit) and memory-mapped indices, making million-document collections practical on a single machine.
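Back-of-envelope with the figures above: 1M documents × 300 tokens × 128 dims stored as f32 is roughly 154 GB of raw embeddings. At 2 bits per dimension, each token's residual packs into 32 bytes; assuming one 4-byte centroid code per token, that comes to about 11 GB, a ~14× reduction (4-bit doubles the residual bytes).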
## Architecture

```mermaid
flowchart TD
    A["Document Embeddings\n[num_tokens, dim] per doc"] --> B["K-Means Clustering"]
    B --> C["Centroid Assignment\n+ Residual Computation"]
    C --> D["Product Quantization\n2-bit or 4-bit"]
    D --> E["Memory-Mapped Index\nIVF + Codes + Residuals"]
    Q["Query Embedding"] --> F["IVF Probing\nTop centroids per token"]
    F --> G["Candidate Retrieval"]
    G --> H["Approximate Scoring\nCentroid MaxSim"]
    H --> I["Exact Re-ranking\nDecompress + Full MaxSim"]
    I --> J["Top-K Results"]
    E --> G
    style A fill:#4a90d9,stroke:#357abd,color:#fff
    style B fill:#50b86c,stroke:#3d9956,color:#fff
    style C fill:#50b86c,stroke:#3d9956,color:#fff
    style D fill:#50b86c,stroke:#3d9956,color:#fff
    style E fill:#e8913a,stroke:#d07a2e,color:#fff
    style Q fill:#4a90d9,stroke:#357abd,color:#fff
    style F fill:#9b59b6,stroke:#8445a0,color:#fff
    style G fill:#9b59b6,stroke:#8445a0,color:#fff
    style H fill:#9b59b6,stroke:#8445a0,color:#fff
    style I fill:#9b59b6,stroke:#8445a0,color:#fff
    style J fill:#9b59b6,stroke:#8445a0,color:#fff
```
## Indexing Pipeline
- K-Means clustering on all token embeddings to find centroids (IVF codebook)
- Assign each token to its nearest centroid, compute residual (difference)
- Quantize residuals with product quantization (2-bit: 4 buckets, 4-bit: 16 buckets)
- Write IVF posting lists, codes, and residuals as memory-mapped NPY files
- Optionally store document metadata in a co-located SQLite database
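A toy illustration of the quantization step, assuming the bucket-cutoff/bucket-weight scheme implied by `bucket_cutoffs.npy` and `bucket_weights.npy` (all values made up):

```rust
/// 2-bit quantization of one residual dimension: three cutoffs split the
/// value range into 4 buckets; each bucket reconstructs to a fixed weight.
fn bucket_of(x: f32, cutoffs: &[f32]) -> usize {
    cutoffs.iter().take_while(|&&c| x >= c).count()
}

fn main() {
    let cutoffs = [-0.05_f32, 0.0, 0.05];          // illustrative boundaries
    let weights = [-0.09_f32, -0.02, 0.02, 0.09];  // illustrative reconstruction values
    let residual = 0.03_f32;
    let code = bucket_of(residual, &cutoffs); // -> bucket 2
    let approx = weights[code];               // -> 0.02, the dequantized residual
    println!("code {code}, reconstructed {approx}");
}
```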
## Search Pipeline
- **IVF probing** — Score query tokens against centroids, select top `n_ivf_probe` centroids per token
- **Candidate retrieval** — Collect document IDs from selected posting lists
- **Approximate scoring** — MaxSim using centroid vectors (fast, coarse)
- **Re-ranking** — Decompress top `n_full_scores` candidates, compute exact ColBERT MaxSim
- **Return** — Top `top_k` results with scores
MaxSim scoring is SIMD-accelerated (AVX2 on x86_64, NEON on ARM) and optionally BLAS-accelerated via Apple Accelerate or OpenBLAS.
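The BLAS path works because MaxSim factors into a single matrix multiply followed by a row-wise max; a sketch of that formulation (again illustrative, not the crate's internals):

```rust
use ndarray::Array2;

/// MaxSim via one [q_tokens, d_tokens] similarity matrix:
/// a single GEMM, then a max over each row and a sum.
fn maxsim_matmul(query: &Array2<f32>, doc: &Array2<f32>) -> f32 {
    let sims = query.dot(&doc.t()); // the part BLAS accelerates
    sims.rows()
        .into_iter()
        .map(|row| row.fold(f32::NEG_INFINITY, |m, &x| m.max(x)))
        .sum()
}
```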
## Update Modes
Incremental updates use three strategies depending on index size:
| Mode | Condition | Behavior |
|---|---|---|
| Rebuild | `num_docs <= start_from_scratch` (default: 999) | Load existing embeddings + new, full K-means rebuild |
| Buffer | `new_docs < buffer_size` (default: 100) | Assign to existing centroids, buffer for later |
| Expand | `new_docs >= buffer_size` | Find outlier embeddings, expand centroids via K-means, re-index buffer + new |
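The same decision logic as the table, written out (a sketch; the struct below is a stand-in for the crate's `UpdateConfig`, with the defaults quoted above):

```rust
enum UpdateMode { Rebuild, Buffer, Expand }

struct UpdateCfg { start_from_scratch: usize, buffer_size: usize } // stand-in for UpdateConfig

fn choose_mode(num_docs: usize, new_docs: usize, cfg: &UpdateCfg) -> UpdateMode {
    if num_docs <= cfg.start_from_scratch {
        UpdateMode::Rebuild // small index: cheapest to rebuild outright
    } else if new_docs < cfg.buffer_size {
        UpdateMode::Buffer  // few new docs: assign to existing centroids
    } else {
        UpdateMode::Expand  // large batch: grow the centroid set
    }
}
```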
## Installation

### Feature Flags

| Feature | Platform | Description |
|---|---|---|
| (default) | All | Pure Rust, no external BLAS |
| `accelerate` | macOS | Apple Accelerate for BLAS (recommended on M-series) |
| `openblas` | Linux | OpenBLAS for BLAS |
| `cuda` | Linux/Windows | CUDA acceleration for K-means and MaxSim scoring |
```toml
# macOS (recommended)
[dependencies]
nextplaid = { version = "1.0", features = ["accelerate"] }

# Linux with OpenBLAS
[dependencies]
nextplaid = { version = "1.0", features = ["openblas"] }

# Linux with CUDA + OpenBLAS
[dependencies]
nextplaid = { version = "1.0", features = ["cuda", "openblas"] }
```
The `openblas` feature needs the system OpenBLAS library; the usual packages:

```sh
# Debian/Ubuntu
sudo apt install libopenblas-dev

# Fedora/RHEL
sudo dnf install openblas-devel

# Arch
sudo pacman -S openblas
```
## API Reference

### MmapIndex
The primary interface. Loads index files as memory-mapped arrays for low RAM usage.
```rust
// Method names below are the crate's; exact argument lists are sketched
// and may differ from the real signatures.

// Load existing index
let index = MmapIndex::load("index_dir")?;

// Create or update (creates if missing, appends if exists)
let (index, new_ids) =
    MmapIndex::update_or_create("index_dir", &embeddings, &index_config, &update_config)?;

// Search (single query)
let results = index.search(&query, &params, None)?;

// Search (batch)
let results = index.search_batch(&queries, &params, None)?;

// Search within a subset of documents
let results = index.search(&query, &params, Some(&subset))?;

// Add documents to existing index
let new_ids = index.update(&more_embeddings, &update_config)?;

// Add documents with metadata
let new_ids = index.update_with_metadata(&more_embeddings, &metadata, &update_config)?;

// Delete documents
let deleted = index.delete(&doc_ids)?;

// Reconstruct embeddings from compressed storage
let embeddings = index.reconstruct(&doc_ids)?;
let single = index.reconstruct_single(doc_id)?;

// Accessors
index.num_documents();
index.num_embeddings();
index.num_partitions();
index.avg_doclen();
index.embedding_dim();
```
### IndexConfig

Controls index creation.

```rust
// Sketch; `nbits` is the knob discussed in this README (2 or 4),
// other fields are left at their defaults.
let index_config = IndexConfig {
    nbits: 4,
    ..Default::default()
};
```
### SearchParameters

Controls search behavior.

```rust
// Sketch using the parameters named in the search pipeline above;
// values are illustrative.
let params = SearchParameters {
    n_ivf_probe: 8,      // centroids probed per query token
    n_full_scores: 4096, // candidates decompressed for exact re-ranking
    top_k: 10,           // results returned
    ..Default::default()
};
```
### UpdateConfig

Controls incremental updates.

```rust
// Sketch; defaults quoted from the update-modes table above.
let update_config = UpdateConfig {
    start_from_scratch: 999, // full rebuild at or below this many docs
    buffer_size: 100,        // buffered docs before centroid expansion
    ..Default::default()
};
```
### QueryResult

Search results container; it carries the matched `passage_ids` and their scores, as used in the Quick Start loop.

```rust
pub type SearchResult = QueryResult;
```
### Metadata

Index metadata, persisted as `metadata.json`.
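A sketch of reading it, with the subset of fields this README names (`num_docs`, `nbits`, `partitions`); the struct is hypothetical (named `IndexMeta` to avoid clashing with the crate's `Metadata` type) and the real file contains more:

```rust
use serde::Deserialize;

// Hypothetical reader for metadata.json; field names come from the
// file-structure section below, the struct itself is illustrative.
#[derive(Deserialize)]
struct IndexMeta {
    num_docs: usize,
    nbits: u8,
    partitions: usize,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("index_directory/metadata.json")?;
    let meta: IndexMeta = serde_json::from_str(&raw)?;
    println!("{} docs, {}-bit codes, {} partitions", meta.num_docs, meta.nbits, meta.partitions);
    Ok(())
}
```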
### ResidualCodec

Quantization codec for compression and decompression.

```rust
// Method names are the crate's; argument lists are sketched.

// Load codec from index directory
let codec = ResidualCodec::load_from_dir("index_dir")?;
let codec = ResidualCodec::load_mmap_from_dir("index_dir")?; // memory-mapped centroids

// Compress embeddings to codes (nearest centroid assignments)
let codes = codec.compress_into_codes(&embeddings);

// Quantize residuals
let quantized = codec.quantize_residuals(&residuals)?;

// Decompress back to approximate embeddings
let reconstructed = codec.decompress(&codes, &quantized)?;

// Accessors
codec.embedding_dim();
codec.num_centroids();
codec.centroids_view();
```
### Filtering

SQLite-based metadata filtering via the `filtering` module. Store arbitrary JSON metadata alongside your index and query it with SQL `WHERE` clauses.
```rust
// Paths, JSON fields, and argument lists below are illustrative;
// the function names are the module's.
use nextplaid::filtering;
use serde_json::json;

// Create metadata database alongside index
let metadata = vec![
    json!({ "source": "wiki", "year": 2021 }),
    json!({ "source": "news", "year": 2024 }),
];
filtering::create("index_dir", &metadata)?;

// Append more metadata
filtering::update("index_dir", &more_metadata)?;

// Query by SQL condition (parameterized, injection-safe)
let ids = filtering::where_condition("index_dir", "source = ?", &["wiki"])?;

// Query with REGEXP support
let ids = filtering::where_condition_regexp("index_dir", "source REGEXP ?", &["^wik"])?;

// Get full metadata rows
let rows = filtering::get("index_dir", &ids)?;

// Count documents
let count = filtering::count("index_dir")?;

// Delete and re-index
filtering::delete("index_dir", &ids)?;

// Use in search: filter first, then search within subset
let subset = filtering::where_condition("index_dir", "source = ?", &["wiki"])?;
let results = index.search(&query, &params, Some(&subset))?;
```
SQL conditions are validated with a recursive descent parser that whitelists safe operators and prevents injection.
## CUDA Acceleration

When built with the `cuda` feature, NextPlaid automatically uses GPU acceleration for:
- K-means clustering during index creation and centroid expansion
- MaxSim scoring during search (for large enough matrices)
No code changes needed. CUDA falls back gracefully to CPU on failure.
```toml
[dependencies]
nextplaid = { version = "1.0", features = ["cuda"] }
```
The CUDA module uses cuBLAS for matrix multiplication and custom PTX kernels for argmax operations. A global CudaContext is lazily initialized on first use.
Tip: First CUDA context creation can take 10-30 s. Enable GPU persistence mode to reduce this:

```sh
sudo nvidia-smi -pm 1
```
## Index File Structure

```
index_directory/
    metadata.json            # Index metadata (num_docs, nbits, partitions, etc.)
    centroids.npy            # Centroid embeddings [K, dim]
    avg_residual.npy         # Average residual per dimension
    bucket_cutoffs.npy       # Quantization boundaries
    bucket_weights.npy       # Reconstruction values
    cluster_threshold.npy    # Outlier detection threshold
    ivf.npy                  # Inverted file (doc IDs per centroid)
    ivf_lengths.npy          # Length of each IVF posting list
    plan.json                # Indexing plan
    merged_codes.npy         # Memory-mapped centroid codes (auto-merged)
    merged_residuals.npy     # Memory-mapped quantized residuals (auto-merged)
    metadata.db              # SQLite metadata database (optional)

    # Per-chunk files (merged into merged_*.npy on load):
    0.codes.npy              # Centroid assignments for chunk 0
    0.residuals.npy          # Quantized residuals for chunk 0
    0.metadata.json          # Chunk metadata
    doclens.0.json           # Document lengths for chunk 0
```
## Modules

| Module | Lines | Description |
|---|---|---|
| `filtering` | 1,896 | SQLite metadata storage, SQL condition validation, REGEXP support |
| `mmap` | 1,779 | Memory-mapped NPY/raw arrays, merge-on-load, file locking |
| `index` | 1,389 | Index creation, `MmapIndex`, `IndexConfig`, `Metadata` |
| `update` | 977 | Incremental updates, buffer/expand strategies |
| `cuda` | 769 | CUDA context, cuBLAS MatMul, PTX argmax kernel |
| `codec` | 701 | Residual quantization, compress/decompress, lookup tables |
| `search` | 714 | IVF probing, candidate retrieval, approximate + exact scoring |
| `delete` | 540 | Document deletion, IVF rebuild |
| `kmeans` | 475 | K-means clustering, centroid computation, partition estimation |
| `maxsim` | 443 | SIMD MaxSim (AVX2/NEON), BLAS matrix multiply |
| `utils` | 237 | Quantile computation, array utilities |
| `embeddings` | 137 | Embedding reconstruction from compressed storage |
| `error` | 66 | Error types |
## Dependencies

| Crate | Purpose |
|---|---|
| `ndarray` | N-dimensional arrays |
| `rayon` | Parallelism |
| `memmap2` | Memory-mapped files |
| `ndarray-npy` | NPY file I/O |
| `fastkmeans-rs` | K-means clustering |
| `rusqlite` | SQLite (bundled) |
| `half` | Float16 support |
| `regex` | REGEXP filtering |
| `cudarc` | CUDA bindings (optional) |
| `serde` / `serde_json` | Serialization |
| `thiserror` | Error handling |
## License
Apache-2.0