VTPL — Vector-Threaded Posting Lists
Fused text + vector search in a single index pass. Parallel indexing, smart multi-level caching.
The idea
Traditional hybrid search runs two separate lookups — a text index and a vector index — then merges results. VTPL embeds PQ-compressed vectors directly inside posting list entries, so text matching and semantic scoring happen in one pass with no second index traversal.
Traditional: trigram "con" → [doc_1, doc_7, doc_42]
VTPL: trigram "con" → [(doc_1, pq₁), (doc_7, pq₇), (doc_42, pq₄₂)]
└── 32-byte PQ-compressed embedding
Score(chunk) = λ · PQ_cosine(q, chunk) + (1 - λ) · pattern_match_score(chunk)
Benchmarks
All benchmarks on real data: AG News articles + all-MiniLM-L6-v2 embeddings (384-dim). Release mode.
Indexing speed
| Documents | Sequential | Parallel (rayon) | Speedup |
|---|---|---|---|
| 1,000 | 54.8 ms | 25.9 ms | 2.1x |
| 10,000 | 706 ms | 352 ms | 2.0x |
Query speed (200 iterations)
| Documents | Fused | Cached fused | Text-only | Vector-only |
|---|---|---|---|---|
| 1,000 | 204 µs | 177 µs (1.2x) | 180 µs | 715 µs |
| 10,000 | 1,902 µs | 1,816 µs | 1,642 µs | 7,805 µs |
Fused is 3.5–4.3x faster than vector-only because it only scans posting lists matching the query trigrams instead of the full index.
Smart cache hit rates (steady state)
| Layer | 1k docs | 10k docs |
|---|---|---|
| Trigram cache | 95% | 95% |
| Word cache | 94% | 95% |
| Semantic cache | 91% | 91% |
The cache doesn't require identical queries — "concurrent hash" and "concurrent programming" share all "concurrent" trigrams. Similar embeddings (same quantization bucket) reuse per-chunk cosine scores.
PQ compression quality
Ground truth = top-k by exact cosine similarity on the raw 384-dim embeddings.
| Documents | PQ Recall@10 | PQ NDCG@10 |
|---|---|---|
| 1,000 | 76.5% | 88.3% |
| 10,000 | 74.5% | 85.5% |
PQ compression into 32 bytes preserves most ranking quality. At larger scale quantization noise grows — higher-dim embeddings or larger corpora may need a bigger PQ budget.
Retrieval quality (Recall@10 vs ground truth)
| Method | 1k docs | 10k docs |
|---|---|---|
| Text-only | 31.0% | 51.5% |
| Fused (λ=0.6) | 54.0% | 64.5% |
| Vector-only | 76.5% | 74.5% |
Fused beats text-only by +74% (1k) and +25% (10k). Vector-only still wins overall on this dataset — the text signal helps when embeddings miss lexical matches, but can add noise on pure-semantic queries. Fused is strongest when queries have both a meaningful text pattern and a semantic intent.
Memory
| Documents | Serialized index | PQ overhead | Total entries |
|---|---|---|---|
| 1,000 | 4.06 MB | 3.11 MB | 101,768 |
| 10,000 | 34.84 MB | 30.39 MB | 995,716 |
32 extra bytes per posting-list entry for the inline PQ code.
When to use VTPL
VTPL is useful when queries have both a text pattern and a semantic component:
- Code search: function name match + semantically related implementations
- Document retrieval: keyword filter + meaning-based ranking
- Log search: pattern grep + "find similar errors"
- E-commerce: product name match + "things like this"
- High-throughput workloads: smart cache gives 95% trigram hit rate on overlapping queries
Don't use VTPL for:
- Pure vector search — use HNSW, it's O(log n) per query
- Pure text search — use a standard inverted index, no PQ overhead needed
- Very large corpora (millions of docs) with high-dim embeddings — the 32-byte PQ budget may not preserve enough ranking quality
Usage
use ;
let codebook = train;
// --- Sequential build ---
let mut index = new;
for in &chunks
index.finalize;
// --- Parallel build (rayon + AtomicU32) ---
let builder = new;
let batch: = /* your docs */;
builder.insert_batch;
let index = builder.build;
// --- Smart cached queries ---
let cached = with_defaults;
// Different queries sharing words reuse cached trigram scans
let r1 = cached.query;
let r2 = cached.query;
// ^ shares all "concurrent" trigrams from cache
println!;
// trigram 95% | word 94% | semantic 91%
// Text-only / vector-only also available
let text_results = cached.inner.text_query;
let vec_results = cached.inner.vector_query;
// Persist to disk
cached.inner.save_to_file?;
let loaded = load_from_file?;
Architecture
src/
├── lib.rs — public API
├── pq.rs — product quantization: codebook training, encode, asymmetric distance tables
├── ngram.rs — character trigram extraction
├── posting.rs — VtplEntry (chunk_id + 32-byte PQ code), PostingList
├── index.rs — VtplIndex: insert, finalize, query, serialization
├── parallel.rs — ParallelBuilder: rayon + DashMap + AtomicU32 for multi-core indexing
└── cache.rs — CachedIndex: trigram/word/semantic 3-level cache with confidence eviction
VtplEntryis#[repr(C)]/ 36 bytes — cache-friendly sequential scans- Asymmetric distance tables: 32 table lookups per candidate, zero float multiplies at query time
- Cosine computed once per chunk on first posting-list encounter
- IDF weighting:
log((N - df + 0.5) / (df + 0.5) + 1) - Parallel indexing: PQ encode + trigram extract across all cores via
rayon, merge intoDashMapwithAtomicU32counters - Smart cache (3 levels):
- Word cache — "concurrent" always decomposes to the same trigrams; reused across queries
- Trigram cache — posting list scan results cached per trigram; overlapping queries share work
- Semantic cache — embeddings quantized into fingerprints; similar vectors reuse cosine scores
- Confidence-based eviction: frequently-accessed entries survive longer
Arc-backed entries: cache hits are near-zero-cost pointer clonesparking_lot::RwLockwith amortized confidence bumps to minimize write contention
Running
# Real-world benchmark (requires Python + sentence-transformers + datasets)
License
MIT