Quantization-aware Search Plan
This module provides a formal runtime plan for vector search that separates policy (what to optimize for) from mechanism (how to execute).
§Architecture
SearchRequest + SLA → Planner → SearchPlan → Executor → Results
                         ↑
              Cost Model + Statistics
§Policy vs Mechanism
Policy (what to optimize):
- Target recall@k (e.g., 0.95)
- Latency budget (e.g., 5ms p99)
- Token/compute budget
Mechanism (how to execute):
- BPS coarse scan parameters
- PQ scoring parameters
- Rerank depth and method
- ef_search value
- Filter evaluation order
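The split above can be sketched in a few lines of Rust. This is an illustration only: `PolicySketch`, `MechanismSketch`, and `plan_for` are hypothetical names, not this module's `SearchSLA`, `SearchPlan`, or `SearchPlanner`, and the multipliers are arbitrary placeholders rather than calibrated values. The point is that the caller states goals, and a single planning function is the only place where goals become execution parameters.

```rust
/// Policy: what the caller wants (hypothetical stand-in for an SLA type).
#[derive(Debug, Clone, Copy)]
pub struct PolicySketch {
    /// Target recall@k, e.g. 0.95.
    pub target_recall: f64,
    /// Latency budget in milliseconds, e.g. 5.0 for a 5 ms p99 budget.
    pub latency_budget_ms: f64,
}

/// Mechanism: the knobs the executor turns (hypothetical stand-in for a plan type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MechanismSketch {
    pub ef_search: usize,
    pub rerank_depth: usize,
}

/// Toy planner: translates policy into mechanism.
/// The thresholds and multipliers below are assumptions chosen for illustration.
pub fn plan_for(policy: PolicySketch, k: usize) -> MechanismSketch {
    // A stricter recall target widens the candidate search.
    let widen = if policy.target_recall >= 0.95 { 8 } else { 4 };
    // A tight latency budget shrinks the rerank stage.
    let tight = policy.latency_budget_ms < 2.0;
    MechanismSketch {
        ef_search: k * widen,
        rerank_depth: if tight { k } else { k * 2 },
    }
}
```

With these placeholder rules, a policy of recall 0.95 and a 5 ms budget at `k = 10` yields `ef_search = 80` and `rerank_depth = 20`. Callers never set those two values directly.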
§Cost Model
The planner uses measured per-stage costs:
cost_bps(N, D) = N × D × c_bps
cost_pq(N, D, M) = N × M × c_pq
cost_rerank(N, D) = N × D × c_f32
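As a worked example, the three formulas can be evaluated for a funnel-shaped pipeline in which each stage sees fewer candidates than the one before. The sketch below assumes a hypothetical `UnitCosts` struct holding the constants `c_bps`, `c_pq`, and `c_f32`. The module's own `CostModel` and `StageCosts` types may be laid out differently, and summing the stages into one total is likewise an assumption.

```rust
/// Hypothetical per-operation cost constants (arbitrary units, e.g. nanoseconds).
/// Field names mirror the formulas; the real `CostModel` may differ.
#[derive(Debug, Clone, Copy)]
pub struct UnitCosts {
    pub c_bps: f64,
    pub c_pq: f64,
    pub c_f32: f64,
}

impl UnitCosts {
    /// cost_bps(N, D) = N × D × c_bps
    pub fn cost_bps(&self, n: usize, d: usize) -> f64 {
        n as f64 * d as f64 * self.c_bps
    }

    /// cost_pq(N, D, M) = N × M × c_pq (D does not appear in the product,
    /// so it is omitted from this signature).
    pub fn cost_pq(&self, n: usize, m: usize) -> f64 {
        n as f64 * m as f64 * self.c_pq
    }

    /// cost_rerank(N, D) = N × D × c_f32
    pub fn cost_rerank(&self, n: usize, d: usize) -> f64 {
        n as f64 * d as f64 * self.c_f32
    }

    /// Sum of the three stages, each with its own candidate count.
    pub fn pipeline_cost(
        &self,
        d: usize,
        m: usize,
        n_coarse: usize,
        n_pq: usize,
        n_rerank: usize,
    ) -> f64 {
        self.cost_bps(n_coarse, d) + self.cost_pq(n_pq, m) + self.cost_rerank(n_rerank, d)
    }
}
```

With made-up constants `c_bps = 0.5`, `c_pq = 2.0`, `c_f32 = 1.0`, dimension `D = 128`, and `M = 16` subquantizers, scanning 1000 vectors, PQ-scoring 100, and reranking 10 costs 64000 + 3200 + 1280 = 68480 units.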
§Optimization
Minimize expected latency subject to:
- recall@k ≥ target_recall
- total_cost ≤ budget
Uses bandit-like adaptation based on recent query statistics.
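The constrained choice described above can be sketched as a filter followed by a minimum. Everything here is illustrative: `Candidate`, `choose_plan`, and `observe_latency` are hypothetical names, and the adaptation step is a plain exponential moving average standing in for whatever bandit-like update the planner actually uses, which the documentation does not specify.

```rust
/// A candidate plan with its current estimates (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub name: &'static str,
    pub expected_latency_ms: f64,
    pub expected_recall: f64,
    pub total_cost: f64,
}

/// Pick the lowest-latency candidate that satisfies both constraints:
/// recall@k ≥ target_recall and total_cost ≤ budget.
/// Returns `None` when no candidate is feasible.
pub fn choose_plan(
    candidates: &[Candidate],
    target_recall: f64,
    budget: f64,
) -> Option<&Candidate> {
    candidates
        .iter()
        .filter(|c| c.expected_recall >= target_recall && c.total_cost <= budget)
        .min_by(|a, b| a.expected_latency_ms.total_cmp(&b.expected_latency_ms))
}

/// Assumed adaptation rule: blend an observed latency into the estimate
/// with weight `alpha` (exponential moving average).
pub fn observe_latency(candidate: &mut Candidate, observed_ms: f64, alpha: f64) {
    candidate.expected_latency_ms =
        (1.0 - alpha) * candidate.expected_latency_ms + alpha * observed_ms;
}
```

Because the estimates are updated from recent queries, the chosen plan can change over time: a candidate that keeps running slower than predicted eventually loses to a feasible alternative.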
Structs§
- CostModel - Cost model parameters (calibrated per hardware).
- DatasetStats - Statistics about the dataset for planning.
- PipelineStage - A single stage in the search pipeline.
- PlanExecutor - Plan executor that runs a search plan.
- SearchPlan - The search plan: a complete specification for executing a search.
- SearchPlanner - Search planner that generates optimal plans.
- SearchSLA - Service Level Agreement for search.
- StageCosts - Per-stage cost measurements.
Enums§
- OptimizationMode - Optimization mode for the planner.
- PlanError - Plan validation errors.
- StageQuantLevel - Quantization level for a pipeline stage.