Ripvec retrieval pipeline ported into Rust.
This subtree mirrors the Python reference implementation at
~/src/semble/src/semble/. Each Rust module corresponds to one
Python source file; the port preserves the ripvec pipeline shape
(chunker → tokenizer → BM25 path-enrichment → static encoder →
RRF hybrid → boosts → penalties → reranker) one-for-one.
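The RRF step in this pipeline fuses the semantic and BM25 rankings by rank rather than by raw score. A minimal sketch of the standard Reciprocal Rank Fusion formula follows. The function name `rrf_fuse`, the `u32` document ids, and the constant `k = 60` used in the usage note are illustrative assumptions, not names or values taken from this crate.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over several ranked lists of document ids.
/// Each list contributes 1 / (k + rank) per document, with 1-based ranks.
/// `rrf_fuse` is a hypothetical name used for illustration only.
pub fn rrf_fuse(rankings: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, &doc) in ranking.iter().enumerate() {
            // `enumerate` is 0-based, so add 1 to get the 1-based rank.
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[semantic_ids, bm25_ids], 60.0)` rewards documents that appear near the top of both lists, which is why a document ranked second and first can outrank one that is first in only a single list.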
§Module map
| This module | Python source |
|---|---|
| tokens | src/semble/tokens.py (camelCase/snake_case splitter) |
| chunking | src/semble/chunking/{core,chunking}.py (AST-merge) |
| bm25 | src/semble/index/sparse.py (path-enrichment + scoring) |
| dense | src/semble/index/dense.py (StaticEncoder via model2vec-rs) |
| ranking | src/semble/ranking/{weighting,boosting}.py (alpha + boosts) |
| penalties | src/semble/ranking/penalties.py (path priors + rerank_topk) |
| hybrid | src/semble/search.py (RRF + α-blend + boost + rerank) |
| index | src/semble/index/index.py (RipvecIndex orchestrator) |
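To illustrate what a camelCase/snake_case splitter of the kind listed for tokens does, here is a minimal sketch. The function name `split_identifier` and the exact boundary rules (including the acronym handling) are assumptions made for this example; the actual module may split differently.

```rust
/// Split an identifier into lowercase sub-tokens on `_`, other
/// non-alphanumeric characters, and camelCase boundaries.
/// `split_identifier` is a hypothetical name, not this crate's API.
pub fn split_identifier(ident: &str) -> Vec<String> {
    let chars: Vec<char> = ident.chars().collect();
    let mut tokens = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // Separator such as `_`, `-`, `.`, or `/`: close the current token.
            if !current.is_empty() {
                tokens.push(std::mem::take(&mut current));
            }
            continue;
        }
        if c.is_uppercase() && !current.is_empty() {
            // `current` is non-empty, so the previous char exists and is alphanumeric.
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            // Boundary at `fooBar` (lower or digit, then upper) and at the end
            // of an acronym such as `HTTPServer` (upper, upper, then lower).
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                tokens.push(std::mem::take(&mut current));
            }
        }
        current.extend(c.to_lowercase());
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

Under these rules, `parseHTTPResponse` yields `parse`, `http`, `response`, and `rerank_topk` yields `rerank`, `topk`, so a BM25 query for `http` can match identifiers that embed the word.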
§Scope under --model ripvec
When --model ripvec is active, the orchestrator in index drives
the full pipeline: it builds a RipvecIndex using the chunker in
chunking and the encoder in dense, and dispatches search via
hybrid::search_hybrid. Ripvec’s existing BM25 in crate::bm25 and
hybrid in crate::hybrid are not used on this path.
Per the port+ripvec scope decision in docs/PLAN.md, the final
ranking step applies ripvec’s boost_with_pagerank on top of the
ripvec engine’s rerank, so --model ripvec combines the ripvec
engine’s retrieval with ripvec’s structural prior.
Modules§
- bm25
- BM25 with ripvec’s stem-doubled path enrichment.
- chunking
- Tree-sitter AST-merge chunker (semble flavor).
- dense
- Static encoder: in-process StaticEmbedModel reimplementation.
- hybrid
- Hybrid search: RRF fusion of semantic + BM25, then boosts and rerank.
- index
- RipvecIndex orchestrator and PageRank-layered ranking.
- penalties
- File-path penalties + rerank_topk with file-saturation decay.
- ranking
- Alpha auto-detection and query-driven boosting.
- static_model
- In-process reimplementation of the Model2Vec static embedder.
- tokens
- Identifier tokenizer for BM25 indexing.
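The penalties entry above mentions a top-k rerank with file-saturation decay. One plausible reading is that each additional hit from the same file is discounted so that a single file cannot fill the result list. The sketch below implements that reading under stated assumptions: the `Hit` type, the name `rerank_topk_sketch`, and the geometric `decay` factor are all hypothetical and are not taken from the penalties module.

```rust
use std::collections::HashMap;

/// A scored chunk and the file it came from (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub file: String,
    pub score: f64,
}

/// Assumed decay scheme: the n-th hit (0-based) from a given file has its
/// score multiplied by `decay^n`, then the best `k` hits are returned.
pub fn rerank_topk_sketch(mut hits: Vec<Hit>, k: usize, decay: f64) -> Vec<Hit> {
    // Visit hits best-first so the strongest chunk per file is undiscounted.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut seen: HashMap<String, i32> = HashMap::new();
    for hit in hits.iter_mut() {
        let n = seen.entry(hit.file.clone()).or_insert(0);
        hit.score *= decay.powi(*n);
        *n += 1;
    }
    // Re-sort on the discounted scores and keep the top k.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(k);
    hits
}
```

With `decay = 0.5`, a second chunk from an already-represented file needs more than twice the score of a fresh file's best chunk to stay ahead of it, which trades a little raw relevance for file diversity in the final list.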