Module ripvec

Ripvec retrieval pipeline ported into Rust.

This subtree mirrors the Python reference implementation at ~/src/semble/src/semble/. Each Rust module corresponds to one Python source file; the port preserves the ripvec pipeline shape (chunker → tokenizer → BM25 path-enrichment → static encoder → RRF hybrid → boosts → penalties → reranker) one-for-one.
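The RRF hybrid stage in the pipeline above can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the code in `hybrid`; the function name `rrf_fuse`, the use of `usize` chunk ids, and the constant `k` are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Hypothetical helper: fuse several ranked lists of chunk ids with
/// Reciprocal Rank Fusion. Each list contributes 1 / (k + rank) per id,
/// where rank is 1-based. `k = 60.0` is a commonly used default.
pub fn rrf_fuse(rankings: &[Vec<usize>], k: f64) -> Vec<(usize, f64)> {
    let mut scores: HashMap<usize, f64> = HashMap::new();
    for ranking in rankings {
        for (rank, &id) in ranking.iter().enumerate() {
            // `enumerate` is 0-based, so add 1 to get the 1-based rank.
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(usize, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&[semantic_ids, bm25_ids], 60.0)` with one list per retriever rewards chunks that rank well in both lists, without requiring the two score scales to be comparable.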

§Module map

| This module | Python source | Notes |
|---|---|---|
| tokens | src/semble/tokens.py | camelCase/snake_case splitter |
| chunking | src/semble/chunking/{core,chunking}.py | AST-merge |
| bm25 | src/semble/index/sparse.py | path-enrichment + scoring |
| dense | src/semble/index/dense.py | StaticEncoder via model2vec-rs |
| ranking | src/semble/ranking/{weighting,boosting}.py | alpha + boosts |
| penalties | src/semble/ranking/penalties.py | path priors + rerank_topk |
| hybrid | src/semble/search.py | RRF + α-blend + boost + rerank |
| index | src/semble/index/index.py | RipvecIndex orchestrator |
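To make the camelCase/snake_case splitter listed for tokens concrete, here is a minimal sketch of one way to split identifiers before BM25 indexing. The function name `split_identifier` and its exact rules are assumptions; the real tokenizer may treat acronyms, digits, and other separators differently.

```rust
/// Hypothetical helper: split an identifier on `_`, `-`, and
/// lower-to-upper case boundaries, lowercasing every part.
/// Runs of capitals are not split, so "HTTPServer" stays one token.
pub fn split_identifier(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut prev_lower = false;
    for ch in ident.chars() {
        if ch == '_' || ch == '-' {
            // Separators end the current part and are dropped.
            if !current.is_empty() {
                parts.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // A capital right after a lowercase letter or digit starts a new part.
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase() || ch.is_ascii_digit();
        current.extend(ch.to_lowercase());
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```

Under these rules `parseHttpRequest` yields `parse`, `http`, `request`, so a query for "http request" can match the identifier through BM25 even though the literal string never appears in the source.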

§Scope under --model ripvec

When --model ripvec is active, the orchestrator in index drives the full pipeline: it builds a RipvecIndex using the chunker in chunking and the encoder in dense, and dispatches search via hybrid::search_hybrid. Ripvec’s existing BM25 in crate::bm25 and hybrid in crate::hybrid are not used on this path.

Per the port+ripvec scope decision in docs/PLAN.md, the final ranking step applies ripvec’s boost_with_pagerank on top of the ripvec engine’s rerank, so --model ripvec combines the ripvec engine’s retrieval with ripvec’s structural prior.
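The idea of layering a structural prior on top of reranked results can be sketched as follows. The signature and formula of boost_with_pagerank are not shown on this page, so everything here is an assumption: the `Hit` type, the name `layer_pagerank_prior`, the per-file PageRank map, and the multiplicative `1 + weight * pagerank` form.

```rust
use std::collections::HashMap;

/// Hypothetical search hit: a file path and its post-rerank score.
#[derive(Debug, Clone, PartialEq)]
pub struct Hit {
    pub path: String,
    pub score: f64,
}

/// Hypothetical helper: scale each hit by (1 + weight * pagerank(path)),
/// then re-sort. Files missing from the map get no boost.
/// Assumes scores are non-negative, so the boost never lowers a rank.
pub fn layer_pagerank_prior(
    mut hits: Vec<Hit>,
    pagerank: &HashMap<String, f64>,
    weight: f64,
) -> Vec<Hit> {
    for hit in &mut hits {
        let pr = pagerank.get(&hit.path).copied().unwrap_or(0.0);
        hit.score *= 1.0 + weight * pr;
    }
    // Highest boosted score first.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits
}
```

A multiplicative boost keeps the retrieval score as the primary signal: a structurally central file can overtake a near-tie, but a file with no retrieval score gains nothing from its PageRank alone.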

Modules§

bm25
BM25 with ripvec’s stem-doubled path enrichment.
chunking
Tree-sitter AST-merge chunker (semble flavor).
dense
Static encoder: in-process StaticEmbedModel reimplementation.
hybrid
Hybrid search: RRF fusion of semantic + BM25, then boosts and rerank.
index
RipvecIndex orchestrator and PageRank-layered ranking.
penalties
File-path penalties + rerank_topk with file-saturation decay.
ranking
Alpha auto-detection and query-driven boosting.
static_model
In-process reimplementation of the Model2Vec static embedder.
tokens
Identifier tokenizer for BM25 indexing.
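The file-saturation decay mentioned for penalties can be pictured with a small greedy sketch. This is not the actual rerank_topk: the name `rerank_with_saturation`, the `(path, score)` candidate shape, and the geometric `decay^n` rule are assumptions chosen to illustrate the general technique.

```rust
use std::collections::HashMap;

/// Hypothetical helper: pick up to `top_k` candidates greedily. Each time a
/// file has already contributed `n` results, its remaining candidates are
/// scored as `score * decay^n`, so one file cannot fill the whole list.
/// Assumes non-negative scores and 0.0 < decay <= 1.0.
pub fn rerank_with_saturation(
    candidates: &[(String, f64)],
    top_k: usize,
    decay: f64,
) -> Vec<(String, f64)> {
    let mut remaining: Vec<(String, f64)> = candidates.to_vec();
    let mut picked_per_file: HashMap<String, i32> = HashMap::new();
    let mut out = Vec::new();
    while out.len() < top_k && !remaining.is_empty() {
        // Find the candidate with the best decayed score in this round.
        let mut best_idx = 0;
        let mut best_score = f64::NEG_INFINITY;
        for (i, (path, score)) in remaining.iter().enumerate() {
            let seen = picked_per_file.get(path).copied().unwrap_or(0);
            let adjusted = score * decay.powi(seen);
            if adjusted > best_score {
                best_score = adjusted;
                best_idx = i;
            }
        }
        let (path, _) = remaining.remove(best_idx);
        *picked_per_file.entry(path.clone()).or_insert(0) += 1;
        out.push((path, best_score));
    }
    out
}
```

With `decay = 0.5`, a second chunk from the same file needs at least twice the raw score of a chunk from a fresh file to be selected ahead of it, which spreads the top results across files.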