Static encoder: in-process StaticEmbedModel reimplementation.
Port of ~/src/semble/src/semble/index/dense.py. Wraps
StaticEmbedModel loaded with minishlab/potion-base-32M
(256-dim, L2-normalized). Implements VectorEncoder for the
--model ripvec path. CPU-only; no batching ring buffer.
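As a rough illustration of what "256-dim, L2-normalized" output from a static embedding model involves, the sketch below looks up one embedding row per token id, averages the rows, and scales the result to unit length. It assumes mean pooling over a row-major table, as is typical for Model2Vec-style static models; the function name `pool_and_normalize` and the table layout are hypothetical and are not taken from the ripvec source.

```rust
/// Mean-pool the embedding rows selected by `token_ids`, then L2-normalize.
///
/// Assumption: `table` is a row-major embedding matrix with `dim` floats per
/// row. This layout and the function name are illustrative only.
fn pool_and_normalize(table: &[f32], dim: usize, token_ids: &[u32]) -> Vec<f32> {
    let mut out = vec![0.0f32; dim];
    let mut count = 0usize;

    for &id in token_ids {
        let start = id as usize * dim;
        // Skip ids that fall outside the table instead of panicking.
        let Some(row) = table.get(start..start + dim) else {
            continue;
        };
        for (acc, v) in out.iter_mut().zip(row) {
            *acc += *v;
        }
        count += 1;
    }

    // No usable tokens: return the zero vector unchanged.
    if count == 0 {
        return out;
    }

    // Mean over the rows that were actually accumulated.
    let inv = 1.0 / count as f32;
    for v in out.iter_mut() {
        *v *= inv;
    }

    // Scale to unit L2 norm; leave an all-zero vector as it is.
    let norm = out.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for v in out.iter_mut() {
            *v /= norm;
        }
    }
    out
}
```

Because every output vector has unit length, cosine similarity between two embeddings reduces to a plain dot product.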
The default was bumped to potion-base-32M in v1.3.0 after the
gutenberg + python-repos matrix showed 32M winning prose by
0.058 NDCG@10 while losing code by only 0.004, a clear
single-default win once the i64 mapping bug and the reranker
pooler / sigmoid / truncation bugs were fixed. The code-tuned
potion-code-16M is still available via --model-repo.
§Why not model2vec-rs?
The previous wave used the upstream model2vec-rs crate. Two real
problems pushed us to reimplement (see
crates/ripvec-core/src/encoder/semble/static_model.rs for the
full design rationale):
- model2vec_rs::StaticModel::encode_with_args runs pool_ids in a serial inner loop while tokenizers::encode_batch_fast spawns its own rayon pool. Wrapping that path in our outer par_chunks produced 60% __psynch_cvwait in the linux-corpus profile: nested rayon scopes parking on each other. The reimplementation does ONE big tokenize plus a par_iter over pool_ids, with no nested rayon and no parking.
- model2vec-rs 0.2 pinned ndarray 0.15; ripvec-core uses ndarray 0.17. The two Array2<f32> types were not interchangeable, forcing a Vec<Vec<f32>> shim. Owning the load path eliminates the mismatch.
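The "one tokenize pass, then a single flat parallel pooling pass" shape described above can be sketched as follows. This is not the ripvec implementation: it assumes tokenization has already happened in one batch upstream, and it swaps rayon's par_iter for std::thread::scope so the example stays dependency-free. The names `mean_pool` and `pool_all`, and the row-major table layout, are hypothetical.

```rust
use std::thread;

/// Average the embedding rows for one document's token ids.
/// Assumption: `table` is row-major with `dim` floats per row.
fn mean_pool(table: &[f32], dim: usize, ids: &[u32]) -> Vec<f32> {
    let mut acc = vec![0.0f32; dim];
    let mut n = 0usize;
    for &id in ids {
        let start = id as usize * dim;
        // Ignore ids that fall outside the table.
        if let Some(row) = table.get(start..start + dim) {
            for (a, v) in acc.iter_mut().zip(row) {
                *a += *v;
            }
            n += 1;
        }
    }
    if n > 0 {
        for a in acc.iter_mut() {
            *a /= n as f32;
        }
    }
    acc
}

/// Pool every document in ONE level of parallelism: the already-tokenized
/// batch is split into contiguous chunks and each chunk gets one scoped
/// thread. The workers never open a parallel scope of their own, so there
/// is no nested pool for them to park on.
fn pool_all(table: &[f32], dim: usize, docs: &[Vec<u32>], workers: usize) -> Vec<Vec<f32>> {
    let workers = workers.max(1);
    // Ceiling division, clamped to at least 1 so `chunks` never sees 0.
    let chunk = ((docs.len() + workers - 1) / workers).max(1);
    let mut out: Vec<Vec<f32>> = vec![Vec::new(); docs.len()];

    thread::scope(|s| {
        for (doc_chunk, out_chunk) in docs.chunks(chunk).zip(out.chunks_mut(chunk)) {
            s.spawn(move || {
                for (ids, slot) in doc_chunk.iter().zip(out_chunk.iter_mut()) {
                    *slot = mean_pool(table, dim, ids);
                }
            });
        }
    });
    out
}
```

The point of the design is that the inner work (`mean_pool`) is purely serial, so the only threads that ever block are the ones waiting at the end of the single outer scope.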
Structs§
- StaticEncoder: CPU-only static encoder.
Constants§
- DEFAULT_HIDDEN_DIM: Default hidden dimension for DEFAULT_MODEL_REPO.
- DEFAULT_MODEL_REPO: Default model repo identifier for the ripvec path. This is the HF repo string used as identity(); the loader reads files from a local path passed via --model-repo.