Module dense


Static encoder: in-process StaticEmbedModel reimplementation.

Port of ~/src/semble/src/semble/index/dense.py. Wraps StaticEmbedModel loaded with minishlab/potion-code-16M (256-dim, L2-normalized). Implements VectorEncoder for the --model ripvec path. CPU-only; no batching ring buffer.
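For orientation, a static embedding model turns a token-id sequence into one vector by looking up a row per token and pooling the rows. The following is a minimal sketch of that step, assuming mean pooling followed by L2 normalization over a flat row-major table. The function name `mean_pool_normalized`, the flat `&[f32]` layout, and the skip-on-out-of-range behavior are illustrative assumptions, not the actual StaticEncoder internals.

```rust
/// Mean-pool the embedding rows selected by `ids`, then L2-normalize.
///
/// `table` is a row-major `vocab x dim` matrix stored as one flat slice.
/// Hypothetical helper for illustration; not part of the ripvec API.
pub fn mean_pool_normalized(table: &[f32], dim: usize, ids: &[u32]) -> Vec<f32> {
    let mut out = vec![0.0f32; dim];
    if dim == 0 {
        return out;
    }
    let vocab = table.len() / dim;
    let mut used = 0usize;

    for &id in ids {
        let id = id as usize;
        // Assumption of this sketch: silently skip ids outside the table.
        if id >= vocab {
            continue;
        }
        let row = &table[id * dim..(id + 1) * dim];
        for (o, r) in out.iter_mut().zip(row) {
            *o += *r;
        }
        used += 1;
    }

    // No usable tokens: return the zero vector rather than dividing by zero.
    if used == 0 {
        return out;
    }

    // Mean over the rows that were actually accumulated.
    let inv = 1.0 / used as f32;
    for o in out.iter_mut() {
        *o *= inv;
    }

    // L2 normalization, so a dot product of two outputs is their cosine similarity.
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for o in out.iter_mut() {
            *o /= norm;
        }
    }
    out
}
```

With the default model described above, `dim` would be 256. Because the outputs are unit length, downstream similarity search can use a plain dot product.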

§Why not model2vec-rs?

The previous wave used the upstream model2vec-rs crate. Two real problems pushed us to reimplement (see crates/ripvec-core/src/encoder/semble/static_model.rs for the full design rationale):

  1. model2vec_rs::StaticModel::encode_with_args runs pool_ids in a serial inner loop while tokenizers::encode_batch_fast spawns its own rayon pool. Wrapping that path in our outer par_chunks produced 60% __psynch_cvwait in the linux-corpus profile, caused by nested rayon scopes parking on each other. The reimplementation does a single large tokenize call followed by one par_iter over pool_ids, so there is no nested rayon and no parking.
  2. model2vec-rs 0.2 pinned ndarray 0.15; ripvec-core uses ndarray 0.17. The two Array2<f32> types were not interchangeable, forcing a Vec<Vec<f32>> shim. Owning the load path eliminates the mismatch.
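The single-level parallel structure described in item 1 can be sketched without any external crates. The example below uses `std::thread::scope` as a stand-in for rayon's `par_iter` so that it stays self-contained. The helper name `par_map_flat` and its `workers` parameter are hypothetical. The point is only the shape: the inputs are already tokenized, each worker receives one contiguous chunk, and no worker starts further parallel work of its own.

```rust
use std::thread;

/// Apply `f` to every item using one flat layer of worker threads.
///
/// Stand-in for a single rayon `par_iter`: each worker handles one
/// contiguous chunk and never spawns nested parallel work.
/// Hypothetical helper for illustration; not part of the ripvec API.
pub fn par_map_flat<T, R, F>(items: &[T], workers: usize, f: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    if items.is_empty() {
        return Vec::new();
    }
    // Never use more workers than items, and always at least one.
    let workers = workers.max(1).min(items.len());
    // Ceiling division so every item lands in some chunk.
    let chunk = (items.len() + workers - 1) / workers;
    let f = &f;

    thread::scope(|s| {
        // Spawn all workers first, then join in order to preserve input order.
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(f).collect::<Vec<R>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

In this shape, the per-item closure would be the pooling step applied to one pre-tokenized id list, for example `par_map_flat(&all_ids, 8, |ids| pool(ids))` with a hypothetical `pool` function. Results come back in input order because the chunks are joined in the order they were spawned.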

Structs§

StaticEncoder
CPU-only static encoder.

Constants§

DEFAULT_HIDDEN_DIM
Default hidden dimension for DEFAULT_MODEL_REPO.
DEFAULT_MODEL_REPO
Default model repo identifier for the ripvec path. This is the HF repo string used as identity(); the loader reads files from a local path passed via --model-repo.