In-process reimplementation of the Model2Vec static embedder.
Replaces the model2vec-rs 0.2 dependency. Reasons:
- Parallelism: model2vec_rs::StaticModel::encode_with_args runs pool_ids in a serial inner loop and calls tokenizers::Tokenizer::encode_batch_fast (which spawns its own rayon pool internally). Calling that path from inside an outer rayon par_chunks produced ~60% __psynch_cvwait in our linux-corpus profile: nested rayon scopes parking on each other. This implementation: tokenize ONCE across the full corpus on the unfettered thread pool, then mean-pool every encoding in parallel via a single par_iter. No nesting.
- ndarray version: model2vec-rs pinned ndarray 0.15; ripvec-core uses ndarray 0.17. The two Array2<f32> types are not interchangeable. Owning the load path here lets us use the workspace ndarray directly.
- Allocator pressure: model2vec-rs builds intermediate Vec<String> clones inside encode_with_args. The local implementation tokenizes from &[&str] references directly.
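The two-phase shape described under Parallelism can be sketched with the standard library alone. This is an illustration, not the crate's code: std::thread::scope stands in for rayon's par_iter, phase 1 is serial here for brevity, and toy_tokenize, toy_pool and embed_corpus are hypothetical names that do not exist in ripvec-core.

```rust
use std::thread;

/// Hypothetical tokenizer stand-in: one "token id" per whitespace-separated
/// word, equal to the word's byte length.
fn toy_tokenize(text: &str) -> Vec<u32> {
    text.split_whitespace().map(|w| w.len() as u32).collect()
}

/// Hypothetical pooling stand-in: the mean of the ids as a 1-dimensional vector.
fn toy_pool(ids: &[u32]) -> Vec<f32> {
    let sum: f32 = ids.iter().map(|&id| id as f32).sum();
    vec![sum / ids.len().max(1) as f32]
}

/// Phase 1 tokenizes the whole corpus once; phase 2 pools every encoding in a
/// single flat parallel pass. No parallel region is opened inside another.
fn embed_corpus(texts: &[&str], workers: usize) -> Vec<Vec<f32>> {
    // Phase 1: tokenize everything up front.
    let encodings: Vec<Vec<u32>> = texts.iter().map(|t| toy_tokenize(t)).collect();

    // Phase 2: split the encodings into contiguous chunks, one per worker.
    let workers = workers.max(1);
    let chunk = ((encodings.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        let handles: Vec<_> = encodings
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|ids| toy_pool(ids)).collect::<Vec<_>>()))
            .collect();
        // Joining in spawn order keeps the output aligned with the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

The point of the sketch is the structure: the pooling workers never call back into a second thread pool, so there is nothing for them to park on.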
The file format is the published Model2Vec layout (tokenizer.json +
model.safetensors + config.json). Local paths only — if Hub download
is needed, pre-stage the files via curl (see
crates/ripvec-core/tests/ripvec_port_parity.rs for the recipe).
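Since only local paths are supported, a pre-flight check for the three files can save a confusing load error. The helper below is a hypothetical sketch (missing_model_files is not part of this module); it only tests for file presence and does not validate contents.

```rust
use std::path::Path;

/// The three files of the published Model2Vec layout named above.
const MODEL_FILES: [&str; 3] = ["tokenizer.json", "model.safetensors", "config.json"];

/// Returns the names of the layout files that are absent from `dir`,
/// in the order listed in MODEL_FILES. An empty result means all are present.
fn missing_model_files(dir: &Path) -> Vec<&'static str> {
    MODEL_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```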
§Behavioural parity
Identical math to model2vec_rs::StaticModel::encode_with_args:
- Truncate input strings by char count = max_tokens * median_token_length (HF tokenizers can be slow on huge strings).
- Tokenize via tokenizers::Tokenizer::encode_batch_fast.
- Drop UNK tokens.
- Truncate token ID list to max_tokens.
- Pool: for each token, look up the embedding row (optionally remapped via token_mapping), scale by the per-token weight (default 1.0), accumulate.
- Divide by token count; L2-normalize if normalize is set.
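As a rough illustration of the steps above (tokenization itself is omitted, since it needs the tokenizers crate), the following std-only sketch shows the char truncation and the pooling math. ToyModel and its field names, truncate_chars and pool are hypothetical and are not the crate's API. Two details are assumptions rather than facts from this page: an empty token list divides by 1, and the L2 norm is clamped to 1e-12 to avoid dividing by zero.

```rust
/// Hypothetical, simplified stand-in for a loaded model.
struct ToyModel {
    dim: usize,
    embeddings: Vec<f32>,              // row-major, rows * dim
    token_mapping: Option<Vec<usize>>, // token id -> embedding row
    weights: Option<Vec<f32>>,         // token id -> scale factor
    unk_id: Option<u32>,
    normalize: bool,
}

/// Cut `s` to at most `max_tokens * median_token_len` chars,
/// always on a char boundary.
fn truncate_chars(s: &str, max_tokens: usize, median_token_len: usize) -> &str {
    let max_chars = max_tokens.saturating_mul(median_token_len);
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

/// Drop UNK, truncate to `max_tokens`, then weighted mean-pool and
/// optionally L2-normalize.
fn pool(model: &ToyModel, ids: &[u32], max_tokens: usize) -> Vec<f32> {
    let kept: Vec<u32> = ids
        .iter()
        .copied()
        .filter(|id| Some(*id) != model.unk_id)
        .take(max_tokens)
        .collect();

    let mut sum = vec![0.0f32; model.dim];
    for &id in &kept {
        let tok = id as usize;
        // Optional remap of token id to embedding row.
        let row = match &model.token_mapping {
            Some(m) => m.get(tok).copied().unwrap_or(tok),
            None => tok,
        };
        // Per-token weight, defaulting to 1.0.
        let scale = match &model.weights {
            Some(w) => w.get(tok).copied().unwrap_or(1.0),
            None => 1.0,
        };
        let start = row * model.dim;
        for (acc, v) in sum.iter_mut().zip(&model.embeddings[start..start + model.dim]) {
            *acc += v * scale;
        }
    }

    // Assumption: an empty token list divides by 1 and yields a zero vector.
    let denom = kept.len().max(1) as f32;
    for x in &mut sum {
        *x /= denom;
    }

    if model.normalize {
        // Assumption: clamp the norm so a zero vector does not divide by zero.
        let norm = sum.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-12);
        for x in &mut sum {
            *x /= norm;
        }
    }
    sum
}
```

With a 2-dimensional table whose rows are [3, 0], [0, 4] and an UNK row, pooling the ids [0, UNK, 1] without normalization averages the two surviving rows to [1.5, 2.0].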
Verified by the integration test
crates/ripvec-core/tests/ripvec_port_parity.rs which exercises the
end-to-end pipeline against minishlab/potion-code-16M.
Structs§
- StaticEmbedModel - Loaded Model2Vec static embedder.
Constants§
- DEFAULT_MAX_TOKENS - Default token cap per chunk during embedding. Matches the model2vec_rs default; CodeChunks are typically far below this.