Module static_model


In-process reimplementation of the Model2Vec static embedder.

Replaces the model2vec-rs 0.2 dependency. Reasons:

Parallelism: model2vec_rs::StaticModel::encode_with_args runs pool_ids in a serial inner loop and calls tokenizers::Tokenizer::encode_batch_fast, which spawns its own rayon pool internally. Calling that path from inside an outer rayon par_chunks produced ~60% __psynch_cvwait in our linux-corpus profile: nested rayon scopes parking on each other. This implementation tokenizes once across the full corpus on the unfettered thread pool, then mean-pools every encoding in parallel via a single par_iter, so nothing is nested.
ndarray version: model2vec-rs pinned ndarray 0.15; ripvec-core uses ndarray 0.17. The two Array2<f32> types are not interchangeable. Owning the load path here lets us use the workspace ndarray directly.
Allocator pressure: model2vec-rs builds intermediate Vec<String> clones inside encode_with_args. The local implementation tokenizes from &[&str] references directly.
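
The two-phase shape described under Parallelism (tokenize everything first, then pool in one flat parallel pass) can be sketched as follows. This is only an illustration under stated assumptions: the module itself uses rayon's par_iter, whereas this sketch substitutes std::thread::scope so it stays dependency-free, and pool_all_flat, workers, and pool_one are hypothetical names that do not appear in the crate.

```rust
use std::thread;

/// Pool every pre-tokenized encoding in a single, flat parallel pass.
///
/// `encodings` is assumed to come from one up-front tokenization call over
/// the whole corpus, so no worker below starts any parallel work of its own.
/// `pool_one` is a hypothetical per-encoding pooling function.
pub fn pool_all_flat<F>(encodings: &[Vec<u32>], workers: usize, pool_one: F) -> Vec<Vec<f32>>
where
    F: Fn(&[u32]) -> Vec<f32> + Sync,
{
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_len = ((encodings.len() + workers - 1) / workers).max(1);
    let pool_one = &pool_one;

    thread::scope(|s| {
        // One level of parallelism: each worker handles a contiguous chunk.
        let handles: Vec<_> = encodings
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|ids| pool_one(ids.as_slice()))
                        .collect::<Vec<Vec<f32>>>()
                })
            })
            .collect();

        // Joining in spawn order keeps outputs aligned with inputs.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("pooling worker panicked"))
            .collect()
    })
}
```

Because the workers only call pool_one and never open a parallel scope themselves, there is no inner pool for an outer scope to park on, which is the property the rewrite is after.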

The file format is the published Model2Vec layout (tokenizer.json + model.safetensors + config.json). Only local paths are supported; if a Hub download is needed, pre-stage the files via curl (see crates/ripvec-core/tests/ripvec_port_parity.rs for the recipe).
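
Since only local directories are accepted, it can help to check that a pre-staged directory is complete before loading. The following is a small sketch, not part of the module's API: REQUIRED_FILES and missing_model_files are hypothetical names, and only the three file names are taken from the layout described above.

```rust
use std::path::Path;

/// The three files that make up the Model2Vec layout described above.
pub const REQUIRED_FILES: [&str; 3] = ["tokenizer.json", "model.safetensors", "config.json"];

/// Return the names of required files that are absent from `dir`,
/// in the same order as `REQUIRED_FILES`. An empty result means the
/// directory looks complete (contents are not validated here).
pub fn missing_model_files(dir: &Path) -> Vec<&'static str> {
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```

A caller could run this check first and report the missing names, rather than surfacing a lower-level I/O error from the loader.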

Behavioural parity

The math is identical to model2vec_rs::StaticModel::encode_with_args:

Truncate each input string to max_tokens * median_token_length characters (HF tokenizers can be slow on huge strings).
Tokenize via tokenizers::Tokenizer::encode_batch_fast.
Drop UNK tokens.
Truncate token ID list to max_tokens.
Pool: for each token, look up the embedding row (optionally remapped via token_mapping), scale by the per-token weight (default 1.0), accumulate.
Divide by token count; L2-normalize if normalize is set.
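
The steps above, apart from tokenization itself, can be sketched in dependency-free Rust. This is a minimal illustration rather than the crate's code: PoolParams, truncate_chars, and pool_ids_sketch are hypothetical names, embeddings are plain Vec<f32> rows instead of an ndarray matrix, and the indexing of weights by token ID, the zero-token guard, and the 1e-12 norm floor are assumptions.

```rust
/// Hypothetical stand-in for the loaded model state.
pub struct PoolParams<'a> {
    pub embeddings: &'a [Vec<f32>],          // one row per embedding index
    pub token_mapping: Option<&'a [usize]>,  // optional token ID -> row index remap
    pub weights: Option<&'a [f32]>,          // optional per-token weights (assumed indexed by token ID)
    pub unk_id: Option<u32>,                 // UNK token ID, if the tokenizer has one
    pub max_tokens: usize,
    pub normalize: bool,
}

/// Truncate by character count (not bytes), so a UTF-8 sequence is never split.
pub fn truncate_chars(text: &str, max_tokens: usize, median_token_length: usize) -> &str {
    let max_chars = max_tokens.saturating_mul(median_token_length);
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}

/// Drop UNK, cap at `max_tokens`, weighted-sum the rows, divide by the
/// token count, and optionally L2-normalize.
pub fn pool_ids_sketch(ids: &[u32], p: &PoolParams<'_>) -> Vec<f32> {
    let dim = p.embeddings.first().map_or(0, |row| row.len());
    let mut sum = vec![0.0f32; dim];
    let mut count = 0usize;

    for &id in ids.iter().filter(|&&id| Some(id) != p.unk_id).take(p.max_tokens) {
        let tok = id as usize;
        let row_idx = p.token_mapping.map_or(tok, |m| m[tok]);
        let weight = p.weights.map_or(1.0, |w| w[tok]); // default weight is 1.0
        for (acc, &v) in sum.iter_mut().zip(&p.embeddings[row_idx]) {
            *acc += weight * v;
        }
        count += 1;
    }

    // Assumption: guard against division by zero when no tokens survive.
    let denom = count.max(1) as f32;
    for v in &mut sum {
        *v /= denom;
    }

    if p.normalize {
        // Assumption: floor the norm so an all-zero vector stays finite.
        let norm = sum.iter().map(|v| v * v).sum::<f32>().sqrt().max(1e-12);
        for v in &mut sum {
            *v /= norm;
        }
    }
    sum
}
```

Note that the divisor is the number of pooled tokens, not the sum of the weights, which follows the "divide by token count" wording above.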

Parity is verified by the integration test crates/ripvec-core/tests/ripvec_port_parity.rs, which exercises the end-to-end pipeline against minishlab/potion-code-16M.

Structs

StaticEmbedModel: Loaded Model2Vec static embedder.

Constants

DEFAULT_MAX_TOKENS: Default token cap per chunk during embedding. Matches the model2vec_rs default; CodeChunks are typically far below this.