ripvec-core 2.0.0

Semantic code + document search engine. Cacheless static-embedding + cross-encoder rerank by default; optional ModernBERT/BGE transformer engines with GPU backends. Tree-sitter chunking, hybrid BM25 + PageRank, composable ranking layers.
//! Encoder abstraction above [`EmbedBackend`](crate::backend::EmbedBackend).
//!
//! [`VectorEncoder`] hides the difference between transformer and static-table
//! encoders behind one interface, so downstream search code (CLI dispatch,
//! [`HybridIndex`](crate::hybrid::HybridIndex), cache layer) does not branch
//! on encoder family.
//!
//! ## Two implementations
//!
//! - [`BertEncoder`] (P0.3) — wraps `Vec<Box<dyn EmbedBackend>>` + tokenizer.
//!   Used for `--model bert` and `--model modernbert`. Owns the existing
//!   walk/chunk/tokenize/embed streaming pipeline.
//!
//! - [`StaticEncoder`](crate::encoder::ripvec::dense::StaticEncoder) (P1.5) —
//!   wraps [`model2vec::Model2Vec`]. Used for `--model ripvec`. CPU-only;
//!   no batching or ring buffer (table-lookup encoder is memory-bound, not
//!   compute-bound).
//!
//! ## Design rationale
//!
//! Each implementation owns its full pipeline because transformer and static
//! encoders have fundamentally different compute shapes:
//!
//! | | BERT | static |
//! |---|---|---|
//! | Tokenizer | HuggingFace BPE/WordPiece | model2vec internal |
//! | Inference | multi-layer attention + GEMM | embedding-table lookup |
//! | Scheduler | rayon clones (CPU) / ring buffer (GPU) | single-threaded encode |
//! | Hidden dim | 384 / 768 | 256 |
//!
//! Forcing a uniform "tokenize then encode" abstraction would either lie
//! about static encoders (no real tokens to expose) or impose transformer
//! ceremony on a lookup table. `VectorEncoder` instead abstracts at the
//! repo→(chunks, embeddings) boundary, where the shapes naturally agree.
//!
//! See `docs/PLAN.md` cluster P0 for the broader port architecture.
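/*!
## Example sketch

A minimal, self-contained sketch of the repo→(chunks, embeddings) boundary
described above, assuming a reduced stand-in trait. `SketchEncoder`, `Stub`,
`select`, and `index` are hypothetical names, not items of this crate;
`String` stands in for `CodeChunk`, and the config and profiler arguments are
omitted. Pairing each repo string with a `--model` value and a vector width is
also an assumption made for illustration, and the stub performs no inference.

```rust
use std::path::Path;

/// Hypothetical stand-in for `VectorEncoder`, reduced to std types.
trait SketchEncoder: Send + Sync {
    /// Returns (chunk texts, embeddings) in parallel order.
    fn embed_root(&self, root: &Path) -> Result<(Vec<String>, Vec<Vec<f32>>), String>;
    fn hidden_dim(&self) -> usize;
    fn identity(&self) -> &str;
}

/// Stub that emits one zero vector for one fake chunk.
struct Stub {
    identity: &'static str,
    dim: usize,
}

impl SketchEncoder for Stub {
    fn embed_root(&self, root: &Path) -> Result<(Vec<String>, Vec<Vec<f32>>), String> {
        let chunks = vec![format!("chunk from {}", root.display())];
        let embeddings = vec![vec![0.0; self.dim]];
        Ok((chunks, embeddings))
    }
    fn hidden_dim(&self) -> usize {
        self.dim
    }
    fn identity(&self) -> &str {
        self.identity
    }
}

/// Map a `--model` value to an encoder. Only this function knows the family.
fn select(model: &str) -> Option<Box<dyn SketchEncoder>> {
    match model {
        "modernbert" => Some(Box::new(Stub { identity: "nomic-ai/modernbert-embed-base", dim: 768 })),
        "ripvec" => Some(Box::new(Stub { identity: "minishlab/potion-code-16M", dim: 256 })),
        _ => None,
    }
}

/// Downstream code: identical for every encoder family.
/// Returns the number of `f32` slots the embedding matrix would need.
fn index(enc: &dyn SketchEncoder, root: &Path) -> Result<usize, String> {
    let (chunks, embeddings) = enc.embed_root(root)?;
    debug_assert_eq!(chunks.len(), embeddings.len());
    Ok(embeddings.len() * enc.hidden_dim())
}
```

In this sketch `index` never inspects which family produced the vectors; the
only family-aware code is the `match` in `select`.
*/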

use std::path::Path;

use crate::chunk::CodeChunk;
use crate::embed::SearchConfig;
use crate::profile::Profiler;

pub mod bert;
pub mod ripvec;

pub use bert::BertEncoder;

/// Trait that abstracts a repository root → (chunks, embedding vectors).
///
/// Implementations own their full pipeline (walk, chunk, tokenize, encode)
/// since transformer-family and static-table encoders have fundamentally
/// different compute shapes (see module-level docs).
///
/// # Object safety
///
/// The trait is object-safe, so `dyn VectorEncoder` is usable: every method
/// takes `&self` and returns a concrete type, and the trait has no
/// associated types or generic methods.
///
/// # Thread safety
///
/// `Send + Sync` is required because the encoder is shared across the
/// indexing pipeline's rayon and channel-based workers.
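/**
As a minimal sketch of why the bound matters, the block below shares one
borrowed encoder across several workers. It uses `std::thread::scope` rather
than rayon or channels, and `SketchEncoder`, `Fixed`, and
`dims_seen_by_workers` are hypothetical stand-ins, not items of this crate.

```rust
use std::thread;

/// Hypothetical stand-in for `VectorEncoder`, with the same `Send + Sync` bound.
trait SketchEncoder: Send + Sync {
    fn hidden_dim(&self) -> usize;
}

/// Encoder stub that only reports a fixed width.
struct Fixed(usize);

impl SketchEncoder for Fixed {
    fn hidden_dim(&self) -> usize {
        self.0
    }
}

/// Lend one encoder to `workers` scoped threads and collect what each saw.
fn dims_seen_by_workers(enc: &dyn SketchEncoder, workers: usize) -> Vec<usize> {
    thread::scope(|s| {
        // `&dyn SketchEncoder` may cross threads because the trait requires `Sync`.
        let handles: Vec<_> = (0..workers)
            .map(|_| s.spawn(move || enc.hidden_dim()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Dropping `Send + Sync` from the stand-in trait makes the `spawn` call fail to
compile, which is the failure the bound on the real trait rules out.
*/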
pub trait VectorEncoder: Send + Sync {
    /// Walk `root`, chunk every supported file, and embed every chunk.
    ///
    /// Returns the chunks and their embeddings in parallel order:
    /// `chunks[i]` is embedded as `embeddings[i]`. Implementations choose
    /// their own chunker: [`BertEncoder`] uses the tree-sitter chunker,
    /// while the static `ripvec` encoder uses the AST-merge chunker. The two
    /// emit different chunk shapes, both projected onto [`CodeChunk`].
    ///
    /// `cfg` carries pipeline tuning (batch size, token caps, walk filters).
    /// Static encoders ignore the transformer-specific fields (`batch_size`,
    /// `max_tokens`) but still consult walk-related fields.
    ///
    /// # Errors
    ///
    /// Returns an error if file walking, chunking, tokenization, or
    /// inference fails.
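    /**
# Examples

A minimal sketch of how a caller might consume the parallel-order contract,
assuming a hypothetical helper `pair_up` that is not part of this crate. The
chunk type is left generic so the block needs nothing beyond `std`.

```rust
/// Pair each chunk with its embedding, rejecting outputs that break the
/// `chunks[i]` <-> `embeddings[i]` contract or mix vector widths.
fn pair_up<C>(
    chunks: Vec<C>,
    embeddings: Vec<Vec<f32>>,
    hidden_dim: usize,
) -> Result<Vec<(C, Vec<f32>)>, String> {
    if chunks.len() != embeddings.len() {
        return Err(format!(
            "{} chunks but {} embeddings",
            chunks.len(),
            embeddings.len()
        ));
    }
    // Every vector must match the width the encoder reports.
    if let Some(bad) = embeddings.iter().position(|e| e.len() != hidden_dim) {
        return Err(format!(
            "embedding {bad} has {} dims, expected {hidden_dim}",
            embeddings[bad].len()
        ));
    }
    Ok(chunks.into_iter().zip(embeddings).collect())
}
```

With the real trait, the inputs would be the two vectors returned by
`embed_root` and the value of `hidden_dim()`.
    */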
    fn embed_root(
        &self,
        root: &Path,
        cfg: &SearchConfig,
        profiler: &Profiler,
    ) -> crate::Result<(Vec<CodeChunk>, Vec<Vec<f32>>)>;

    /// Hidden dimension of the emitted embeddings.
    ///
    /// Used by [`SearchIndex`](crate::index::SearchIndex) for the embedding
    /// matrix shape, and by the cache layer to refuse cross-family loads
    /// (a 256-dim static `ripvec` index cannot be searched with a 768-dim
    /// ModernBERT query vector).
    fn hidden_dim(&self) -> usize;

    /// Stable identifier used as the cache-manifest key.
    ///
    /// For HuggingFace-backed encoders this is the model repo string (e.g.
    /// `"nomic-ai/modernbert-embed-base"`, `"minishlab/potion-code-16M"`).
    /// The ripvec engine path does not write the cache, but its identifier
    /// is still consulted for logging and diagnostics.
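    /**
# Examples

A minimal sketch of a manifest check keyed on the identifier and the hidden
dimension. `Manifest` and `check_compatible` are hypothetical; the crate's
actual cache format and checks are not shown here.

```rust
/// Hypothetical manifest entry: what a cached index records about its encoder.
struct Manifest {
    identity: String,
    hidden_dim: usize,
}

/// Refuse to query an index with an encoder of a different width or identity.
fn check_compatible(manifest: &Manifest, identity: &str, hidden_dim: usize) -> Result<(), String> {
    if manifest.hidden_dim != hidden_dim {
        return Err(format!(
            "index holds {}-dim vectors but the encoder emits {hidden_dim}-dim vectors",
            manifest.hidden_dim
        ));
    }
    if manifest.identity != identity {
        return Err(format!(
            "index built by {:?}, queried with {identity:?}",
            manifest.identity
        ));
    }
    Ok(())
}
```

The width is compared first so that a cross-family mismatch is reported as a
dimension error even when the identifiers also differ.
    */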
    fn identity(&self) -> &str;
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Verify that `VectorEncoder` is object-safe by constructing a trait
    /// object type. Compilation is the test.
    #[test]
    fn trait_is_object_safe() {
        fn assert_object_safe(_: &dyn VectorEncoder) {}
        // Constructing the function item is the load-bearing check;
        // referencing it keeps the type-check live across dead-code analysis.
        let _ = assert_object_safe;
    }

    /// Verify that `Box<dyn VectorEncoder>` is `Send` + `Sync`.
    #[test]
    fn trait_object_is_send_and_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<Box<dyn VectorEncoder>>();
    }

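    /// Verify that `&dyn VectorEncoder` is `Send`, so a borrowed encoder can
    /// be handed to parallel pipeline workers. This holds because
    /// `dyn VectorEncoder` is `Sync`.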
    #[test]
    fn shared_reference_is_send() {
        fn assert_send<T: Send>() {}
        assert_send::<&dyn VectorEncoder>();
    }
}