Module colbert

ColBERT-style Multi-Vector Search

Late interaction model for dense retrieval with token-level matching.

§Algorithm Overview

ColBERT (Contextualized Late Interaction over BERT) represents documents as collections of token embeddings and uses MaxSim for scoring:

  1. Each document/query → sequence of token embeddings
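  2. Score = Σ over query tokens q of max over document tokens d of sim(q, d): each query token contributes its best match in the document
  3. “Late interaction”: query and document tokens are compared at search time, instead of pooling each side into a single vector beforehand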
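The scoring rule above can be sketched in a few lines of plain Rust. This is a minimal illustration, not the crate's implementation: the helper names `cosine` and `maxsim` are hypothetical, and cosine similarity is an assumption (the index may use a dot product or another similarity function).

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
/// (Hypothetical helper; the similarity used by the crate is not stated here.)
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// MaxSim: for each query token, take the highest similarity to any
/// document token, then sum those maxima over all query tokens.
/// An empty document contributes nothing, so the score is 0.0.
fn maxsim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| cosine(q, d))
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .filter(|best| best.is_finite())
        .sum()
}
```

With cosine similarity each query token contributes at most 1.0, so the score is bounded by the number of query tokens. Scoring one document costs one similarity computation per (query token, document token) pair.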

§Benefits

  • Fine-grained matching: Matches specific parts of documents
  • Better accuracy: Token-level embeddings can keep detail that a single pooled vector averages away
  • Interpretability: Can identify which tokens matched
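The interpretability point can be illustrated by recording, for each query token, which document token produced the maximum. The sketch below is an assumption-laden illustration: `TokenMatch`, `token_matches`, and `cosine` are hypothetical names, cosine similarity is assumed, and the fields of the crate's `ColbertSearchResult` may look different.

```rust
/// One query-token-to-document-token alignment.
/// (Hypothetical type, not part of the crate's API.)
#[derive(Debug, Clone, PartialEq)]
struct TokenMatch {
    query_token: usize,
    doc_token: usize,
    similarity: f32,
}

/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// For every query token, find the document token with the highest
/// similarity. Query tokens produce no entry when the document is empty.
fn token_matches(query: &[Vec<f32>], doc: &[Vec<f32>]) -> Vec<TokenMatch> {
    query
        .iter()
        .enumerate()
        .filter_map(|(qi, q)| {
            doc.iter()
                .enumerate()
                .map(|(di, d)| (di, cosine(q, d)))
                .max_by(|a, b| a.1.total_cmp(&b.1))
                .map(|(di, sim)| TokenMatch {
                    query_token: qi,
                    doc_token: di,
                    similarity: sim,
                })
        })
        .collect()
}
```

Summing the `similarity` fields of the returned matches gives the MaxSim score, so the alignment and the score come from the same pass over the token pairs.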

§Example

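use oxify_vector::colbert::{ColbertConfig, ColbertIndex};
use std::collections::HashMap;

// `?` needs an enclosing function that returns `Result`. The boxed error type
// is an assumption: it works if the crate's error implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = ColbertConfig::default();
    let mut index = ColbertIndex::new(config);

    // Each document maps to a list of token embeddings
    // (here: three tokens, each a 3-dimensional vector).
    let mut doc_tokens = HashMap::new();
    doc_tokens.insert("doc1".to_string(), vec![
        vec![0.1, 0.2, 0.3],
        vec![0.2, 0.3, 0.4],
        vec![0.3, 0.4, 0.5],
    ]);

    index.build(&doc_tokens)?;

    // The query is also a list of token embeddings of the same dimension.
    let query_tokens = vec![
        vec![0.15, 0.25, 0.35],
        vec![0.25, 0.35, 0.45],
    ];

    let results = index.search(&query_tokens, 10)?;
    Ok(())
}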

Structs§

ColbertConfig
ColBERT configuration
ColbertIndex
ColBERT index for multi-vector search
ColbertSearchResult
ColBERT search result with token-level match information
ColbertStats
ColBERT index statistics
MultiVectorDoc
Multi-vector representation of a document