retrieval-kit 0.1.0

A Rust library for local document ingestion, vector search, keyword search, and MCP-style retrieval tool definitions.

Retrieval Kit

Retrieval Kit is a small Rust library for local document ingestion, vector search, keyword search, and MCP-style retrieval tool definitions.

It currently ships with:

  • LanceDB storage for documents, chunks, vectors, and full-text search
  • ONNX Runtime embeddings through sentence-transformers/all-MiniLM-L12-v2
  • single-document, batch, file, and glob ingestion
  • semantic search, keyword search, and document list/get/delete APIs
  • JSON tool definitions and invocation helpers for retrieval integrations

Example

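// Minimal end-to-end example: ingest one document, then search and fetch it.
use retrieval_kit::{
    DbEngine, EmbeddingsConfig, EmbeddingsProviderKind, RKit,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // LanceDB storage on local disk, paired with the default ORT embedder.
    let mut rkit = RKit::new(
        DbEngine::LanceDb {
            path: PathBuf::from("./rkit-data"),
            // 384 matches the default all-MiniLM-L12-v2 model.
            vector_dimensions: 384,
        },
        EmbeddingsProviderKind::Ort(EmbeddingsConfig::default()),
    )?;

    // Initialization fails early if embedder and schema dimensions disagree.
    rkit.init().await?;

    // Ingest a single document from an in-memory string.
    let ingested = rkit
        .ingest_document("Rust makes local retrieval systems practical.".to_string())
        .await?;

    // Run a semantic query and a keyword query, each capped at 5 results.
    let semantic_results = rkit
        .vector_search("local search in Rust".to_string(), 5)
        .await?;
    let keyword_results = rkit.keyword_search("Rust".to_string(), 5).await?;

    // Fetch the stored document back by the ID returned from ingestion.
    let document = rkit.get_document(ingested.document_id).await?;

    println!("{semantic_results:#?}");
    println!("{keyword_results:#?}");
    println!("{document:#?}");

    Ok(())
}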

Configuration Notes

The default embedder downloads model assets from Hugging Face unless local paths are supplied in EmbeddingsConfig. To run fully offline, provide all four local asset paths:

  • local_model_path
  • local_tokenizer_path
  • local_pooling_config_path
  • local_transformer_config_path
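
The "all four paths" rule can be checked before startup with a small helper. The sketch below is illustrative only: LocalAssets and missing are hypothetical names, and the Option<PathBuf> field types are an assumption. Only the four field names come from the list above; the real EmbeddingsConfig may be shaped differently.

```rust
use std::path::PathBuf;

/// Hypothetical mirror of the four local asset settings.
/// Field names follow the list above; the types are an assumption.
#[derive(Debug, Default)]
pub struct LocalAssets {
    pub local_model_path: Option<PathBuf>,
    pub local_tokenizer_path: Option<PathBuf>,
    pub local_pooling_config_path: Option<PathBuf>,
    pub local_transformer_config_path: Option<PathBuf>,
}

impl LocalAssets {
    /// Returns the names of the settings that are still unset.
    /// An empty result means every asset has a local path.
    pub fn missing(&self) -> Vec<&'static str> {
        let fields = [
            ("local_model_path", &self.local_model_path),
            ("local_tokenizer_path", &self.local_tokenizer_path),
            ("local_pooling_config_path", &self.local_pooling_config_path),
            (
                "local_transformer_config_path",
                &self.local_transformer_config_path,
            ),
        ];
        fields
            .iter()
            .filter(|(_, path)| path.is_none())
            .map(|(name, _)| *name)
            .collect()
    }
}
```

Calling missing() on a default value reports all four names. A caller that must stay offline could refuse to start until the list is empty.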

For the default all-MiniLM-L12-v2 model, set LanceDB vector_dimensions to 384. Initialization validates known embedder dimensions against the database schema and fails early on mismatches.
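
As a rough picture of that early check, here is a minimal sketch. check_dimensions and DimensionMismatch are hypothetical names, not the crate's API. The sketch assumes an embedder may or may not report a known output size, and that only a known size can conflict with the schema.

```rust
use std::fmt;

/// Hypothetical error for illustration; not retrieval-kit's own error type.
#[derive(Debug, PartialEq)]
pub struct DimensionMismatch {
    pub embedder: usize,
    pub schema: usize,
}

impl fmt::Display for DimensionMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "embedder produces {}-dimensional vectors but the schema expects {}",
            self.embedder, self.schema
        )
    }
}

impl std::error::Error for DimensionMismatch {}

/// Compares a known embedder dimension with the schema dimension.
/// `None` stands for an embedder whose output size is not known up front,
/// in which case there is nothing to validate.
pub fn check_dimensions(
    embedder: Option<usize>,
    schema: usize,
) -> Result<(), DimensionMismatch> {
    match embedder {
        Some(dim) if dim != schema => Err(DimensionMismatch {
            embedder: dim,
            schema,
        }),
        _ => Ok(()),
    }
}
```

With the default model, check_dimensions(Some(384), 384) succeeds, while a schema created with 768 dimensions is rejected before any rows are written.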

Ingestion uses tokenizer-aware chunking when the ORT embedder is initialized, so text is split into chunks before model truncation can drop content. The standalone chunk_text helper remains character-based.
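
The difference between the two strategies can be sketched as follows. Both functions are hypothetical and are not the crate's chunk_text or its ingestion code. Whitespace splitting stands in for a real model tokenizer, which is a deliberate simplification.

```rust
/// Character-based chunking: every chunk holds at most `max_chars` characters,
/// regardless of how many model tokens those characters turn into.
pub fn chunk_by_chars(text: &str, max_chars: usize) -> Vec<String> {
    if max_chars == 0 {
        return Vec::new();
    }
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect()
}

/// Token-budget chunking: every chunk holds at most `max_tokens` tokens, so a
/// model with that input limit never has to truncate a chunk.
/// Assumption: whitespace-separated words stand in for tokenizer output.
pub fn chunk_by_tokens(text: &str, max_tokens: usize) -> Vec<String> {
    if max_tokens == 0 {
        return Vec::new();
    }
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens
        .chunks(max_tokens)
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

A character budget can cut words in half and says nothing about token count. A token budget keeps each chunk within the model's input window.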

Retrieval Tools

get_tool_definitions returns JSON schemas for:

  • semantic_search
  • keyword_search
  • list_documents
  • get_document

Use invoke_tool to dispatch those tools directly from JSON arguments. Search tools default to a limit of 10 when no limit is provided.
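
The dispatch-and-default behavior can be pictured with a minimal, std-only sketch. ToolCall, resolve_tool, and the argument shapes are hypothetical; the real invoke_tool takes JSON arguments and its signature is not shown here. Only the four tool names and the default limit of 10 come from the text above.

```rust
/// Default applied when a search call omits its limit (value from the text above).
pub const DEFAULT_SEARCH_LIMIT: usize = 10;

/// Hypothetical parsed form of a tool invocation.
#[derive(Debug, PartialEq)]
pub enum ToolCall {
    SemanticSearch { query: String, limit: usize },
    KeywordSearch { query: String, limit: usize },
    ListDocuments,
    // Assumption: the document ID is carried as a string.
    GetDocument { document_id: String },
}

/// Turns an optional text argument into an owned string or a readable error.
fn required(arg: Option<&str>, tool: &str) -> Result<String, String> {
    arg.map(str::to_owned)
        .ok_or_else(|| format!("{tool} requires a text argument"))
}

/// Maps a tool name plus loosely typed arguments onto a `ToolCall`,
/// filling in the default limit for the two search tools.
pub fn resolve_tool(
    name: &str,
    text_arg: Option<&str>,
    limit: Option<usize>,
) -> Result<ToolCall, String> {
    match name {
        "semantic_search" => Ok(ToolCall::SemanticSearch {
            query: required(text_arg, name)?,
            limit: limit.unwrap_or(DEFAULT_SEARCH_LIMIT),
        }),
        "keyword_search" => Ok(ToolCall::KeywordSearch {
            query: required(text_arg, name)?,
            limit: limit.unwrap_or(DEFAULT_SEARCH_LIMIT),
        }),
        "list_documents" => Ok(ToolCall::ListDocuments),
        "get_document" => Ok(ToolCall::GetDocument {
            document_id: required(text_arg, name)?,
        }),
        other => Err(format!("unknown tool: {other}")),
    }
}
```

Resolving keyword_search without a limit yields a call with limit 10, and an unlisted tool name produces an error instead of a silent no-op.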

Current Limits

  • LanceDB and ORT are the only built-in backend/provider pair.
  • Document metadata and metadata filters are not implemented yet.
  • LanceDB writes are prevalidated and partially cleaned up on insert failure, but they are not fully transactional.
  • Vector indexes are created only after enough rows exist for LanceDB's automatic vector index training. Errors from keyword index creation and from later vector index creation are returned to callers.