retrieval-kit 0.1.0

# Retrieval Kit

Retrieval Kit is a small Rust library for local document ingestion, vector search,
keyword search, and MCP-style retrieval tool definitions.

It currently ships with:

- LanceDB storage for documents, chunks, vectors, and full-text search
- ONNX Runtime embeddings through `sentence-transformers/all-MiniLM-L12-v2`
- single-document, batch, file, and glob ingestion
- semantic search, keyword search, and document list/get/delete APIs
- JSON tool definitions and invocation helpers for retrieval integrations

## Example

```rust
use retrieval_kit::{
    DbEngine, EmbeddingsConfig, EmbeddingsProviderKind, RKit,
};
use std::path::PathBuf;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
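    // Configure LanceDB storage at ./rkit-data with the default ORT embedder.
    // 384 matches the output size of all-MiniLM-L12-v2.
    let mut rkit = RKit::new(
        DbEngine::LanceDb {
            path: PathBuf::from("./rkit-data"),
            vector_dimensions: 384,
        },
        EmbeddingsProviderKind::Ort(EmbeddingsConfig::default()),
    )?;

    // Initialize the kit before ingesting or searching.
    rkit.init().await?;

    // Ingest a single document from an in-memory string.
    let ingested = rkit
        .ingest_document("Rust makes local retrieval systems practical.".to_string())
        .await?;

    // Query the same data semantically and by keyword, then fetch the
    // ingested document by its ID.
    let semantic_results = rkit
        .vector_search("local search in Rust".to_string(), 5)
        .await?;
    let keyword_results = rkit.keyword_search("Rust".to_string(), 5).await?;
    let document = rkit.get_document(ingested.document_id).await?;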

    println!("{semantic_results:#?}");
    println!("{keyword_results:#?}");
    println!("{document:#?}");

    Ok(())
}
```

## Configuration Notes

The default embedder downloads model assets from Hugging Face unless local paths
are supplied in `EmbeddingsConfig`. To run fully offline, provide all four local
asset paths:

- `local_model_path`
- `local_tokenizer_path`
- `local_pooling_config_path`
- `local_transformer_config_path`
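
Before relying on an offline setup, it can help to confirm that all four assets actually exist on disk. The block below is a minimal standard-library sketch of such a preflight check. `OfflineAssets` and `missing_assets` are hypothetical names used for illustration and are not part of retrieval-kit; only the four field names mirror the options listed above, and storing them as `PathBuf` is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical holder for the four local asset paths.
/// This is NOT a retrieval-kit type; the field names simply mirror the
/// `EmbeddingsConfig` options, and the `PathBuf` type is an assumption.
pub struct OfflineAssets {
    pub local_model_path: PathBuf,
    pub local_tokenizer_path: PathBuf,
    pub local_pooling_config_path: PathBuf,
    pub local_transformer_config_path: PathBuf,
}

impl OfflineAssets {
    /// Returns the names of the options whose files are missing on disk,
    /// in declaration order. An empty result means every asset was found.
    pub fn missing_assets(&self) -> Vec<&'static str> {
        let checks: [(&'static str, &Path); 4] = [
            ("local_model_path", self.local_model_path.as_path()),
            ("local_tokenizer_path", self.local_tokenizer_path.as_path()),
            (
                "local_pooling_config_path",
                self.local_pooling_config_path.as_path(),
            ),
            (
                "local_transformer_config_path",
                self.local_transformer_config_path.as_path(),
            ),
        ];

        checks
            .iter()
            .filter(|(_, path)| !path.is_file())
            .map(|(name, _)| *name)
            .collect()
    }
}
```

Running a check like this at startup and refusing to continue when the returned list is non-empty gives a clearer error than a failed download attempt on a machine without network access.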

For the default `all-MiniLM-L12-v2` model, set LanceDB `vector_dimensions` to
`384`. Initialization validates known embedder dimensions against the database
schema and fails early on mismatches.
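
The fail-early check described above can be pictured with a small standalone sketch. It is not the library's implementation: `DimensionMismatch` and `validate_dimensions` are hypothetical names, and modelling an embedder of unknown size as `None` (and skipping the check in that case) is an assumption based on the word "known" in the text.

```rust
use std::fmt;

/// Hypothetical error for this sketch (not retrieval-kit's error type).
#[derive(Debug, PartialEq)]
pub struct DimensionMismatch {
    pub embedder: usize,
    pub schema: usize,
}

impl fmt::Display for DimensionMismatch {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(
            f,
            "embedder produces {}-dimensional vectors, but the schema expects {}",
            self.embedder, self.schema
        )
    }
}

impl std::error::Error for DimensionMismatch {}

/// Compares the embedder's output size with the configured
/// `vector_dimensions`. `None` models an embedder whose size is not known
/// in advance; this sketch assumes the check is skipped in that case.
pub fn validate_dimensions(
    embedder: Option<usize>,
    schema: usize,
) -> Result<(), DimensionMismatch> {
    match embedder {
        Some(dim) if dim != schema => Err(DimensionMismatch {
            embedder: dim,
            schema,
        }),
        _ => Ok(()),
    }
}
```

With these definitions, `validate_dimensions(Some(384), 384)` succeeds, while `validate_dimensions(Some(384), 768)` returns an error before any vectors are written.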

Ingestion uses tokenizer-aware chunking when the ORT embedder is initialized, so
text is split into chunks before the model's truncation limit would drop
content. The standalone `chunk_text` helper remains character-based.
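
To make the difference between the two chunking strategies concrete, here is a rough standard-library sketch. Neither function is the library's code, and the signature of `chunk_text` is not shown in this README: `chunk_by_chars` and `chunk_by_tokens` are hypothetical names, and whitespace splitting stands in for the real model tokenizer, which produces subword tokens and will usually count more tokens per chunk than this does.

```rust
/// Character-based chunking: cuts the text every `max_chars` characters,
/// regardless of word or token boundaries.
pub fn chunk_by_chars(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|piece| piece.iter().collect::<String>())
        .collect()
}

/// Token-aware chunking: packs whole tokens into a chunk until the token
/// budget is reached, so no chunk exceeds what the model can read.
/// ASSUMPTION: whitespace-separated words stand in for model tokens.
pub fn chunk_by_tokens(text: &str, max_tokens: usize) -> Vec<String> {
    assert!(max_tokens > 0, "max_tokens must be positive");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens
        .chunks(max_tokens)
        .map(|piece| piece.join(" "))
        .collect()
}
```

A character budget says nothing about how many tokens a chunk contains, so a character-based chunk can still exceed the model's input length and be truncated. Budgeting in tokens avoids that, which is the motivation the paragraph above gives for tokenizer-aware ingestion.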

## Retrieval Tools

`get_tool_definitions` returns JSON schemas for:

- `semantic_search`
- `keyword_search`
- `list_documents`
- `get_document`

Use `invoke_tool` to dispatch those tools directly from JSON arguments. Search
tools default to a limit of 10 when no limit is provided.
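
The dispatch behavior described above, including the default limit, can be sketched without the library as follows. This is an illustration only: `ToolCall` and `parse_tool_call` are hypothetical names, the argument names `query` and `document_id` and the `String` ID type are assumptions, and the real `invoke_tool` works on JSON arguments rather than pre-extracted values.

```rust
/// Default applied when a search call omits `limit` (value from the text above).
pub const DEFAULT_SEARCH_LIMIT: usize = 10;

/// Hypothetical, simplified view of a parsed tool call.
/// ASSUMPTION: argument names and the `String` document ID are illustrative.
#[derive(Debug, PartialEq)]
pub enum ToolCall {
    SemanticSearch { query: String, limit: usize },
    KeywordSearch { query: String, limit: usize },
    ListDocuments,
    GetDocument { document_id: String },
}

/// Maps a tool name plus already-extracted arguments to a `ToolCall`.
/// `text_arg` carries the query or document ID, depending on the tool.
pub fn parse_tool_call(
    name: &str,
    text_arg: Option<&str>,
    limit: Option<usize>,
) -> Result<ToolCall, String> {
    // Helper that turns a missing required argument into an error message.
    let required = |field: &str| {
        text_arg
            .map(str::to_string)
            .ok_or_else(|| format!("`{}` requires `{}`", name, field))
    };
    // Search tools fall back to the default when no limit is given.
    let limit = limit.unwrap_or(DEFAULT_SEARCH_LIMIT);

    match name {
        "semantic_search" => Ok(ToolCall::SemanticSearch {
            query: required("query")?,
            limit,
        }),
        "keyword_search" => Ok(ToolCall::KeywordSearch {
            query: required("query")?,
            limit,
        }),
        "list_documents" => Ok(ToolCall::ListDocuments),
        "get_document" => Ok(ToolCall::GetDocument {
            document_id: required("document_id")?,
        }),
        other => Err(format!("unknown tool `{}`", other)),
    }
}
```

Calling `parse_tool_call("semantic_search", Some("rust"), None)` in this sketch yields a semantic search with a limit of 10, while an unrecognized tool name produces an error instead of a silent no-op.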

## Current Limits

- LanceDB and ORT are the only built-in backend/provider pair.
- Document metadata and metadata filters are not implemented yet.
- LanceDB writes are prevalidated and partially cleaned up on insert failure, but
  they are not fully transactional.
- Vector indexes are created only after enough rows exist for LanceDB's automatic
  vector index training. Errors from creating the keyword index, and from
  creating the vector index later, are returned to callers.