➤ QuorumRAG

Multi-Retriever Retrieval-Augmented Generation via Quorum Consensus

Query → Multi-Retriever Ensemble → RRF Scoring → Quorum Filtering → Evidence Clustering → LLM Generation

A research implementation of QuorumRAG — a RAG architecture that requires cross-retriever consensus before surfacing evidence, built entirely in Rust with Ollama for local LLM inference.

➤ ⚡ What is QuorumRAG?

QuorumRAG is a retrieval strategy that runs multiple independent retrievers (dense semantic search at different chunk granularities + BM25 keyword search) over the same corpus, then only surfaces evidence that achieves quorum — agreement from at least N retrievers. Evidence clusters are scored using Reciprocal Rank Fusion (RRF) before being passed to the LLM, producing answers grounded in cross-validated evidence rather than the output of a single retriever.
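The quorum idea can be illustrated with a minimal sketch. It assumes each candidate carries a hypothetical `passage_id`, so that agreement is a simple count of distinct retrievers; the actual pipeline groups candidates by embedding similarity instead. The names `Hit` and `quorum_filter` are illustrative, not part of the crate's API.

```rust
use std::collections::{HashMap, HashSet};

/// One retrieved candidate: which retriever produced it and which passage it points to.
/// `passage_id` is a hypothetical stable identifier used only for this sketch.
pub struct Hit<'a> {
    pub retriever: &'a str,
    pub passage_id: u32,
}

/// Returns `(passage_id, support)` for passages backed by at least `threshold`
/// distinct retrievers, sorted by descending support, then ascending id.
pub fn quorum_filter(hits: &[Hit<'_>], threshold: usize) -> Vec<(u32, usize)> {
    // Collect the set of retrievers that voted for each passage.
    let mut voters: HashMap<u32, HashSet<&str>> = HashMap::new();
    for hit in hits {
        voters.entry(hit.passage_id).or_default().insert(hit.retriever);
    }
    // Keep only passages whose distinct-retriever count reaches the quorum.
    let mut kept: Vec<(u32, usize)> = voters
        .into_iter()
        .map(|(id, set)| (id, set.len()))
        .filter(|&(_, support)| support >= threshold)
        .collect();
    kept.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    kept
}
```

Repeated hits from the same retriever count once, so a passage returned twice by a single retriever still has support 1 and is dropped at a threshold of 2.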

➤ ✨ Why it stands out

Quorum consensus — only evidence agreed upon by multiple retrievers reaches the LLM
RRF scoring — rank-based fusion robust to BM25/cosine scale mismatch
Overlapping chunks — 50% stride prevents answers being split at chunk boundaries
Parallel embedding — batched concurrent requests for fast corpus indexing
Embedding cache — cold start only happens once per retriever configuration
Full eval harness — baseline vs. QuorumRAG recall comparison on every run
Built entirely in Rust: the pipeline ships as a single binary with no Python runtime (Python is only used by the corpus-fetch script)
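To make the overlapping-chunk point concrete, here is a minimal sketch of a sliding-window chunker. It assumes chunk sizes are measured in whitespace-separated words, which the README does not state, and `chunk_words` is a hypothetical name rather than the crate's own function.

```rust
/// Splits `text` into windows of `chunk_size` whitespace-separated words,
/// advancing by `chunk_size - overlap` words each step.
/// Assumption: sizes are counted in words; the real unit may differ.
pub fn chunk_words(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(
        chunk_size > 0 && overlap < chunk_size,
        "overlap must be smaller than chunk_size"
    );
    let words: Vec<&str> = text.split_whitespace().collect();
    let stride = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < words.len() {
        let end = (start + chunk_size).min(words.len());
        chunks.push(words[start..end].join(" "));
        if end == words.len() {
            break; // the last window already reaches the end of the text
        }
        start += stride;
    }
    chunks
}
```

With `chunk_words("a b c d e f", 4, 2)` the result is `"a b c d"` and `"c d e f"`: the words `c d`, which sit on the boundary of the first window, appear whole in the second.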

➤ 🧠 Design Decisions

Decision	Why it matters
Reciprocal Rank Fusion	Normalizes scores across retrievers without manual scaling — 1/(k+rank) is robust and well-established (Cormack et al., 2009)
Quorum filtering	Reduces hallucination risk by requiring cross-retriever agreement before evidence reaches the LLM
Multi-granularity dense retrieval	Dense-50, Dense-100, Dense-200 capture answers at different levels of context — fine detail to broad context
BM25 as a quorum voter	Keyword retrieval as a complementary signal to semantic search; if both agree, confidence is higher
Overlapping chunks (50% stride)	Answers near chunk boundaries are captured by at least one window
Embedding cache per retriever	Avoids re-embedding thousands of chunks on every run; cache is keyed by retriever ID including chunk size and overlap
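The RRF row above can be sketched in a few lines. This is the standard summed formulation from Cormack et al., shown under assumptions: document ids are plain `u32` values, ranks are 1-based, and `rrf_fuse` is a hypothetical helper. The crate itself may combine per-retriever scores differently (the pipeline description mentions an average per cluster).

```rust
use std::collections::HashMap;

/// Fuses ranked lists with Reciprocal Rank Fusion.
/// Each inner Vec is one retriever's result list, best first; ranks are 1-based.
/// A document's fused score is the sum of 1 / (k + rank) over the lists containing it.
pub fn rrf_fuse(rankings: &[Vec<u32>], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for ranking in rankings {
        for (i, &doc) in ranking.iter().enumerate() {
            let rank = i as f64 + 1.0;
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    // Highest fused score first; ties broken by ascending id for determinism.
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the formula, never raw BM25 or cosine scores, which is why no score normalization is needed. The constant `k = 60` is the value used in the original RRF paper; whether this project uses the same default is an assumption.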

➤ 🏗️ Architecture

Query
  │
  ├─► Dense-50   (ov25)  ─┐
  ├─► Dense-100  (ov50)  ─┤  RRF Scoring
  ├─► Dense-200  (ov100) ─┤  1 / (k + rank)
  └─► BM25-100   (ov50)  ─┘
              │
              ▼
     Embedding Similarity
        Clustering (0.85)
              │
              ▼
     Quorum Filter
     (support ≥ 2 retrievers)
              │
              ▼
     Rank Clusters
     (0.7 × avg_score + 0.3 × support)
              │
              ▼
     Build Context (top 5 clusters,
     all members deduplicated by score)
              │
              ▼
     Ollama LLM Generation
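
The "Rank Clusters" step in the diagram can be sketched as follows. The diagram does not say how `support` is scaled, so this sketch assumes it is normalized to the fraction of retrievers that agreed; `Cluster`, `cluster_score`, and `top_clusters` are hypothetical names.

```rust
/// A quorum-approved evidence cluster (fields are illustrative).
pub struct Cluster {
    pub avg_score: f64, // mean RRF score of the cluster's members
    pub support: usize, // number of distinct retrievers that contributed
}

/// 0.7 × avg_score + 0.3 × support, with support scaled to [0, 1].
/// Assumption: support is divided by the total number of retrievers.
pub fn cluster_score(c: &Cluster, n_retrievers: usize) -> f64 {
    assert!(n_retrievers > 0, "at least one retriever is required");
    let support_frac = c.support as f64 / n_retrievers as f64;
    0.7 * c.avg_score + 0.3 * support_frac
}

/// Sorts clusters by descending score and keeps the best `limit` of them.
pub fn top_clusters(mut clusters: Vec<Cluster>, n_retrievers: usize, limit: usize) -> Vec<Cluster> {
    clusters.sort_by(|a, b| {
        cluster_score(b, n_retrievers).total_cmp(&cluster_score(a, n_retrievers))
    });
    clusters.truncate(limit);
    clusters
}
```

Calling `top_clusters(clusters, 4, 5)` mirrors the diagram's "top 5 clusters" with four retrievers. Because RRF scores are small, the support term has a large influence under this normalization; a different scaling would shift that balance.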

➤ 🧩 Pipeline Modules

Module	Purpose
`corpus`	Loads `.txt` files, chunks with configurable size and overlap
`embedding`	HTTP client for Ollama `nomic-embed-text` embeddings
`retrievers/dense`	Cosine similarity search over embedded chunks
`retrievers/bm25`	Tantivy-powered BM25 keyword search
`clustering`	Greedy cosine similarity clustering of candidates
`quorum`	Filters clusters below the minimum retriever support threshold
`ranking`	Scores clusters as 0.7 × average RRF score + 0.3 × retriever support
`context`	Builds the LLM context string from top-ranked clusters
`generation`	Ollama generation API client
`evaluation`	Word-overlap recall metric, baseline vs. QuorumRAG comparison
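
As an illustration of what the `clustering` module describes, here is a minimal greedy clustering sketch over embedding vectors. It assumes each cluster is represented by its first member and that a candidate joins the first cluster it matches; the crate may pick representatives differently. `cosine` and `greedy_cluster` are hypothetical names.

```rust
/// Cosine similarity of two vectors; returns 0.0 if either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Greedy single-pass clustering. Each vector joins the first cluster whose
/// representative (its first member) is at least `threshold` similar;
/// otherwise it starts a new cluster. Returns indices into `vectors`.
pub fn greedy_cluster(vectors: &[Vec<f32>], threshold: f32) -> Vec<Vec<usize>> {
    let mut clusters: Vec<Vec<usize>> = Vec::new();
    for (i, v) in vectors.iter().enumerate() {
        let home = clusters
            .iter()
            .position(|c| cosine(&vectors[c[0]], v) >= threshold);
        match home {
            Some(idx) => clusters[idx].push(i),
            None => clusters.push(vec![i]),
        }
    }
    clusters
}
```

With the configured threshold of 0.85, near-duplicate passages from different chunk granularities land in the same cluster, which is what lets the quorum filter count them as agreement.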

➤ 🚀 Quickstart

Prerequisites

Rust 1.85 or newer (the crate uses edition 2024)
Ollama running locally at http://localhost:11434
Required models pulled:

ollama pull nomic-embed-text
ollama pull mistral

1) Fetch the corpus

pip install wikipedia-api
python3 scripts/fetch_corpus.py

2) Run eval (builds embedding cache on first run)

cargo run

3) Ask a single question

cargo run -- --query "What is backpropagation?"

➤ 📦 Use as a library

Add it to your project:

[dependencies]
quorumrag = "0.1"

Build a pipeline from a Config and ask questions. The pipeline indexes the corpus (using the embedding cache when available) on build:

use quorumrag::{Config, QuorumRag};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let config: Config = toml::from_str(&std::fs::read_to_string("config.toml")?)?;
    let rag = QuorumRag::build(config).await?;

    // Full RAG: quorum retrieval + generation.
    let answer = rag.answer("What is backpropagation?").await?;
    println!("{answer}");

    // Or inspect the evidence yourself before generating.
    let result = rag.retrieve("What is backpropagation?", true).await?;
    println!("support={}, clusters={}", result.max_support, result.clusters.len());
    Ok(())
}

Config exposes corpus_dir, cache_dir, the embedding model, the RRF constant, and the ranking weights, each with a sensible default, so the library makes no assumptions about your working directory.

➤ ⚙️ Configuration

config.toml controls the full pipeline:

quorum_threshold = 2      # minimum retrievers that must agree
top_k = 15                # candidates per retriever per query
cluster_threshold = 0.85  # cosine similarity threshold for clustering

[[retrievers]]
retriever_type = "dense"
chunk_size = 50
overlap = 25

[[retrievers]]
retriever_type = "dense"
chunk_size = 100
overlap = 50

[[retrievers]]
retriever_type = "dense"
chunk_size = 200
overlap = 100

[[retrievers]]
retriever_type = "bm25"
chunk_size = 100
overlap = 50

[ollama]
url = "http://localhost:11434"
model = "mistral"
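
Each `[[retrievers]]` entry above also determines which embedding cache is reused. A minimal sketch of how a cache key could be derived from the entry is shown below; the id format (`dense-50-ov25`) and the names `RetrieverSpec`, `retriever_id`, and `cache_path` are assumptions for illustration, not the crate's actual scheme.

```rust
use std::path::{Path, PathBuf};

/// Retriever families from the config's `retriever_type` field.
pub enum RetrieverKind {
    Dense,
    Bm25,
}

/// One `[[retrievers]]` entry (hypothetical mirror of the config table).
pub struct RetrieverSpec {
    pub kind: RetrieverKind,
    pub chunk_size: usize,
    pub overlap: usize,
}

/// Builds an id that changes whenever chunking changes, so a stale cache
/// is never reused for a different chunk size or overlap.
pub fn retriever_id(spec: &RetrieverSpec) -> String {
    let kind = match spec.kind {
        RetrieverKind::Dense => "dense",
        RetrieverKind::Bm25 => "bm25",
    };
    format!("{}-{}-ov{}", kind, spec.chunk_size, spec.overlap)
}

/// Location of the cache entry for this retriever inside `cache_dir`.
pub fn cache_path(cache_dir: &Path, spec: &RetrieverSpec) -> PathBuf {
    cache_dir.join(retriever_id(spec))
}
```

Under this scheme, changing `overlap` from 25 to 10 for the first retriever produces a new id, so its chunks are re-embedded once while the other retrievers keep their caches.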

➤ 📊 Eval Results

Evaluated on 20 Wikipedia-based Q&A pairs across CS and ML topics.

System	Recall
Baseline (Dense-50 only)	14 / 20
QuorumRAG (4 retrievers)	19 / 20

QuorumRAG additionally provides a support score (1–4) on every answer, indicating how many independent retrievers agreed on the evidence — a confidence signal the baseline cannot produce.
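
For readers who want to reproduce a score of this shape, here is a minimal sketch of a word-overlap recall metric. The tokenization (lowercased alphanumeric words) and the pass threshold passed to `count_hits` are assumptions; the README does not specify how the eval harness decides that a question counts as answered.

```rust
use std::collections::HashSet;

/// Lowercased alphanumeric words of `text`, as a set.
fn tokens(text: &str) -> HashSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Fraction of distinct reference words that also occur in the answer.
pub fn word_recall(reference: &str, answer: &str) -> f64 {
    let reference_words = tokens(reference);
    if reference_words.is_empty() {
        return 0.0;
    }
    let answer_words = tokens(answer);
    let hits = reference_words
        .iter()
        .filter(|w| answer_words.contains(*w))
        .count();
    hits as f64 / reference_words.len() as f64
}

/// Counts (reference, answer) pairs whose recall reaches `min_recall`.
/// `min_recall` is a hypothetical threshold, not a documented setting.
pub fn count_hits(pairs: &[(&str, &str)], min_recall: f64) -> usize {
    pairs
        .iter()
        .filter(|pair| word_recall(pair.0, pair.1) >= min_recall)
        .count()
}
```

Running `count_hits` over the 20 Q&A pairs once with baseline answers and once with QuorumRAG answers would give two counts in the same "n / 20" form as the table above.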

➤ 🧠 Tech Stack

Rust (edition 2024) — entire implementation
Tokio — async runtime, parallel embedding
Tantivy — BM25 full-text search index
Ollama — local LLM inference (nomic-embed-text, mistral)
Serde / TOML — config and cache serialization
Futures — batched concurrent HTTP embedding

➤ 🛣️ What's next

Confidence-weighted answer generation using support scores
Streaming LLM responses for interactive mode
PyO3 bindings to expose the Rust core to a Python research harness
Additional retrievers (TF-IDF, hybrid sparse-dense)
Ablation study tooling (sweep quorum threshold, chunk size, cluster threshold)
REST API mode for integration with external frontends

➤ Authors

Riad Mukhtarov

➤ License

Licensed under either of MIT or Apache-2.0 at your option.

quorumrag 0.1.0