Pure-Rust Retrieval-Augmented Generation Pipeline
SIMD-accelerated RAG pipeline built on Trueno compute primitives. Part of the Sovereign AI Stack.
Features
- Pure Rust - Zero Python/C++ dependencies
- Chunking - Recursive, Fixed, Sentence, Paragraph, Semantic, Structural
- Hybrid Retrieval - Dense (vector) + Sparse (BM25) search
- Fusion - RRF, Linear, DBSF, Convex, Union, Intersection
- Reranking - Lexical, cross-encoder, and composite rerankers
- Metrics - Recall, Precision, MRR, NDCG, MAP
- Semantic Embeddings - Production ONNX models via FastEmbed (optional)
- Nemotron Embeddings - NVIDIA Embed Nemotron 8B via GGUF (optional)
- Index Compression - LZ4/ZSTD compressed persistence (optional)
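Reciprocal Rank Fusion (RRF), one of the fusion strategies listed above, merges ranked lists by summing reciprocal-rank scores: `score(d) = Σ 1/(k + rank(d))` over all lists, with ranks starting at 1 and `k` conventionally 60. A self-contained sketch of the idea, independent of trueno-rag's API (the `rrf` function and document IDs are hypothetical):

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
/// where rank starts at 1. Documents appearing high in several lists win.
fn rrf(lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in lists {
        for (i, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + (i as f64 + 1.0));
        }
    }
    // Sort descending by fused score.
    let mut ranked: Vec<(String, f64)> = scores.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked
}

fn main() {
    let dense = vec!["a", "b", "c"]; // e.g. vector-search ranking
    let sparse = vec!["b", "c", "a"]; // e.g. BM25 ranking
    let fused = rrf(&[dense, sparse], 60.0);
    // "b" is ranked 2nd and 1st, beating "a" (1st and 3rd): 1/62 + 1/61 > 1/61 + 1/63.
    assert_eq!(fused[0].0, "b");
}
```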
Installation
```toml
[dependencies]
trueno-rag = "0.1.8"
```
Usage
```rust
// NOTE: the identifiers below are reconstructed from context; the component
// arguments were elided in the original and are shown as placeholders.
use trueno_rag::{Document, RagPipelineBuilder};

let mut pipeline = RagPipelineBuilder::new()
    .chunker(/* chunking strategy */)
    .embedder(/* embedding backend */)
    .reranker(/* reranker */)
    .fusion(/* fusion strategy */)
    .build()?;

let doc = Document::new(/* text */).with_title(/* title */);
pipeline.index_document(doc)?;

// Binding name illustrative; the original pattern was elided.
let results = pipeline.query_with_context(/* query */)?;
```
Examples
```sh
# Basic examples (example names were elided in the original)
cargo run --example <example>

# With semantic embeddings (downloads ~90MB ONNX model on first run)
cargo run --example <example> --features embeddings

# With compressed index persistence
cargo run --example <example> --features compression

# With NVIDIA Nemotron embeddings (requires GGUF model file)
NEMOTRON_MODEL_PATH=/path/to/model.gguf cargo run --example <example> --features nemotron
```
API Reference
Semantic Embeddings (FastEmbed)
Production-quality vector embeddings via FastEmbed (ONNX Runtime):
```toml
[dependencies]
trueno-rag = { version = "0.1.8", features = ["embeddings"] }
```

```rust
// Type and method names reconstructed; consult the crate docs for exact APIs.
use trueno_rag::FastEmbedder;

let embedder = FastEmbedder::new(/* model selection */)?;
let embedding = embedder.embed(/* text */)?;
// 384-dimensional embeddings
```
Available models:
- `AllMiniLmL6V2` - Fast, 384 dims (default)
- `AllMiniLmL12V2` - Better quality, 384 dims
- `BgeSmallEnV15` - Balanced, 384 dims
- `BgeBaseEnV15` - Higher quality, 768 dims
- `NomicEmbedTextV1` - Retrieval optimized, 768 dims
NVIDIA Embed Nemotron 8B
High-quality 4096-dimensional embeddings via GGUF model inference:
```toml
[dependencies]
trueno-rag = { version = "0.1.8", features = ["nemotron"] }
```

```rust
// Identifiers reconstructed; exact names and arguments may differ.
use trueno_rag::NemotronConfig;

let config = NemotronConfig::new(/* GGUF model path */)
    .with_gpu(/* enable GPU offload */)
    .with_normalize(/* L2-normalize outputs */);
let embedder = NemotronEmbedder::new(config)?;

// Asymmetric retrieval - different prefixes for queries vs documents
let query_emb = embedder.embed_query(/* query text */)?;
let doc_emb = embedder.embed_document(/* document text */)?;
```
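Asymmetric retrieval models encode queries and documents differently, commonly by prepending role-specific instruction prefixes before embedding. A minimal sketch of the idea, independent of trueno-rag's API; the literal prefix strings here are generic illustrations, not Nemotron's documented format:

```rust
/// Role of a text in asymmetric retrieval.
enum Role {
    Query,
    Document,
}

/// Prepend a role-specific instruction prefix before embedding.
/// The literal prefixes are illustrative, not Nemotron's documented format.
fn prefixed(role: Role, text: &str) -> String {
    match role {
        Role::Query => format!("query: {text}"),
        Role::Document => format!("passage: {text}"),
    }
}

fn main() {
    // Query and document sides of the same pair get different prefixes,
    // steering the model toward asymmetric similarity.
    assert_eq!(prefixed(Role::Query, "rust rag"), "query: rust rag");
    assert_eq!(prefixed(Role::Document, "rust rag"), "passage: rust rag");
}
```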
Index Compression
LZ4/ZSTD compressed index persistence:
```toml
[dependencies]
trueno-rag = { version = "0.1.8", features = ["compression"] }
```

```rust
// The `use` path was elided in the original; `index` is a previously built index.
let bytes = index.to_compressed_bytes(/* codec / level */)?;
// 4-6x compression ratio
```
Architecture
```text
┌─────────────────────────────────────────────┐
│ RAG Pipeline API │
│ (RagPipelineBuilder, query) │
├──────────┬──────────┬───────────────────────┤
│ Chunking │ Embedding│ Retrieval │
│ (6 modes)│ (ONNX/ │ (Dense + BM25) │
│ │ GGUF) │ │
├──────────┴──────────┴───────────────────────┤
│ Fusion & Reranking │
│ (RRF, Linear, DBSF, Lexical, Cross-Enc) │
├─────────────────────────────────────────────┤
│ Storage & Indexing │
│ (BM25 inverted index, vector store, SQLite) │
├─────────────────────────────────────────────┤
│ Trueno SIMD Compute Primitives │
└─────────────────────────────────────────────┘
```
- Chunking Layer: Recursive, Fixed, Sentence, Paragraph, Semantic, and Structural chunkers
- Embedding Layer: Mock (testing), FastEmbed (ONNX), Nemotron (GGUF) embedders
- Retrieval Layer: Dense vector similarity + BM25 sparse retrieval with hybrid fusion
- Fusion/Reranking: RRF, Linear, DBSF, Convex combination; lexical and cross-encoder rerankers
- Storage: In-memory BM25 index with optional LZ4/ZSTD persistence and SQLite backend
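The dense side of the retrieval layer scores chunks by vector similarity; in trueno-rag this is dispatched to Trueno's SIMD primitives, but the underlying operation is plain cosine similarity. A scalar (non-SIMD) sketch for reference:

```rust
/// Cosine similarity between two equal-length embedding vectors:
/// dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero inputs.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Identical direction -> 1.0; orthogonal -> 0.0.
    assert!((cosine(&[1.0, 2.0], &[2.0, 4.0]) - 1.0).abs() < 1e-6);
    assert!(cosine(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```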
Testing
Property-based tests cover chunking boundary conditions, BM25 scoring invariants, and fusion correctness.
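As an illustration of the kind of BM25 invariant such tests can check (this is a standalone sketch, not trueno-rag's test code): for fixed IDF and document length, the per-term BM25 score must be monotonically non-decreasing in term frequency and bounded above by `idf * (k1 + 1)`.

```rust
/// Per-term BM25 score with the standard k1/b parameterization.
fn bm25_term(tf: f64, idf: f64, dl: f64, avgdl: f64, k1: f64, b: f64) -> f64 {
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * dl / avgdl))
}

fn main() {
    // Invariants: monotone in tf, and saturating at idf * (k1 + 1).
    let (idf, dl, avgdl, k1, b) = (2.0, 100.0, 100.0, 1.2, 0.75);
    let mut prev = 0.0;
    for tf in 1..1000 {
        let s = bm25_term(tf as f64, idf, dl, avgdl, k1, b);
        assert!(s >= prev, "BM25 must not decrease as tf grows");
        assert!(s <= idf * (k1 + 1.0), "BM25 is bounded by idf * (k1 + 1)");
        prev = s;
    }
}
```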
Stack Dependencies
trueno-rag is part of the Sovereign AI Stack:
| Crate | Version | Purpose |
|---|---|---|
| trueno | 0.11 | SIMD/GPU compute primitives |
| trueno-db | 0.3.10 | GPU-first analytics database |
| realizar | 0.5.1 | GGUF/APR model inference |
| fastembed | 5.x | ONNX embeddings |
Development
Documentation
Contributing
Contributions are welcome! Please see the CONTRIBUTING.md guide for details.
MSRV
Minimum Supported Rust Version: 1.75
License
MIT