# Dakera Inference Engine
Embedded inference engine for generating vector embeddings locally without external API calls. This crate provides:
- **Local Embedding Generation**: Generate embeddings using state-of-the-art transformer models running locally on CPU or GPU.
- **Multiple Model Support**: Choose from MiniLM (fast), BGE (balanced), or E5 (quality).
- **Batch Processing**: Automatic batching and parallelization for high-throughput workloads.
- **Zero External Dependencies**: No OpenAI, Cohere, or other API keys required.
## Quick Start
```rust
use inference::{EmbeddingEngine, ModelConfig, EmbeddingModel};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an engine with the default model (MiniLM)
    let engine = EmbeddingEngine::new(ModelConfig::default()).await?;

    // Embed a query
    let query_embedding = engine.embed_query("What is machine learning?").await?;
    println!("Query embedding: {} dimensions", query_embedding.len());

    // Embed documents
    let docs = vec![
        "Machine learning is a type of artificial intelligence.".to_string(),
        "Deep learning uses neural networks with many layers.".to_string(),
    ];
    let doc_embeddings = engine.embed_documents(&docs).await?;
    println!("Generated {} document embeddings", doc_embeddings.len());

    Ok(())
}
```
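Since the engine returns L2-normalized vectors (see the Architecture section), cosine similarity between a query and a document reduces to a plain dot product. The sketch below ranks documents against a query using toy 4-dimensional vectors as stand-ins for the engine's 384-dimensional output:

```rust
// Embeddings come back L2-normalized, so cosine similarity reduces to a
// plain dot product. Toy 4-dim unit vectors stand in for real 384-dim
// engine output here.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let query = vec![0.6, 0.8, 0.0, 0.0];
    let docs = vec![
        vec![0.8, 0.6, 0.0, 0.0], // points in a similar direction to the query
        vec![0.0, 0.0, 1.0, 0.0], // orthogonal to the query
    ];

    // Rank documents by similarity to the query, highest first.
    let mut ranked: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, dot(&query, d)))
        .collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    println!("best match: doc {} (score {:.2})", ranked[0].0, ranked[0].1);
}
```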
## Model Selection
Choose the right model for your use case:
| Model | Speed | Quality | Use Case |
|---|---|---|---|
| MiniLM | ⚡⚡⚡ | ⭐⭐ | High-throughput, real-time |
| BGE-small | ⚡⚡ | ⭐⭐⭐ | Balanced performance |
| E5-small | ⚡⚡ | ⭐⭐⭐ | Best quality for retrieval |
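One practical difference between these models: the E5 family expects role prefixes ("query: " / "passage: ") on its input text, which is presumably what the "prefixes" step inside the engine's BatchProcessor handles. The helper below illustrates the convention; it is not the crate's actual API:

```rust
// E5-family models distinguish queries from passages via text prefixes.
// Illustrative helper only -- the engine's BatchProcessor presumably
// applies these internally.
fn e5_prefix(text: &str, is_query: bool) -> String {
    if is_query {
        format!("query: {text}")
    } else {
        format!("passage: {text}")
    }
}

fn main() {
    println!("{}", e5_prefix("What is machine learning?", true));
    println!("{}", e5_prefix("Machine learning is a type of AI.", false));
}
```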
## GPU Acceleration
Enable GPU acceleration by building with the appropriate feature:
```toml
# For NVIDIA GPUs
inference = { path = "crates/inference", features = ["cuda"] }

# For Apple Silicon
inference = { path = "crates/inference", features = ["metal"] }
```
## Architecture
```text
┌─────────────────────────────────────────────────────────────┐
│                       EmbeddingEngine                       │
│  ┌─────────────┐  ┌───────────────┐  ┌──────────────────┐   │
│  │ ModelConfig │  │ BatchProcessor│  │   ort::Session   │   │
│  │ - model     │  │ - tokenizer   │  │ (ONNX Runtime)   │   │
│  │ - threads   │  │ - batching    │  │ - BERT INT8      │   │
│  │ - batch_sz  │  │ - prefixes    │  │ - mean_pool()    │   │
│  └─────────────┘  └───────────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
               ┌───────────────────────────────┐
               │      Vec<f32> Embeddings      │
               │    (normalized, 384 dims)     │
               └───────────────────────────────┘
```
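The `mean_pool()` and normalization steps named in the diagram follow the usual sentence-transformers recipe: mask-aware mean pooling over the model's token embeddings, then L2 normalization. A self-contained sketch with toy sizes (3 tokens, 4 dims standing in for seq_len × 384); the crate's internals may differ in detail:

```rust
// Mask-aware mean pooling: average only the token embeddings whose
// attention-mask entry is 1, so padding tokens don't distort the result.
fn mean_pool(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let dims = token_embeddings[0].len();
    let mut sum = vec![0.0f32; dims];
    let mut count = 0.0f32;
    for (tok, &m) in token_embeddings.iter().zip(attention_mask) {
        if m == 1 {
            for (s, v) in sum.iter_mut().zip(tok) {
                *s += v;
            }
            count += 1.0;
        }
    }
    sum.iter().map(|s| s / count).collect()
}

// L2 normalization: scale the vector to unit length so cosine similarity
// becomes a dot product.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

fn main() {
    let tokens = vec![
        vec![1.0, 0.0, 2.0, 0.0],
        vec![3.0, 0.0, 0.0, 0.0],
        vec![9.0, 9.0, 9.0, 9.0], // padding token, masked out below
    ];
    let mask = [1, 1, 0];
    let pooled = mean_pool(&tokens, &mask); // [2.0, 0.0, 1.0, 0.0]
    let embedding = l2_normalize(&pooled);
    let norm: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
    println!("embedding = {embedding:?}, norm = {norm}");
}
```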