Mullama
Comprehensive Rust bindings for llama.cpp with advanced integration features
Mullama provides memory-safe Rust bindings for llama.cpp with production-ready features including async/await support, real-time streaming, multimodal processing, and web framework integration.
Why Mullama?
Most llama.cpp Rust bindings expose low-level C APIs directly. Mullama provides an idiomatic Rust experience:
```rust
// Other wrappers: manual memory management, raw pointers, verbose setup
let params = llama_context_default_params();
let ctx = unsafe { llama_new_context_with_model(model, params) };
let tokens = unsafe { llama_tokenize(model, prompt, /* ... */) };
// Don't forget to free everything...

// Mullama: builder patterns, async/await, automatic resource management
// (type and argument names are illustrative)
let model = ModelBuilder::new()
    .path("model.gguf")
    .gpu_layers(32)
    .build()
    .await?;
let response = model.generate("Hello!").await?;
```
Developer experience improvements:
| Feature | Other Wrappers | Mullama |
|---|---|---|
| API Style | Raw FFI / C-like | Builder patterns, fluent API |
| Async Support | Manual threading | Native async/await with Tokio |
| Error Handling | Error codes / panics | Result<T, MullamaError> with context |
| Memory Management | Manual free/cleanup | Automatic RAII |
| Streaming | Callbacks | Stream trait, async iterators |
| Configuration | Struct fields | Type-safe builders with validation |
| Web Integration | DIY | Built-in Axum routes |
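As an illustration of the streaming row above, token streams are consumed with ordinary `Stream` combinators rather than callbacks. A minimal sketch, assuming a `generate_stream` method that yields token results (the method and item types here are assumptions, not the verified API):

```rust
use futures_util::StreamExt; // async iteration over the token stream

// Assumes `model` was built as in the snippet above and that
// `generate_stream` returns an async `Stream` of token results.
let mut stream = model.generate_stream("Tell me a story").await?;
while let Some(token) = stream.next().await {
    print!("{}", token?); // each token arrives as soon as it is generated
}
```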
Key Features
- Async/Await Native - Full Tokio integration for non-blocking operations
- Real-time Streaming - Token-by-token generation with backpressure handling
- Multimodal Processing - Text, image, and audio in a single pipeline
- Late Interaction / ColBERT - Multi-vector embeddings with MaxSim scoring for retrieval
- Web Framework Ready - Direct Axum integration with REST APIs
- WebSocket Support - Real-time bidirectional communication
- Parallel Processing - Work-stealing parallelism for batch operations
- GPU Acceleration - CUDA, Metal, ROCm, and OpenCL support
- Memory Safe - Zero unsafe operations in public API
Quick Start
Installation
```toml
[dependencies]
mullama = "0.1.1"

# With all features
mullama = { version = "0.1.1", features = ["full"] }
```
Prerequisites
Linux (Ubuntu/Debian): Install a C/C++ toolchain and CMake (e.g. `sudo apt install build-essential cmake`).
macOS: Install the Xcode Command Line Tools (`xcode-select --install`) and CMake (e.g. `brew install cmake`).
Windows: Install Visual Studio Build Tools and CMake.
See Platform Setup Guide for detailed instructions.
Basic Example
```rust
use mullama::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Type names are illustrative and follow the builder API shown above.
    let model = ModelBuilder::new()
        .path("model.gguf")
        .build()
        .await?;
    let response = model.generate("Explain Rust ownership in one sentence.").await?;
    println!("{response}");
    Ok(())
}
```
Feature Flags
```toml
[dependencies.mullama]
version = "0.1.1"
features = [
    "async",              # Async/await support
    "streaming",          # Token streaming
    "web",                # Axum web framework
    "websockets",         # WebSocket support
    "multimodal",         # Image and audio processing
    "streaming-audio",    # Real-time audio capture
    "format-conversion",  # Audio/image format conversion
    "parallel",           # Rayon parallel processing
    "late-interaction",   # ColBERT-style multi-vector embeddings
    "daemon",             # Daemon mode with TUI client
    "full",               # All features
]
```
Common Combinations
```toml
# Web applications
features = ["web", "websockets", "async", "streaming"]

# Multimodal AI
features = ["multimodal", "streaming-audio", "format-conversion"]

# High-performance batch processing
features = ["parallel", "async"]

# Semantic search / RAG with ColBERT-style retrieval
features = ["late-interaction", "parallel"]

# Daemon with TUI chat interface
features = ["daemon"]
```
Daemon Mode
Mullama includes a multi-model daemon with an OpenAI-compatible HTTP API and a TUI client. The CLI covers the full workflow:
- Building the CLI binary
- Starting the daemon with a local model, or with a HuggingFace model (auto-downloaded and cached)
- Serving multiple models under custom aliases
- Interactive TUI chat and one-shot generation
- Model management, HuggingFace model search, and cache management
- Calling the daemon through the OpenAI-compatible API
HuggingFace Model Format
```text
hf:<owner>/<repo>:<filename>   # Specific file
hf:<owner>/<repo>              # Auto-detect best GGUF
<alias>:hf:<owner>/<repo>      # With custom alias
```
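For example, `hf:LiquidAI/LFM2-ColBERT-350M-GGUF` pulls the ColBERT model recommended in the Late Interaction section below and lets the daemon auto-detect the best GGUF file.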
Environment Variables
| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace token for gated/private models |
| `MULLAMA_CACHE_DIR` | Override default cache directory |
Cache Locations (Cross-Platform)
| Platform | Default Location |
|---|---|
| Linux | $XDG_CACHE_HOME/mullama/models or ~/.cache/mullama/models |
| macOS | ~/Library/Caches/mullama/models |
| Windows | %LOCALAPPDATA%\mullama\models |
Architecture:
```text
                                   ┌──────────────────────────────────┐
                                   │              Daemon              │
┌─────────────┐                    │  ┌────────────────────────────┐  │
│ TUI Client  │◄── nng (IPC) ─────►│  │       Model Manager        │  │
└─────────────┘                    │  │  ┌───────┐  ┌───────┐      │  │
                                   │  │  │Model 1│  │Model 2│ ...  │  │
┌─────────────┐                    │  │  └───────┘  └───────┘      │  │
│  curl/app   │◄── HTTP/REST ─────►│  └────────────────────────────┘  │
└─────────────┘    (OpenAI API)    │                                  │
                                   │  Endpoints:                      │
┌─────────────┐                    │    • /v1/chat/completions        │
│ Other Client│◄── nng (IPC) ─────►│    • /v1/completions             │
└─────────────┘                    │    • /v1/models                  │
                                   │    • /v1/embeddings              │
                                   └──────────────────────────────────┘
```
Programmatic usage:
```rust
use mullama::daemon::*; // module path illustrative

// Connect as client (type and argument names are illustrative; see the API docs)
let client = DaemonClient::connect_default()?;
let result = client.chat("llama3", "Hello!")?;
println!("{result}");

// List models
for model in client.list_models()? {
    println!("{model:?}");
}
```
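Since the daemon's HTTP endpoints follow the OpenAI wire format, any OpenAI-compatible client can also call it over HTTP. A hedged sketch using `reqwest` and `serde_json` (the address, port, and model alias are assumptions; substitute whatever your daemon actually serves):

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Address and model alias are placeholders for illustration.
    // Requires reqwest with its "json" feature plus serde_json.
    let response: serde_json::Value = reqwest::Client::new()
        .post("http://127.0.0.1:8080/v1/chat/completions")
        .json(&json!({
            "model": "llama3",
            "messages": [{ "role": "user", "content": "Hello!" }]
        }))
        .send()
        .await?
        .json()
        .await?;

    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}
```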
Late Interaction / ColBERT
Mullama supports ColBERT-style late interaction retrieval with multi-vector embeddings. Unlike traditional embeddings that pool all tokens into a single vector, late interaction preserves per-token embeddings for fine-grained matching using MaxSim scoring.
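To make the scoring concrete: MaxSim takes, for each query token embedding, the best similarity against any document token embedding, then sums those maxima over the query. A minimal standalone sketch (independent of Mullama's own `max_sim`), assuming L2-normalized vectors so that the dot product is cosine similarity:

```rust
/// MaxSim over per-token embeddings: for every query vector take the maximum
/// dot product against all document vectors, then sum over the query tokens.
/// Assumes `doc` is non-empty and all vectors share the same dimension.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Mullama's `max_sim` and `find_top_k` apply this scoring to the multi-vector embeddings produced by the generator shown below.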
```rust
// Type names below are illustrative; the method names match the crate's API.
use mullama::late_interaction::*;
use std::sync::Arc;

// Create generator (works with any embedding model)
let model = Arc::new(Model::load("lfm2-colbert-350m.gguf")?);
let config = MultiVectorConfig::default()
    .normalize(true)
    .skip_special_tokens(true);
let mut generator = MultiVectorGenerator::new(model, config)?;

// Generate multi-vector embeddings
let query = generator.embed_text("What is late interaction?")?;
let doc = generator.embed_text("ColBERT scores per-token embeddings with MaxSim.")?;

// Score with MaxSim
let score = max_sim(&query, &doc);

// Top-k retrieval
let documents: Vec<_> = texts
    .iter()
    .map(|text| generator.embed_text(text))
    .collect::<Result<Vec<_>, _>>()?;
let top_k = find_top_k(&query, &documents, 10);
```
With parallel processing:
```rust
// Enable both features: ["late-interaction", "parallel"]
// (arguments follow the example above)
let top_k = find_top_k_parallel(&query, &documents, 10);
let scores = batch_score_parallel(&query, &documents);
```
Recommended models:
- `LiquidAI/LFM2-ColBERT-350M-GGUF` - Purpose-trained ColBERT model
- Any GGUF embedding model (works but suboptimal for retrieval)
GPU Acceleration
GPU backends are selected at build time. Mullama supports NVIDIA CUDA, Apple Metal (macOS), AMD ROCm (Linux), and Intel OpenCL; see the GPU Guide for the required toolchains and the exact build flags for each backend.
Documentation
| Document | Description |
|---|---|
| Getting Started | Installation and first application |
| Platform Setup | OS-specific setup instructions |
| Features Guide | Integration features overview |
| Use Cases | Real-world application examples |
| API Reference | Complete API documentation |
| Sampling Guide | Sampling strategies and configuration |
| GPU Guide | GPU acceleration setup |
| Feature Status | Implementation status and roadmap |
Examples
The repository ships runnable examples (run with `cargo run --example <name>`, enabling the relevant features):
- Basic text generation
- Streaming responses
- Web service
- Audio processing
- Late interaction / ColBERT retrieval
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
llama.cpp Compatibility
Mullama tracks upstream llama.cpp releases:
| Mullama Version | llama.cpp Version | Release Date |
|---|---|---|
| 0.1.x | b7542 | Dec 2025 |
Supported Model Architectures
All architectures supported by llama.cpp b7542, including:
- LLaMA 1/2/3, Mistral, Mixtral, Phi-1/2/3/4
- Qwen, Qwen2, DeepSeek, Yi, Gemma
- And many more