# Mullama
**Comprehensive Rust bindings for llama.cpp with advanced integration features**
[Crates.io](https://crates.io/crates/mullama) · [Documentation](https://docs.rs/mullama) · [License](LICENSE)
Mullama provides memory-safe Rust bindings for llama.cpp with production-ready features including async/await support, real-time streaming, multimodal processing, and web framework integration.
## Why Mullama?
Most llama.cpp Rust bindings expose low-level C APIs directly. Mullama provides an **idiomatic Rust experience**:
```rust
// Other wrappers: manual memory management, raw pointers, verbose setup
let params = llama_context_default_params();
let ctx = unsafe { llama_new_context_with_model(model, params) };
let tokens = unsafe { llama_tokenize(model, text.as_ptr(), ...) };
// Don't forget to free everything...

// Mullama: builder patterns, async/await, automatic resource management
let model = ModelBuilder::new()
    .path("model.gguf")
    .gpu_layers(35)
    .build().await?;
let response = model.generate("Hello", 100).await?;
```
**Developer experience improvements:**

| Aspect | Raw llama.cpp bindings | Mullama |
|---|---|---|
| API Style | Raw FFI / C-like | Builder patterns, fluent API |
| Async Support | Manual threading | Native async/await with Tokio |
| Error Handling | Error codes / panics | `Result<T, MullamaError>` with context |
| Memory Management | Manual free/cleanup | Automatic RAII |
| Streaming | Callbacks | `Stream` trait, async iterators |
| Configuration | Struct fields | Type-safe builders with validation |
| Web Integration | DIY | Built-in Axum routes |
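To illustrate the error-handling row: fallible operations return `Result<T, MullamaError>`, so failures are values you match on rather than aborts. A minimal sketch using the builder API shown above:

```rust
use mullama::prelude::*;

#[tokio::main]
async fn main() {
    // A missing or corrupt model file comes back as an Err value, not a panic.
    match ModelBuilder::new().path("missing.gguf").build().await {
        Ok(_model) => println!("model loaded"),
        Err(e) => eprintln!("failed to load model: {e}"),
    }
}
```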
## Key Features
- **Async/Await Native** - Full Tokio integration for non-blocking operations
- **Real-time Streaming** - Token-by-token generation with backpressure handling (see the sketch after this list)
- **Multimodal Processing** - Text, image, and audio in a single pipeline
- **Late Interaction / ColBERT** - Multi-vector embeddings with MaxSim scoring for retrieval
- **Web Framework Ready** - Direct Axum integration with REST APIs
- **WebSocket Support** - Real-time bidirectional communication
- **Parallel Processing** - Work-stealing parallelism for batch operations
- **GPU Acceleration** - CUDA, Metal, ROCm, and OpenCL support
- **Memory Safe** - No `unsafe` operations in the public API
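For the streaming feature, a minimal sketch of consuming tokens as an async `Stream` (the method and item names here are illustrative, not the crate's exact API; see `examples/streaming_generation.rs` for the real surface):

```rust
use futures::StreamExt; // from the `futures` crate
use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new().path("model.gguf").build().await?;

    // `generate_stream` is a hypothetical name for illustration.
    // Each stream item is assumed to be a Result<String, MullamaError>.
    let mut stream = model.generate_stream("Once upon a time", 100).await?;
    while let Some(token) = stream.next().await {
        print!("{}", token?);
    }
    Ok(())
}
```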
## Quick Start
### Installation
```toml
[dependencies]
mullama = "0.1.1"
# With all features
mullama = { version = "0.1.1", features = ["full"] }
```
### Prerequisites
**Linux (Ubuntu/Debian):**
```bash
sudo apt install -y build-essential cmake pkg-config libasound2-dev libpulse-dev
```
**macOS:**
```bash
brew install cmake pkg-config portaudio
```
**Windows:** Install [Visual Studio Build Tools](https://visualstudio.microsoft.com/downloads/) and [CMake](https://cmake.org/download/).
See [Platform Setup Guide](./docs/PLATFORM_SETUP.md) for detailed instructions.
### Basic Example
```rust
use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("model.gguf")
        .context_size(4096)
        .build().await?;

    let response = model.generate("The future of AI is", 100).await?;
    println!("{}", response);
    Ok(())
}
```
## Feature Flags
```toml
[dependencies.mullama]
version = "0.1.1"
features = [
    "async",              # Async/await support
    "streaming",          # Token streaming
    "web",                # Axum web framework
    "websockets",         # WebSocket support
    "multimodal",         # Image and audio processing
    "streaming-audio",    # Real-time audio capture
    "format-conversion",  # Audio/image format conversion
    "parallel",           # Rayon parallel processing
    "late-interaction",   # ColBERT-style multi-vector embeddings
    "daemon",             # Daemon mode with TUI client
    "full",               # All features
]
```
### Common Combinations
```toml
# Web applications
features = ["web", "websockets", "async", "streaming"]
# Multimodal AI
features = ["multimodal", "streaming-audio", "format-conversion"]
# High-performance batch processing
features = ["parallel", "async"]
# Semantic search / RAG with ColBERT-style retrieval
features = ["late-interaction", "parallel"]
# Daemon with TUI chat interface
features = ["daemon"]
```
## Daemon Mode
Mullama includes a multi-model daemon with an OpenAI-compatible HTTP API and a TUI client:
```bash
# Build the CLI
cargo build --release --features daemon

# Start daemon with a local model
mullama serve --model llama:./llama.gguf

# Start with a HuggingFace model (auto-downloads and caches)
mullama serve --model hf:TheBloke/Llama-2-7B-GGUF

# Multiple models with custom aliases
mullama serve \
    --model llama:hf:TheBloke/Llama-2-7B-GGUF:llama-2-7b.Q4_K_M.gguf \
    --model mistral:hf:TheBloke/Mistral-7B-v0.1-GGUF

# Interactive TUI chat
mullama chat

# One-shot generation
mullama run "What is the meaning of life?"

# Model management
mullama models                  # List loaded models
mullama load phi:./phi.gguf     # Load a model
mullama unload phi              # Unload a model
mullama default llama           # Set default model

# Search for models on HuggingFace
mullama search "llama 7b"               # Search GGUF models
mullama search "mistral" --files        # Show available files
mullama search "phi" --all              # Include non-GGUF models
mullama info TheBloke/Llama-2-7B-GGUF   # Show repo details

# Cache management
mullama pull hf:TheBloke/Llama-2-7B-GGUF  # Pre-download model
mullama cache list                        # List cached models
mullama cache size                        # Show cache size
mullama cache clear                       # Clear cache

# Use the OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello!"}]}'
```
### HuggingFace Model Format
```
hf:<owner>/<repo>:<filename> # Specific file
hf:<owner>/<repo> # Auto-detect best GGUF
<alias>:hf:<owner>/<repo> # With custom alias
```
### Environment Variables
| Variable | Purpose |
|---|---|
| `HF_TOKEN` | HuggingFace token for gated/private models |
| `MULLAMA_CACHE_DIR` | Override default cache directory |
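For example, to pre-download a gated model into a custom cache location (placeholder token and path shown):

```bash
export HF_TOKEN=hf_xxxxxxxxxxxx         # your HuggingFace access token
export MULLAMA_CACHE_DIR=/data/mullama  # custom cache directory
mullama pull hf:TheBloke/Llama-2-7B-GGUF
```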
### Cache Locations (Cross-Platform)
| Platform | Default cache path |
|---|---|
| Linux | `$XDG_CACHE_HOME/mullama/models` or `~/.cache/mullama/models` |
| macOS | `~/Library/Caches/mullama/models` |
| Windows | `%LOCALAPPDATA%\mullama\models` |
Architecture:
```
┌──────────────────────────────────┐
│ Daemon │
┌─────────────┐ │ ┌────────────────────────────┐ │
│ TUI Client │◄── nng (IPC) ─────►│ │ Model Manager │ │
└─────────────┘ │ │ ┌───────┐ ┌───────┐ │ │
│ │ │Model 1│ │Model 2│ ... │ │
┌─────────────┐ │ │ └───────┘ └───────┘ │ │
│ curl/app │◄── HTTP/REST ─────►│ └────────────────────────────┘ │
└─────────────┘ (OpenAI API) │ │
│ Endpoints: │
┌─────────────┐ │ • /v1/chat/completions │
│ Other Client│◄── nng (IPC) ─────►│ • /v1/completions │
└─────────────┘ │ • /v1/models │
│ • /v1/embeddings │
└──────────────────────────────────┘
```
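Besides chat completions (shown above), the other endpoints follow the usual OpenAI request shapes; for example (standard OpenAI embeddings body assumed):

```bash
# List loaded models
curl http://localhost:8080/v1/models

# Generate embeddings
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model": "llama", "input": "What is machine learning?"}'
```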
Programmatic usage:
```rust
use mullama::daemon::DaemonClient;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect as a client over the local IPC socket
    let client = DaemonClient::connect_default()?;
    let result = client.chat("Hello, AI!", None, 100, 0.7)?;
    println!("{} ({:.1} tok/s)", result.text, result.tokens_per_second());

    // List loaded models
    for model in client.list_models()? {
        println!("{}: {}M params", model.alias, model.info.parameters / 1_000_000);
    }
    Ok(())
}
```
## Late Interaction / ColBERT
Mullama supports ColBERT-style late interaction retrieval with multi-vector embeddings. Unlike traditional embeddings that pool all tokens into a single vector, late interaction preserves per-token embeddings for fine-grained matching using MaxSim scoring.
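For intuition, MaxSim takes each query token vector, finds its best similarity against every document token vector, and sums those maxima. A self-contained sketch of the scoring rule (illustration only, not the crate's implementation; assumes normalized embeddings so dot product equals cosine similarity):

```rust
/// MaxSim over token embeddings: for each query vector, take its best
/// dot product against every document vector, then sum those maxima.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    // Two query token vectors vs. three document token vectors (toy values).
    let query = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let doc = vec![vec![0.9, 0.1], vec![0.0, 1.0], vec![0.5, 0.5]];
    // First query token matches doc token 1 best (0.9); second matches doc token 2 (1.0).
    assert!((max_sim(&query, &doc) - 1.9).abs() < 1e-6);
}
```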
```rust
use mullama::late_interaction::{
    LateInteractionScorer, MultiVectorConfig, MultiVectorGenerator,
};
use mullama::prelude::*;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a generator (works with any embedding model)
    let model = Arc::new(Model::load("model.gguf")?);
    let config = MultiVectorConfig::default()
        .normalize(true)
        .skip_special_tokens(true);
    let mut generator = MultiVectorGenerator::new(model, config)?;

    // Generate multi-vector embeddings
    let query = generator.embed_text("What is machine learning?")?;
    let doc = generator.embed_text("Machine learning is a branch of AI...")?;

    // Score a single query/document pair with MaxSim
    let score = LateInteractionScorer::max_sim(&query, &doc);
    println!("MaxSim: {score}");

    // Top-k retrieval over a small corpus
    let texts = ["Machine learning is a branch of AI...", "Rust is a systems language."];
    let documents: Vec<_> = texts
        .iter()
        .map(|t| generator.embed_text(t))
        .collect::<Result<Vec<_>, _>>()?;
    let top_k = LateInteractionScorer::find_top_k(&query, &documents, 10);
    println!("{top_k:?}");
    Ok(())
}
```
**With parallel processing:**
```rust
// Enable both features: ["late-interaction", "parallel"]
let top_k = LateInteractionScorer::find_top_k_parallel(&query, &documents, 10);
let scores = LateInteractionScorer::batch_score_parallel(&queries, &documents);
```
**Recommended models:**
- `LiquidAI/LFM2-ColBERT-350M-GGUF` - Purpose-trained ColBERT model
- Any GGUF embedding model (works, though retrieval quality is lower than a purpose-trained ColBERT model)
## GPU Acceleration
```bash
# NVIDIA CUDA
export LLAMA_CUDA=1

# Apple Metal (macOS)
export LLAMA_METAL=1

# AMD ROCm (Linux)
export LLAMA_HIPBLAS=1

# Intel OpenCL
export LLAMA_CLBLAST=1
```
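These variables are assumed to take effect when the bundled llama.cpp is compiled, so set them before building; for example:

```bash
# Rebuild with CUDA kernels enabled (flag assumed to be read at build time)
LLAMA_CUDA=1 cargo build --release --features full
```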
## Documentation
| Guide | Description |
|---|---|
| [Getting Started](./docs/GETTING_STARTED.md) | Installation and first application |
| [Platform Setup](./docs/PLATFORM_SETUP.md) | OS-specific setup instructions |
| [Features Guide](./docs/FEATURES.md) | Integration features overview |
| [Use Cases](./docs/USE_CASES.md) | Real-world application examples |
| [API Reference](./docs/API_REFERENCE.md) | Complete API documentation |
| [Sampling Guide](./docs/sampling.md) | Sampling strategies and configuration |
| [GPU Guide](./docs/gpu.md) | GPU acceleration setup |
| [Feature Status](./docs/FEATURE_STATUS.md) | Implementation status and roadmap |
## Examples
```bash
# Basic text generation
cargo run --example simple --features async

# Streaming responses
cargo run --example streaming_generation --features "async,streaming"

# Web service
cargo run --example web_service --features "web,websockets"

# Audio processing
cargo run --example streaming_audio_demo --features "streaming-audio,multimodal"

# Late interaction / ColBERT retrieval
cargo run --example late_interaction --features late-interaction
cargo run --example late_interaction --features late-interaction -- model.gguf
```
## Contributing
We welcome contributions! See [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.
```bash
git clone --recurse-submodules https://github.com/neul-labs/mullama.git
cd mullama
cargo test --all-features
```
## License
MIT License - see [LICENSE](LICENSE) for details.
## llama.cpp Compatibility
Mullama tracks upstream llama.cpp releases:
| Mullama | llama.cpp release | Updated |
|---|---|---|
| 0.1.x | b7542 | Dec 2025 |
### Supported Model Architectures
All architectures supported by llama.cpp b7542, including:
- LLaMA 1/2/3, Mistral, Mixtral, Phi-1/2/3/4
- Qwen, Qwen2, DeepSeek, Yi, Gemma
- And [many more](https://github.com/ggml-org/llama.cpp#supported-models)
## Acknowledgments
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The underlying inference engine
- [ggml](https://github.com/ggerganov/ggml) - Tensor operations library