Mullama
Comprehensive Rust bindings for llama.cpp with advanced integration features
Mullama provides memory-safe Rust bindings for llama.cpp with production-ready features including async/await support, real-time streaming, multimodal processing, and web framework integration.
Why Mullama?
Most llama.cpp Rust bindings expose low-level C APIs directly. Mullama provides an idiomatic Rust experience:
```rust
// Other wrappers: manual memory management, raw pointers, verbose setup
let params = llama_context_default_params();
let ctx = unsafe { llama_new_context_with_model(model, params) };
let tokens = unsafe { llama_tokenize(model, prompt, /* ... */) };
// Don't forget to free everything...

// Mullama: builder patterns, async/await, automatic resource management
// (type and argument names are illustrative)
let model = ModelBuilder::new()
    .path("model.gguf")
    .gpu_layers(32)
    .build()
    .await?;
let response = model.generate("Hello!").await?;
```
Developer experience improvements:
| Feature | Other Wrappers | Mullama |
|---|---|---|
| API Style | Raw FFI / C-like | Builder patterns, fluent API |
| Async Support | Manual threading | Native async/await with Tokio |
| Error Handling | Error codes / panics | Result<T, MullamaError> with context |
| Memory Management | Manual free/cleanup | Automatic RAII |
| Streaming | Callbacks | Stream trait, async iterators |
| Configuration | Struct fields | Type-safe builders with validation |
| Web Integration | DIY | Built-in Axum routes |
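As an illustration of the streaming row above, token streams are consumed with ordinary `Stream` combinators rather than callbacks. A minimal sketch, assuming a `generate_stream` method that yields token results (the method and item types here are assumptions, not the verified API):

```rust
use futures_util::StreamExt; // async iteration over the token stream

// Assumes `model` was built as in the snippet above and that
// `generate_stream` returns an async `Stream` of token results.
let mut stream = model.generate_stream("Tell me a story").await?;
while let Some(token) = stream.next().await {
    print!("{}", token?); // each token arrives as soon as it is generated
}
```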
Key Features
- Async/Await Native - Full Tokio integration for non-blocking operations
- Real-time Streaming - Token-by-token generation with backpressure handling
- Multimodal Processing - Text, image, and audio in a single pipeline
- Late Interaction / ColBERT - Multi-vector embeddings with MaxSim scoring for retrieval
- Web Framework Ready - Direct Axum integration with REST APIs
- WebSocket Support - Real-time bidirectional communication
- Parallel Processing - Work-stealing parallelism for batch operations
- GPU Acceleration - CUDA, Metal, ROCm, and OpenCL support
- Memory Safe - Zero unsafe operations in public API
Quick Start
Installation
```toml
[dependencies]
mullama = "0.1.1"

# With all features
mullama = { version = "0.1.1", features = ["full"] }
```
Prerequisites
Linux (Ubuntu/Debian): Install a C/C++ toolchain and CMake (e.g. `sudo apt install build-essential cmake`).
macOS: Install the Xcode Command Line Tools (`xcode-select --install`) and CMake (e.g. `brew install cmake`).
Windows: Install Visual Studio Build Tools and CMake.
See Platform Setup Guide for detailed instructions.
Basic Example
```rust
use mullama::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Type names are illustrative and follow the builder API shown above.
    let model = ModelBuilder::new()
        .path("model.gguf")
        .build()
        .await?;
    let response = model.generate("Explain Rust ownership in one sentence.").await?;
    println!("{response}");
    Ok(())
}
```
Feature Flags
```toml
[dependencies.mullama]
version = "0.1.1"
features = [
    "async",              # Async/await support
    "streaming",          # Token streaming
    "web",                # Axum web framework
    "websockets",         # WebSocket support
    "multimodal",         # Image and audio processing
    "streaming-audio",    # Real-time audio capture
    "format-conversion",  # Audio/image format conversion
    "parallel",           # Rayon parallel processing
    "late-interaction",   # ColBERT-style multi-vector embeddings
    "daemon",             # Daemon mode with TUI client
    "full",               # All features
]
```
Common Combinations
```toml
# Web applications
features = ["web", "websockets", "async", "streaming"]

# Multimodal AI
features = ["multimodal", "streaming-audio", "format-conversion"]

# High-performance batch processing
features = ["parallel", "async"]

# Semantic search / RAG with ColBERT-style retrieval
features = ["late-interaction", "parallel"]

# Daemon with TUI chat interface
features = ["daemon"]
```
Daemon Mode
Mullama includes a multi-model daemon with an OpenAI-compatible HTTP API and a TUI client. The CLI covers the full workflow:
- Building the CLI binary
- Starting the daemon with a local model, or with a HuggingFace model (auto-downloaded and cached)
- Serving multiple models under custom aliases
- Interactive TUI chat and one-shot generation
- Model management, HuggingFace model search, and cache management
- Calling the daemon through the OpenAI-compatible API
HuggingFace Model Format
```text
hf:<owner>/<repo>:<filename>   # Specific file
hf:<owner>/<repo>              # Auto-detect best GGUF
<alias>:hf:<owner>/<repo>      # With custom alias
```
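For example, `hf:LiquidAI/LFM2-ColBERT-350M-GGUF` pulls the ColBERT model recommended in the Late Interaction section below and lets the daemon auto-detect the best GGUF file.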
Environment Variables
| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace token for gated/private models |
| `MULLAMA_CACHE_DIR` | Override default cache directory |
Cache Locations (Cross-Platform)
| Platform | Default Location |
|---|---|
| Linux | $XDG_CACHE_HOME/mullama/models or ~/.cache/mullama/models |
| macOS | ~/Library/Caches/mullama/models |
| Windows | %LOCALAPPDATA%\mullama\models |
Architecture:
```text
                                   ┌──────────────────────────────────┐
                                   │              Daemon              │
┌─────────────┐                    │  ┌────────────────────────────┐  │
│ TUI Client  │◄── nng (IPC) ─────►│  │       Model Manager        │  │
└─────────────┘                    │  │  ┌───────┐  ┌───────┐      │  │
                                   │  │  │Model 1│  │Model 2│ ...  │  │
┌─────────────┐                    │  │  └───────┘  └───────┘      │  │
│  curl/app   │◄── HTTP/REST ─────►│  └────────────────────────────┘  │
└─────────────┘    (OpenAI API)    │                                  │
                                   │  Endpoints:                      │
┌─────────────┐                    │    • /v1/chat/completions        │
│ Other Client│◄── nng (IPC) ─────►│    • /v1/completions             │
└─────────────┘                    │    • /v1/models                  │
                                   │    • /v1/embeddings              │
                                   └──────────────────────────────────┘
```
Programmatic usage:
```rust
use mullama::daemon::*; // module path illustrative

// Connect as client (type and argument names are illustrative; see the API docs)
let client = DaemonClient::connect_default()?;
let result = client.chat("llama3", "Hello!")?;
println!("{result}");

// List models
for model in client.list_models()? {
    println!("{model:?}");
}
```
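Since the daemon's HTTP endpoints follow the OpenAI wire format, any OpenAI-compatible client can also call it over HTTP. A hedged sketch using `reqwest` and `serde_json` (the address, port, and model alias are assumptions; substitute whatever your daemon actually serves):

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Address and model alias are placeholders for illustration.
    // Requires reqwest with its "json" feature plus serde_json.
    let response: serde_json::Value = reqwest::Client::new()
        .post("http://127.0.0.1:8080/v1/chat/completions")
        .json(&json!({
            "model": "llama3",
            "messages": [{ "role": "user", "content": "Hello!" }]
        }))
        .send()
        .await?
        .json()
        .await?;

    println!("{}", response["choices"][0]["message"]["content"]);
    Ok(())
}
```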
Late Interaction / ColBERT
Mullama supports ColBERT-style late interaction retrieval with multi-vector embeddings. Unlike traditional embeddings that pool all tokens into a single vector, late interaction preserves per-token embeddings for fine-grained matching using MaxSim scoring.
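To make the scoring concrete: MaxSim takes, for each query token embedding, the best similarity against any document token embedding, then sums those maxima over the query. A minimal standalone sketch (independent of Mullama's own `max_sim`), assuming L2-normalized vectors so that the dot product is cosine similarity:

```rust
/// MaxSim over per-token embeddings: for every query vector take the maximum
/// dot product against all document vectors, then sum over the query tokens.
/// Assumes `doc` is non-empty and all vectors share the same dimension.
fn max_sim(query: &[Vec<f32>], doc: &[Vec<f32>]) -> f32 {
    query
        .iter()
        .map(|q| {
            doc.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}
```

Mullama's `max_sim` and `find_top_k` apply this scoring to the multi-vector embeddings produced by the generator shown below.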
```rust
// Type names below are illustrative; the method names match the crate's API.
use mullama::late_interaction::*;
use std::sync::Arc;

// Create generator (works with any embedding model)
let model = Arc::new(Model::load("lfm2-colbert-350m.gguf")?);
let config = MultiVectorConfig::default()
    .normalize(true)
    .skip_special_tokens(true);
let mut generator = MultiVectorGenerator::new(model, config)?;

// Generate multi-vector embeddings
let query = generator.embed_text("What is late interaction?")?;
let doc = generator.embed_text("ColBERT scores per-token embeddings with MaxSim.")?;

// Score with MaxSim
let score = max_sim(&query, &doc);

// Top-k retrieval
let documents: Vec<_> = texts
    .iter()
    .map(|text| generator.embed_text(text))
    .collect::<Result<Vec<_>, _>>()?;
let top_k = find_top_k(&query, &documents, 10);
```
With parallel processing:
```rust
// Enable both features: ["late-interaction", "parallel"]
// (arguments follow the example above)
let top_k = find_top_k_parallel(&query, &documents, 10);
let scores = batch_score_parallel(&query, &documents);
```
Recommended models:
- `LiquidAI/LFM2-ColBERT-350M-GGUF` - Purpose-trained ColBERT model
- Any GGUF embedding model (works but suboptimal for retrieval)
GPU Acceleration
GPU backends are selected at build time. Mullama supports NVIDIA CUDA, Apple Metal (macOS), AMD ROCm (Linux), and Intel OpenCL; see the GPU Guide for the required toolchains and the exact build flags for each backend.
Documentation
| Document | Description |
|---|---|
| Getting Started | Installation and first application |
| Platform Setup | OS-specific setup instructions |
| Features Guide | Integration features overview |
| Use Cases | Real-world application examples |
| API Reference | Complete API documentation |
| Sampling Guide | Sampling strategies and configuration |
| GPU Guide | GPU acceleration setup |
| Feature Status | Implementation status and roadmap |
Examples
The repository ships runnable examples (run with `cargo run --example <name>`, enabling the relevant features):
- Basic text generation
- Streaming responses
- Web service
- Audio processing
- Late interaction / ColBERT retrieval
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
License
MIT License - see LICENSE for details.
llama.cpp Compatibility
Mullama tracks upstream llama.cpp releases:
| Mullama Version | llama.cpp Version | Release Date |
|---|---|---|
| 0.1.x | b7542 | Dec 2025 |
Supported Model Architectures
All architectures supported by llama.cpp b7542, including:
- LLaMA 1/2/3, Mistral, Mixtral, Phi-1/2/3/4
- Qwen, Qwen2, DeepSeek, Yi, Gemma
- And many more