Mullama

Comprehensive Rust bindings for llama.cpp with advanced integration features

Mullama provides memory-safe Rust bindings for llama.cpp with production-ready features including async/await support, real-time streaming, multimodal processing, and web framework integration.

Why Mullama?

Most llama.cpp Rust bindings expose low-level C APIs directly. Mullama provides an idiomatic Rust experience:

// Other wrappers: manual memory management, raw pointers, verbose setup
let params = llama_context_default_params();
let ctx = unsafe { llama_new_context_with_model(model, params) };
let tokens = unsafe { llama_tokenize(model, text.as_ptr(), ...) };
// Don't forget to free everything...

// Mullama: builder patterns, async/await, automatic resource management
let model = ModelBuilder::new()
    .path("model.gguf")
    .gpu_layers(35)
    .build().await?;

let response = model.generate("Hello", 100).await?;

Developer experience improvements:

Feature           | Other Wrappers       | Mullama
API Style         | Raw FFI / C-like     | Builder patterns, fluent API
Async Support     | Manual threading     | Native async/await with Tokio
Error Handling    | Error codes / panics | Result<T, MullamaError> with context
Memory Management | Manual free/cleanup  | Automatic RAII
Streaming         | Callbacks            | Stream trait, async iterators
Configuration     | Struct fields        | Type-safe builders with validation
Web Integration   | DIY                  | Built-in Axum routes
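
To illustrate the Error Handling row above: failures surface as Result values rather than raw error codes or panics. A minimal sketch, using only the ModelBuilder calls shown elsewhere in this README and assuming MullamaError implements Display:

use mullama::prelude::*;

#[tokio::main]
async fn main() {
    // Handle a failed load explicitly instead of unwrapping.
    match ModelBuilder::new().path("missing.gguf").build().await {
        Ok(model) => {
            // Use the model as in the examples below.
            let _ = model;
        }
        Err(e) => eprintln!("failed to load model: {e}"),
    }
}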

Key Features

  • Async/Await Native - Full Tokio integration for non-blocking operations
  • Real-time Streaming - Token-by-token generation with backpressure handling (see the sketch after this list)
  • Multimodal Processing - Text, image, and audio in a single pipeline
  • Web Framework Ready - Direct Axum integration with REST APIs
  • WebSocket Support - Real-time bidirectional communication
  • Parallel Processing - Work-stealing parallelism for batch operations
  • GPU Acceleration - CUDA, Metal, ROCm, and OpenCL support
  • Memory Safe - Zero unsafe operations in public API
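
A minimal streaming sketch follows. The generate_stream method name and the shape of its items are assumptions for illustration, not a confirmed API; consuming the stream uses the standard StreamExt trait from the futures crate:

use futures::StreamExt;
use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("model.gguf")
        .build().await?;

    // Hypothetical streaming call; tokens arrive as the model produces them,
    // and backpressure comes from polling the stream instead of buffering.
    let mut stream = model.generate_stream("Tell me a story", 200).await?;
    while let Some(token) = stream.next().await {
        print!("{}", token?); // items assumed to be Result-wrapped token strings
    }
    Ok(())
}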

Quick Start

Installation

[dependencies]
mullama = "0.1.0"

# With all features
mullama = { version = "0.1.0", features = ["full"] }

Prerequisites

Linux (Ubuntu/Debian):

sudo apt install -y build-essential cmake pkg-config libasound2-dev libpulse-dev

macOS:

brew install cmake pkg-config portaudio

Windows: Install Visual Studio Build Tools and CMake.

See Platform Setup Guide for detailed instructions.

Basic Example

use mullama::prelude::*;

#[tokio::main]
async fn main() -> Result<(), MullamaError> {
    let model = ModelBuilder::new()
        .path("model.gguf")
        .context_size(4096)
        .build().await?;

    let response = model.generate("The future of AI is", 100).await?;
    println!("{}", response);

    Ok(())
}

Feature Flags

[dependencies.mullama]
version = "0.1.0"
features = [
    "async",              # Async/await support
    "streaming",          # Token streaming
    "web",                # Axum web framework
    "websockets",         # WebSocket support
    "multimodal",         # Image and audio processing
    "streaming-audio",    # Real-time audio capture
    "format-conversion",  # Audio/image format conversion
    "parallel",           # Rayon parallel processing
    "daemon",             # Daemon mode with TUI client
    "full"                # All features
]

Common Combinations

# Web applications
features = ["web", "websockets", "async", "streaming"]

# Multimodal AI
features = ["multimodal", "streaming-audio", "format-conversion"]

# High-performance batch processing
features = ["parallel", "async"]

# Daemon with TUI chat interface
features = ["daemon"]

Daemon Mode

Mullama includes a multi-model daemon with an OpenAI-compatible HTTP API and a TUI client:

# Build the CLI
cargo build --release --features daemon

# Start daemon with local model
mullama serve --model llama:./llama.gguf

# Start with HuggingFace model (auto-downloads and caches)
mullama serve --model hf:TheBloke/Llama-2-7B-GGUF

# Multiple models with custom aliases
mullama serve \
  --model llama:hf:TheBloke/Llama-2-7B-GGUF:llama-2-7b.Q4_K_M.gguf \
  --model mistral:hf:TheBloke/Mistral-7B-v0.1-GGUF

# Interactive TUI chat
mullama chat

# One-shot generation
mullama run "What is the meaning of life?"

# Model management
mullama models            # List loaded models
mullama load phi:./phi.gguf  # Load a model
mullama unload phi        # Unload a model
mullama default llama     # Set default model

# Search for models on HuggingFace
mullama search "llama 7b"          # Search GGUF models
mullama search "mistral" --files   # Show available files
mullama search "phi" --all         # Include non-GGUF models
mullama info TheBloke/Llama-2-7B-GGUF  # Show repo details

# Cache management
mullama pull hf:TheBloke/Llama-2-7B-GGUF  # Pre-download model
mullama cache list        # List cached models
mullama cache size        # Show cache size
mullama cache clear       # Clear cache

# Use OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama", "messages": [{"role": "user", "content": "Hello!"}]}'

HuggingFace Model Format

hf:<owner>/<repo>:<filename>   # Specific file
hf:<owner>/<repo>              # Auto-detect best GGUF
<alias>:hf:<owner>/<repo>      # With custom alias
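
For reference, a spec string splits on ':' with an optional leading alias. The standalone sketch below only illustrates how the pieces decompose; it is not Mullama's actual parser:

// Illustrative decomposition of a model spec string into (alias, repo, filename).
fn split_spec(spec: &str) -> (Option<&str>, &str, Option<&str>) {
    // Optional "<alias>:" prefix before the "hf:" marker.
    let (alias, rest) = match spec.find(":hf:") {
        Some(i) => (Some(&spec[..i]), &spec[i + 1..]),
        None => (None, spec),
    };
    // "hf:<owner>/<repo>" with an optional ":<filename>" suffix.
    let repo_part = rest.strip_prefix("hf:").unwrap_or(rest);
    match repo_part.split_once(':') {
        Some((repo, file)) => (alias, repo, Some(file)),
        None => (alias, repo_part, None),
    }
}

fn main() {
    assert_eq!(
        split_spec("llama:hf:TheBloke/Llama-2-7B-GGUF:llama-2-7b.Q4_K_M.gguf"),
        (Some("llama"), "TheBloke/Llama-2-7B-GGUF", Some("llama-2-7b.Q4_K_M.gguf"))
    );
    assert_eq!(
        split_spec("hf:TheBloke/Llama-2-7B-GGUF"),
        (None, "TheBloke/Llama-2-7B-GGUF", None)
    );
}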

Environment Variables

Variable          | Description
HF_TOKEN          | HuggingFace token for gated/private models
MULLAMA_CACHE_DIR | Override default cache directory

Cache Locations (Cross-Platform)

Platform | Default Location
Linux    | $XDG_CACHE_HOME/mullama/models or ~/.cache/mullama/models
macOS    | ~/Library/Caches/mullama/models
Windows  | %LOCALAPPDATA%\mullama\models
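
A sketch of how this resolution can be reproduced, using only the environment variables and paths from the two tables above (illustrative, not Mullama's internal logic):

use std::env;
use std::path::PathBuf;

// Resolve the model cache directory: explicit override first, then the
// platform default listed in the table above.
fn model_cache_dir() -> Option<PathBuf> {
    if let Ok(dir) = env::var("MULLAMA_CACHE_DIR") {
        return Some(PathBuf::from(dir));
    }
    let base = if cfg!(target_os = "macos") {
        env::var_os("HOME").map(|h| PathBuf::from(h).join("Library/Caches"))
    } else if cfg!(target_os = "windows") {
        env::var_os("LOCALAPPDATA").map(PathBuf::from)
    } else {
        env::var_os("XDG_CACHE_HOME")
            .map(PathBuf::from)
            .or_else(|| env::var_os("HOME").map(|h| PathBuf::from(h).join(".cache")))
    };
    base.map(|b| b.join("mullama").join("models"))
}

fn main() {
    println!("{:?}", model_cache_dir());
}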

Architecture:

                                   ┌──────────────────────────────────┐
                                   │           Daemon                 │
┌─────────────┐                    │  ┌────────────────────────────┐  │
│  TUI Client │◄── nng (IPC) ─────►│  │     Model Manager          │  │
└─────────────┘                    │  │  ┌───────┐  ┌───────┐      │  │
                                   │  │  │Model 1│  │Model 2│ ...  │  │
┌─────────────┐                    │  │  └───────┘  └───────┘      │  │
│   curl/app  │◄── HTTP/REST ─────►│  └────────────────────────────┘  │
└─────────────┘   (OpenAI API)     │                                  │
                                   │  Endpoints:                      │
┌─────────────┐                    │  • /v1/chat/completions          │
│ Other Client│◄── nng (IPC) ─────►│  • /v1/completions               │
└─────────────┘                    │  • /v1/models                    │
                                   │  • /v1/embeddings                │
                                   └──────────────────────────────────┘

Programmatic usage:

use mullama::daemon::{DaemonClient, DaemonBuilder};

// Connect as client
let client = DaemonClient::connect_default()?;
let result = client.chat("Hello, AI!", None, 100, 0.7)?;
println!("{} ({:.1} tok/s)", result.text, result.tokens_per_second());

// List models
for model in client.list_models()? {
    println!("{}: {}M params", model.alias, model.info.parameters / 1_000_000);
}

GPU Acceleration

# NVIDIA CUDA
export LLAMA_CUDA=1

# Apple Metal (macOS)
export LLAMA_METAL=1

# AMD ROCm (Linux)
export LLAMA_HIPBLAS=1

# OpenCL (CLBlast)
export LLAMA_CLBLAST=1

Documentation

Document        | Description
Getting Started | Installation and first application
Platform Setup  | OS-specific setup instructions
Features Guide  | Integration features overview
Use Cases       | Real-world application examples
API Reference   | Complete API documentation
Sampling Guide  | Sampling strategies and configuration
GPU Guide       | GPU acceleration setup
Feature Status  | Implementation status and roadmap

Examples

# Basic text generation
cargo run --example simple --features async

# Streaming responses
cargo run --example streaming_generation --features "async,streaming"

# Web service
cargo run --example web_service --features "web,websockets"

# Audio processing
cargo run --example streaming_audio_demo --features "streaming-audio,multimodal"

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

git clone --recurse-submodules https://github.com/neul-labs/mullama.git
cd mullama
cargo test --all-features

License

MIT License - see LICENSE for details.

Acknowledgments

  • llama.cpp - The underlying inference engine
  • ggml - Tensor operations library