docs.rs failed to build mullama-0.3.0
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.

Mullama

Run any LLM locally. Use it from any language. Deploy anywhere.

Mullama is a local LLM server and library that works just like Ollama — same CLI commands, same model format, same Modelfile syntax — but with native language bindings for Python, Node.js, Go, PHP, Rust, and C/C++. Embed inference directly in your app with zero HTTP overhead, or run it as a server with OpenAI and Anthropic-compatible APIs.

Install

# One-liner (Linux/macOS)
curl -fsSL https://mullama.cognisoc.com/install.sh | sh

# Windows (PowerShell)
iwr -useb https://mullama.cognisoc.com/install.ps1 | iex

# Or via package managers
pip install mullama          # Python
npm install mullama          # Node.js
cargo add mullama            # Rust
go get github.com/cognisoc/mullama   # Go
composer require mullama/mullama      # PHP

Quick Start

# Run a model (daemon auto-starts)
mullama run llama3.2:1b "What is the capital of France?"

# Interactive chat
mullama chat

# Start an OpenAI-compatible server
mullama serve --model llama3.2:1b

Coming from Ollama? Your commands work unchanged — run, pull, serve, list, ps, create, show, rm, cp.

Use as a Library

Embed LLM inference directly in your application — no server, no HTTP overhead, no separate process.

Python:

from mullama import Model, Context

model = Model.load('llama3.2-1b.gguf', n_gpu_layers=32)
ctx = Context(model, n_ctx=4096)
response = ctx.generate('Hello, AI!')
print(response)

Node.js:

const { Model, Context } = require('mullama');

const model = await Model.load('llama3.2-1b.gguf', { gpuLayers: 32 });
const ctx = new Context(model, { contextSize: 4096 });
const response = await ctx.generate('Hello, AI!');
console.log(response);

Rust:

use mullama::{Model, Context, ContextParams};

let model = Model::load("llama3.2-1b.gguf")?;
let mut ctx = Context::new(&model, ContextParams::default())?;
let response = ctx.generate("Hello, AI!", 256)?;
println!("{}", response);

Go:

import "github.com/cognisoc/mullama"

model, _ := mullama.LoadModel("llama3.2-1b.gguf", &mullama.ModelParams{NGPULayers: 32})
ctx, _ := mullama.NewContext(model, &mullama.ContextParams{NCtx: 4096})
response, _ := ctx.Generate("Hello, AI!", 256, nil)
fmt.Println(response)

PHP:

use Mullama\Model;
use Mullama\Context;

$model = Model::load('llama3.2-1b.gguf', ['nGpuLayers' => 32]);
$ctx = new Context($model, ['nCtx' => 4096]);
$response = $ctx->generate('Hello, AI!');
echo $response;

See the bindings documentation for full API details.

Why Mullama?


Native bindings for 6 languages	Python, Node.js, Go, PHP, Rust, C/C++ — call models directly, no HTTP roundtrips
Drop-in Ollama replacement	Same CLI commands, same Modelfile format, same model registry
OpenAI + Anthropic API compatible	Use your existing SDKs and tools without changes
Embed in any app	Run inference in-process — no separate daemon required
7 GPU backends	CUDA, Metal, ROCm, OpenCL, Vulkan, SYCL, RPC
Multimodal	Text, image, and real-time audio with voice activity detection
Built-in Web UI and TUI	Chat interface, model management, and API playground

What You Can Build

Chatbots and assistants — Streaming responses, multi-turn context, and custom system prompts
RAG pipelines — Embeddings, ColBERT-style semantic search, and grammar-constrained generation
Voice assistants — Real-time audio capture with VAD, speech-to-text, and streaming LLM responses
API servers — Production-ready OpenAI-compatible endpoints with streaming SSE
Edge deployments — Embed a model directly in your app with no network dependency
Batch processing — Parallel inference across documents with work-stealing scheduling

Ollama Compatibility

Feature	Mullama	Ollama
CLI commands (`run`, `pull`, `serve`, etc.)	Same syntax	—
Modelfile format	Compatible	—
GGUF models	Yes	Yes
OpenAI API	Yes	Yes
Anthropic API	Yes	No
Native language bindings	6 languages	HTTP only
Embed in your app (no daemon)	Yes	No
Built-in Web UI	Yes	No
Built-in TUI chat	Yes	No

Full comparison | Migration guide

GPU Acceleration

Set the environment variable for your hardware before building:

export LLAMA_CUDA=1      # NVIDIA CUDA
export LLAMA_METAL=1     # Apple Silicon
export LLAMA_HIPBLAS=1   # AMD ROCm
export LLAMA_CLBLAST=1   # Intel/AMD OpenCL
export LLAMA_VULKAN=1    # Vulkan (cross-platform)
export LLAMA_SYCL=1      # Intel SYCL/oneAPI
export LLAMA_RPC=1       # Distributed RPC backend

Roadmap

v0.4 — Speculative decoding, prompt caching, improved quantization support
v0.5 — Distributed inference across multiple nodes, model sharding
v0.6 — Built-in fine-tuning (LoRA/QLoRA), training data pipelines
v1.0 — Stable API, LTS release, comprehensive benchmarks

Documentation

Full documentation is available at docs.cognisoc.com/mullama.

Guides cover installation, library usage, daemon configuration, language bindings, advanced features, API reference, and tutorials.

Contributing

git clone --recurse-submodules https://github.com/cognisoc/mullama.git
cd mullama
cargo test --all-features

See CONTRIBUTING.md for guidelines.

License

MIT License — see LICENSE for details.

mullama 0.3.0