<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>llama.cpp Integration · RavenClaws Docs</title>
<meta name="description" content="Use RavenClaws with llama.cpp for lightweight, CPU-first local inference via the OpenAI-compatible provider.">
<link rel="canonical" href="https://ravenclaws.io/docs/llamacpp">
<meta name="theme-color" content="#070a10">
<meta property="og:title" content="RavenClaws llama.cpp Integration">
<meta property="og:description" content="Lightweight CPU inference with llama.cpp.">
<meta property="og:image" content="https://ravenclaws.io/assets/og-image.png">
<meta name="twitter:card" content="summary_large_image">
<link rel="icon" href="/assets/favicon.ico" sizes="any">
<link rel="icon" type="image/png" href="/assets/favicon-32.png" sizes="32x32">
<link rel="apple-touch-icon" href="/assets/apple-touch-icon.png">
<link rel="stylesheet" href="/assets/styles.css">
</head>
<body>
<a class="skip" href="#main">Skip to content</a>
<header class="site-header">
<div class="wrap">
<nav class="nav" aria-label="Primary">
<a class="brand" href="/"><img src="/assets/favicon-512.png" alt="" width="30" height="30"><span>Raven<b>Claws</b></span></a>
<div class="nav-links">
<a href="/#features">Features</a><a href="/#providers">Providers</a><a href="/#security">Security</a><a href="/docs/">Docs</a><a href="/#license">License</a>
</div>
<span class="nav-spacer"></span>
<div class="nav-cta">
<a class="ghost-pill" href="https://crates.io/crates/ravenclaws" rel="noopener">crates.io</a>
<a class="btn btn--primary btn--sm" href="https://github.com/egkristi/RavenClaws" rel="noopener">GitHub</a>
</div>
<button class="nav-toggle" aria-label="Menu" aria-expanded="false"><svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2"><path d="M3 6h18M3 12h18M3 18h18"/></svg></button>
</nav>
</div>
</header>
<main id="main">
<div class="wrap">
<div class="docs">
<aside class="docs-side">
<h5>Documentation</h5>
<a href="/docs/">Overview</a>
<a href="/docs/getting-started">Getting started</a>
<a href="/docs/configuration">Configuration</a>
<a href="/docs/interaction-modes">Interaction modes</a>
<a href="/docs/swarm-mode">Swarm mode</a>
<a href="/docs/mcp-integration">MCP integration</a>
<a href="/docs/heartbeat-mode">Heartbeat mode</a>
<a href="/docs/server-mode">Server mode</a>
<a href="/docs/vllm">vLLM</a>
<a href="/docs/llamacpp" class="active">llama.cpp</a>
<a href="/docs/demo">Demo</a>
<a href="/docs/migration">Migration guide</a>
<h5>On this page</h5>
<a href="#quick-start" data-spy>Quick start</a>
<a href="#configuration" data-spy>Configuration</a>
<a href="#tool-calling" data-spy>Tool calling</a>
<a href="#troubleshooting" data-spy>Troubleshooting</a>
<a href="#performance" data-spy>Performance tips</a>
<a href="#multi-model" data-spy>Multi-model</a>
</aside>
<article class="doc-body">
<p class="breadcrumb"><a href="/docs/">Docs</a> / llama.cpp</p>
<h1>llama.cpp integration</h1>
<p class="lead-box"><a href="https://github.com/ggerganov/llama.cpp" rel="noopener">llama.cpp</a> is a lightweight, CPU-first inference engine for LLMs using GGUF format models. RavenClaws supports llama.cpp via the generic <code>openai-compatible</code> provider — no special configuration needed.</p>
<h2 id="quick-start">Quick start</h2>
<h3>1. Start llama.cpp server</h3>
<div class="code"><div class="code__bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span class="label">shell</span><button class="code__copy" type="button">Copy</button></div>
<pre><code><span class="tok-c"># Download a GGUF model</span>
<span class="tok-d">wget</span> https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/mistral-7b-instruct-v0.3.Q4_K_M.gguf
<span class="tok-c"># Start the server</span>
<span class="tok-d">llama-server</span> <span class="tok-k">-m</span> mistral-7b-instruct-v0.3.Q4_K_M.gguf <span class="tok-k">--port</span> 8080</code></pre></div>
<p>Or using Docker:</p>
<div class="code"><div class="code__bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span class="label">shell</span><button class="code__copy" type="button">Copy</button></div>
<pre><code><span class="tok-d">docker run --rm -p 8080:8080</span> \
<span class="tok-k">-v</span> $(pwd)/models:/models \
<span class="tok-d">ghcr.io/ggerganov/llama.cpp:server</span> \
<span class="tok-k">-m</span> /models/mistral-7b-instruct-v0.3.Q4_K_M.gguf \
<span class="tok-k">--port</span> 8080</code></pre></div>
<h3>2. Configure RavenClaws</h3>
<p>Via environment variables:</p>
<div class="code"><div class="code__bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span class="label">shell</span><button class="code__copy" type="button">Copy</button></div>
<pre><code><span class="tok-k">export</span> RAVENCLAWS__LLM__PROVIDER=<span class="tok-s">"openai-compatible"</span>
<span class="tok-k">export</span> RAVENCLAWS__LLM__ENDPOINT=<span class="tok-s">"http://localhost:8080/v1/chat/completions"</span>
<span class="tok-k">export</span> RAVENCLAWS__LLM__MODEL=<span class="tok-s">"mistral-7b-instruct-v0.3"</span>
<span class="tok-d">ravenclaws</span> <span class="tok-k">--exec</span> <span class="tok-s">"What is the capital of France?"</span></code></pre></div>
<h2 id="configuration">Configuration reference</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Field</th><th>Value</th><th>Description</th></tr></thead>
<tbody>
<tr><td><code>provider</code></td><td><code>openai-compatible</code></td><td>Must be set to <code>openai-compatible</code></td></tr>
<tr><td><code>endpoint</code></td><td><code>http://localhost:8080/v1/chat/completions</code></td><td>llama.cpp's OpenAI-compatible endpoint</td></tr>
<tr><td><code>model</code></td><td>(model name)</td><td>The GGUF model loaded in llama.cpp</td></tr>
<tr><td><code>api_key</code></td><td>(optional)</td><td>Not needed for local llama.cpp</td></tr>
</tbody>
</table>
</div>
<h2 id="tool-calling">Tool-calling support</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Backend</th><th>Tool calling</th><th>Notes</th></tr></thead>
<tbody>
<tr><td>llama.cpp</td><td>❌ None</td><td>GGUF format does not support structured tool calling</td></tr>
<tr><td>RavenClaws fallback</td><td>✅ Text-based parsing</td><td>Detects <code>TOOL_CALL:</code> / <code>ARGS:</code> patterns automatically</td></tr>
</tbody>
</table>
</div>
<p>llama.cpp does not support structured function calling (OpenAI tools format). However, RavenClaws's agent loop includes a text-based fallback that detects <code>TOOL_CALL:</code> and <code>ARGS:</code> patterns in the model's response text. For best results, use <code>--no-final-required</code>:</p>
<div class="code"><div class="code__bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span class="label">shell</span><button class="code__copy" type="button">Copy</button></div>
<pre><code><span class="tok-d">ravenclaws</span> <span class="tok-k">--provider</span> openai-compatible \
<span class="tok-k">--endpoint</span> http://localhost:8080/v1/chat/completions \
<span class="tok-k">--model</span> mistral-7b-instruct-v0.3 \
<span class="tok-k">--exec</span> <span class="tok-s">"List the files in the current directory"</span> \
<span class="tok-k">--no-final-required</span></code></pre></div>
<h2 id="troubleshooting">Troubleshooting</h2>
<div class="table-wrap">
<table>
<thead><tr><th>Problem</th><th>Likely cause</th><th>Solution</th></tr></thead>
<tbody>
<tr><td>Connection refused</td><td>llama.cpp not running</td><td>Start <code>llama-server</code> with your GGUF model</td></tr>
<tr><td>Model not found</td><td>Wrong model name</td><td>Check <code>curl http://localhost:8080/v1/models</code></td></tr>
<tr><td>Empty response</td><td>Model not fully loaded</td><td>Wait for llama.cpp to finish loading</td></tr>
<tr><td>Slow inference</td><td>CPU-only inference</td><td>Use a smaller quantized model (Q4_K_M or Q3_K_S)</td></tr>
<tr><td>Tool calls not working</td><td>GGUF doesn't support tools</td><td>Use <code>--no-final-required</code> and rely on text-based fallback</td></tr>
<tr><td>High memory usage</td><td>Large model on CPU</td><td>Use a smaller GGUF quantization (Q4_K_M is a good balance)</td></tr>
</tbody>
</table>
</div>
<h2 id="performance">Performance tips</h2>
<ul>
<li><strong>Use quantized models</strong> — Q4_K_M offers the best quality-to-speed ratio</li>
<li><strong>Match context window</strong> — Set <code>--ctx-size</code> to match your task needs (4096+ recommended for agent tasks)</li>
<li><strong>Batch size</strong> — Increase <code>--batch-size</code> for faster prompt processing</li>
<li><strong>GPU offloading</strong> — Use <code>-ngl N</code> to offload N layers to GPU if available</li>
</ul>
<h2 id="multi-model">Multi-model with llama.cpp</h2>
<p>You can use llama.cpp alongside other providers in multi-model mode:</p>
<div class="code"><div class="code__bar"><span class="dot"></span><span class="dot"></span><span class="dot"></span><span class="label">ravenclaws.toml</span><button class="code__copy" type="button">Copy</button></div>
<pre><code>[llm]
provider = <span class="tok-s">"multi"</span>
[[llm.models]]
provider = <span class="tok-s">"openai-compatible"</span>
endpoint = <span class="tok-s">"http://localhost:8080/v1/chat/completions"</span>
model = <span class="tok-s">"mistral-7b-instruct-v0.3"</span>
[[llm.models]]
provider = <span class="tok-s">"openai"</span>
model = <span class="tok-s">"gpt-4o"</span>
api_key = <span class="tok-s">"${OPENAI_API_KEY}"</span></code></pre></div>
<nav class="doc-nav">
<a class="prev" href="/docs/vllm"><span class="dir">← Previous</span><br><span class="ttl">vLLM</span></a>
<a class="next" href="/docs/migration"><span class="dir">Next →</span><br><span class="ttl">Migration guide</span></a>
</nav>
</article>
</div>
</div>
</main>
<footer class="site-footer">
<div class="wrap">
<div class="foot-bottom" style="border-top:0">
<span>© <span data-year>2026</span> RavenClaws · AGPL-3.0-or-later + Commercial</span>
<span class="made">Built in <b>Rust</b> 🦀 · Deployed on Cloudflare</span>
</div>
</div>
</footer>
<script src="/assets/main.js" defer></script>
</body>
</html>