llama-cpp-v3-agent-sdk

Agentic tool-use loop and multi-agent orchestration on top of llama-cpp-v3. Built for developers who need high-performance, local LLM agents with predictable state management and VRAM efficiency.

Features

  • 🤖 Agentic Loop: Generate → parse tool calls → execute → repeat.
  • 🔧 Native Tooling: High-fidelity bash, read, write, edit, and glob implementations.
  • 🔒 Granular Permissions: Callback-based system for human-in-the-loop tool approval.
  • 🏗️ Memory-Mapped Inference: Shared model weights across multiple agents via InferenceEngine.
  • 🌊 Workflow Orchestration: Define complex multi-agent pipelines in JSON with dynamic input mapping and conditional execution.
  • 📡 Reactive Events: Stream text deltas, tool lifecycle events, and pipeline progress.

Technical Architecture

1. VRAM Management & Context Pooling

Unlike traditional wrappers that initialize a full context per agent, llama-cpp-v3-agent-sdk uses an InferenceScheduler to manage physical memory.

  • Shared Engine: Multiple agents can share a single InferenceEngine instance, loading the model weights into VRAM only once.
  • Context Pooling: Agents do not own a LlamaContext by default. Instead, they request a temporary permit from a pool.
  • Serialization: Use InferenceScheduler::new(1) to ensure only one agent runs inference at a time, allowing multiple complex agents to run on consumer hardware with limited VRAM.
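
As a sketch, sharing one engine between two agents might look like the snippet below. The InferenceEngine::load constructor and the exact import paths are assumptions made for illustration; .engine(), .scheduler(), InferenceScheduler::new(1), and chat_simple() come from the API reference and examples elsewhere in this document.

use std::sync::Arc;
use llama_cpp_v3_agent_sdk::{AgentBuilder, InferenceEngine, InferenceScheduler};

fn main() -> anyhow::Result<()> {
    // Load the GGUF weights into VRAM once (constructor name assumed for illustration).
    let engine = Arc::new(InferenceEngine::load("model.gguf")?);
    // A scheduler with a single permit serializes inference across all agents.
    let scheduler = Arc::new(InferenceScheduler::new(1));

    let mut planner = AgentBuilder::new()
        .engine(engine.clone())
        .scheduler(scheduler.clone())
        .system_prompt("You break tasks into steps.")
        .build()?;

    let mut executor = AgentBuilder::new()
        .engine(engine)
        .scheduler(scheduler)
        .system_prompt("You carry out individual steps.")
        .build()?;

    let plan = planner.chat_simple("Plan a cleanup of the ./logs directory.")?;
    let result = executor.chat_simple(&plan)?;
    println!("{result}");
    Ok(())
}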

2. Multi-Agent Workflow Engine

The WorkflowEngine is a stateless runner designed to orchestrate sequences of agents. It decouples the execution logic from state persistence.

  • Input Resolution (input_mapping): Steps define data dependencies. The engine resolves keys from the initial context and overrides them with artifacts from named previous steps. JSON artifacts are automatically parsed into structured objects for prompts.
  • Conditional Execution: Supports dynamic skipping based on previous outputs using dot-notation (e.g., step_name.field_name) to look into dependency JSON results.
  • Workflow Storage: To maintain state, implement the WorkflowStorage trait. This enables Workflow Resumption, skipping already-completed steps if a pipeline is restarted with the same session_id.
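
A sketch of what such a pipeline definition might look like is shown below. Only input_mapping and the step_name.field_name dot-notation come from the description above; the other field names (steps, name, agent, condition) are illustrative assumptions rather than the crate's confirmed schema.

{
  "steps": [
    {
      "name": "triage",
      "agent": "bug_triager",
      "input_mapping": { "report": "initial.bug_report" }
    },
    {
      "name": "patch",
      "agent": "code_fixer",
      "input_mapping": { "diagnosis": "triage.root_cause" },
      "condition": "triage.is_reproducible"
    }
  ]
}

In this sketch the patch step runs only when the triage step's JSON artifact marks the bug as reproducible, and its prompt receives the parsed root_cause field from that artifact.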

3. Conversation & Message Compaction

The Conversation struct manages message history. When the context limit (n_ctx) is reached, the agent employs a compaction strategy to summarize or prune older messages while preserving the system prompt and critical context.
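
Conceptually, a compaction pass behaves like the sketch below; the Message type and the summarize closure are stand-ins for illustration, not the crate's actual Conversation internals.

// Conceptual sketch only: illustrates the compaction strategy, not the real API.
struct Message {
    role: String,
    content: String,
}

fn compact(history: &mut Vec<Message>, summarize: impl Fn(&[Message]) -> String) {
    // Short conversations still fit in n_ctx and need no compaction.
    if history.len() <= 6 {
        return;
    }
    // Always preserve the system prompt.
    let system = history.remove(0);
    // Keep the most recent messages verbatim so critical context survives.
    let recent = history.split_off(history.len() - 4);
    // Collapse the older middle of the conversation into a single summary message.
    let summary = Message {
        role: "assistant".into(),
        content: summarize(history.as_slice()),
    };
    *history = vec![system, summary];
    history.extend(recent);
}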

4. Tool-Use Loop Mechanics

The agent follows a rigorous cycle:

  1. Generate: The LLM generates text until it produces a <tool_call> block or hits a stop sequence.
  2. Parse: The engine parses the block using a JSON schema for the requested tool.
  3. Execute: The tool's execute method is called (subject to permissions).
  4. Repeat: The tool output is appended as a tool message, and the cycle repeats until a final answer is produced.
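
For illustration, a single tool call might be emitted like the line below; the tag contents and JSON layout are assumptions based on common tool-calling formats, not the crate's documented wire protocol.

<tool_call>{"name": "bash", "arguments": {"command": "df -h ."}}</tool_call>

The engine validates the arguments against the bash tool's schema, executes the command once permission is granted, appends the output as a tool message, and resumes generation with that output in context.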

Configuration Reference

AgentBuilder API

Model Loading (Standalone Mode)

  • .backend(Backend): Sets the compute target. Supports CUDA, Vulkan, SYCL, and CPU.
  • .model_path(&str): Path to the GGUF weights file.
  • .n_gpu_layers(i32): GPU offloading; -1 offloads all layers, 0 keeps everything on the CPU.
  • .chat_template(&str): Custom Jinja template for prompt formatting.
  • .app_name(&str): Application identifier used for cache directory naming.
  • .cache_dir(PathBuf): Directory where runtime DLLs are downloaded and stored.
  • .explicit_dll_path(PathBuf): Direct path to llama.dll, bypassing the download step.
  • .dll_version(&str): Pins the specific llama.cpp version to download.
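
For example, a standalone agent that pins where its runtime DLLs live might be built like this; the model path, app name, and version string are placeholders.

use std::path::PathBuf;
use llama_cpp_v3_agent_sdk::AgentBuilder;
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("models/assistant.gguf")       // placeholder GGUF path
        .n_gpu_layers(-1)                          // offload all layers
        .app_name("my-agent")                      // names the cache directory
        .cache_dir(PathBuf::from(".llama-cache"))  // where runtime DLLs are stored
        .dll_version("b4500")                      // placeholder llama.cpp version pin
        .build()?;

    println!("{}", agent.chat_simple("Hello")?);
    Ok(())
}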

Shared Engine Mode

  • .engine(Arc<InferenceEngine>): Injects an already loaded model so agents share the same weights.
  • .scheduler(Arc<Scheduler>): Enables context pooling; required for high-density agent deployments.

Inference & Sampling

  • .n_ctx(u32): Context size; the maximum number of tokens for history plus generation.
  • .temperature(f32): Sampling temperature; higher is more random, lower is more deterministic.
  • .max_tokens_per_completion(usize): Hard cap on tokens generated in a single model call.
  • .top_k(i32): Top-K sampling; limits the vocabulary to the K most likely tokens.
  • .min_p(f32): Min-P sampling; filters out tokens below a probability threshold.
  • .repeat_penalty(f32): Repetition control; penalizes already generated tokens.
  • .stop_sequence(&str): Adds a string that terminates generation when produced.
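
As an illustration, a conservative sampling setup could look like the fragment below (all values are arbitrary); it would sit inside a builder chain like the Quick Start's, in a function returning anyhow::Result.

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .n_ctx(8192)                         // history + generation budget
    .max_tokens_per_completion(1024)     // hard cap per model call
    .temperature(0.2)                    // mostly deterministic output
    .top_k(40)
    .min_p(0.05)
    .repeat_penalty(1.1)
    .stop_sequence("</answer>")          // illustrative stop marker
    .build()?;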

Agent Behavior & Loop

  • .system_prompt(&str): Defines the agent's persona and mission.
  • .max_iterations(usize): Caps the tool-use loop to prevent runaway iterations.
  • .auto_approve(): Disables the permission system entirely ("YOLO mode").
  • .permission_callback(F): Human-in-the-loop callback for manual tool approval.
  • .tool(Box<dyn Tool>): Registers a custom tool implementation.
  • .skip_builtin_tools(): Minimal mode; disables bash, read, write, and the other built-in tools.
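
A sketch of a restricted agent using these options; the permission callback's parameters (a tool name and its JSON arguments) and its bool return value are an assumed signature, shown only to illustrate the approval flow.

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .system_prompt("You inspect the repository but never modify files.")
    .max_iterations(8)                   // cap the tool-use loop
    .permission_callback(|tool, args| {
        // Assumed signature: return true to approve the call, false to deny it.
        println!("approve {tool}? args: {args}");
        tool != "bash"
    })
    .build()?;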

Skills & Discovery

  • .skills_path(PathBuf): Adds a directory to search during skill discovery.
  • .activate_skill(&str): Explicitly loads a named skill.
  • .no_skills(): Disables scanning for skill folders.
  • .no_agents_md(): Disables scanning for AGENTS.md files.
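
For example (the directory and skill name are placeholders):

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .skills_path(PathBuf::from("./skills"))   // extra directory to scan for skills
    .activate_skill("code-review")            // placeholder skill name
    .no_agents_md()                           // ignore AGENTS.md registries
    .build()?;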

Documentation & Tutorials


Quick Start (Standalone Agent)

use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder};
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("model.gguf")
        .n_gpu_layers(-1) // Offload all to GPU
        .system_prompt("You are a system administrator.")
        .auto_approve()
        .build()?;

    let response = agent.chat_simple("What is the disk usage in the current directory?")?;
    println!("Agent says: {}", response);
    Ok(())
}

License

MIT