llama-cpp-v3-agent-sdk

Agentic tool-use loop and multi-agent orchestration on top of llama-cpp-v3. Built for developers who need high-performance, local LLM agents with predictable state management and VRAM efficiency.

Features

  • 🤖 Agentic Loop: Generate → parse tool calls → execute → repeat.
  • 🔧 Native Tooling: High-fidelity bash, read, write, edit, and glob implementations.
  • 🔒 Granular Permissions: Callback-based system for human-in-the-loop tool approval.
  • 🏗️ Memory-Mapped Inference: Shared model weights across multiple agents via InferenceEngine.
  • 🌊 Workflow Orchestration: Define complex multi-agent pipelines in JSON with dynamic input mapping and conditional execution.
  • 📡 Reactive Events: Stream text deltas, tool lifecycle events, and pipeline progress.

Technical Architecture

1. VRAM Management & Context Pooling

Unlike traditional wrappers that initialize a full context per agent, llama-cpp-v3-agent-sdk uses an InferenceScheduler to manage physical memory.

  • Shared Engine: Multiple agents can share a single InferenceEngine instance, loading the model weights into VRAM only once.
  • Context Pooling: Agents do not own a LlamaContext by default. Instead, they request a temporary permit from a pool.
  • Serialization: Use InferenceScheduler::new(1) to ensure only one agent runs inference at a time, allowing multiple complex agents to run on consumer hardware with limited VRAM.
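
As a sketch, sharing one engine between two agents might look like the snippet below. The InferenceEngine::load constructor and the exact import paths are assumptions made for illustration; .engine(), .scheduler(), InferenceScheduler::new(1), and chat_simple() come from the API reference and examples elsewhere in this document.

use std::sync::Arc;
use llama_cpp_v3_agent_sdk::{AgentBuilder, InferenceEngine, InferenceScheduler};

fn main() -> anyhow::Result<()> {
    // Load the GGUF weights into VRAM once (constructor name assumed for illustration).
    let engine = Arc::new(InferenceEngine::load("model.gguf")?);
    // A scheduler with a single permit serializes inference across all agents.
    let scheduler = Arc::new(InferenceScheduler::new(1));

    let mut planner = AgentBuilder::new()
        .engine(engine.clone())
        .scheduler(scheduler.clone())
        .system_prompt("You break tasks into steps.")
        .build()?;

    let mut executor = AgentBuilder::new()
        .engine(engine)
        .scheduler(scheduler)
        .system_prompt("You carry out individual steps.")
        .build()?;

    let plan = planner.chat_simple("Plan a cleanup of the ./logs directory.")?;
    let result = executor.chat_simple(&plan)?;
    println!("{result}");
    Ok(())
}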

2. Multi-Agent Workflow Engine

The WorkflowEngine is a stateless runner designed to orchestrate sequences of agents. It decouples the execution logic from state persistence.

  • Input Resolution (input_mapping): Steps define data dependencies. The engine resolves keys from the initial context and overrides them with artifacts from named previous steps. JSON artifacts are automatically parsed into structured objects for prompts.
  • Conditional Execution: Supports dynamic skipping based on previous outputs using dot-notation (e.g., step_name.field_name) to look into dependency JSON results.
  • Workflow Storage: To maintain state, implement the WorkflowStorage trait. This enables Workflow Resumption, skipping already-completed steps if a pipeline is restarted with the same session_id.
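
A sketch of what such a pipeline definition might look like is shown below. Only input_mapping and the step_name.field_name dot-notation come from the description above; the other field names (steps, name, agent, condition) are illustrative assumptions rather than the crate's confirmed schema.

{
  "steps": [
    {
      "name": "triage",
      "agent": "bug_triager",
      "input_mapping": { "report": "initial.bug_report" }
    },
    {
      "name": "patch",
      "agent": "code_fixer",
      "input_mapping": { "diagnosis": "triage.root_cause" },
      "condition": "triage.is_reproducible"
    }
  ]
}

In this sketch the patch step runs only when the triage step's JSON artifact marks the bug as reproducible, and its prompt receives the parsed root_cause field from that artifact.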

3. Conversation & Message Compaction

The Conversation struct manages message history. When the context limit (n_ctx) is reached, the agent employs a compaction strategy to summarize or prune older messages while preserving the system prompt and critical context.
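
Conceptually, a compaction pass behaves like the sketch below; the Message type and the summarize closure are stand-ins for illustration, not the crate's actual Conversation internals.

// Conceptual sketch only: illustrates the compaction strategy, not the real API.
struct Message {
    role: String,
    content: String,
}

fn compact(history: &mut Vec<Message>, summarize: impl Fn(&[Message]) -> String) {
    // Short conversations still fit in n_ctx and need no compaction.
    if history.len() <= 6 {
        return;
    }
    // Always preserve the system prompt.
    let system = history.remove(0);
    // Keep the most recent messages verbatim so critical context survives.
    let recent = history.split_off(history.len() - 4);
    // Collapse the older middle of the conversation into a single summary message.
    let summary = Message {
        role: "assistant".into(),
        content: summarize(history.as_slice()),
    };
    *history = vec![system, summary];
    history.extend(recent);
}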

4. Tool-Use Loop Mechanics

The agent follows a rigorous cycle:

  1. Generate: The LLM generates text until it produces a <tool_call> block or hits a stop sequence.
  2. Parse: The engine parses the block using a JSON schema for the requested tool.
  3. Execute: The tool's execute method is called (subject to permissions).
  4. Repeat: The tool output is appended as a tool message, and the cycle repeats until a final answer is produced.
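
For illustration, a single tool call might be emitted like the line below; the tag contents and JSON layout are assumptions based on common tool-calling formats, not the crate's documented wire protocol.

<tool_call>{"name": "bash", "arguments": {"command": "df -h ."}}</tool_call>

The engine validates the arguments against the bash tool's schema, executes the command once permission is granted, appends the output as a tool message, and resumes generation with that output in context.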

Configuration Reference

AgentBuilder API

Model Loading (Standalone Mode)

  • .backend(Backend): Sets the compute target. Supports CUDA, Vulkan, SYCL, and CPU.
  • .model_path(&str): Path to the GGUF weights file.
  • .n_gpu_layers(i32): GPU offloading; -1 offloads all layers, 0 keeps everything on the CPU.
  • .chat_template(&str): Custom Jinja template for prompt formatting.
  • .app_name(&str): Application identifier used for cache directory naming.
  • .cache_dir(PathBuf): Directory where runtime DLLs are downloaded and stored.
  • .explicit_dll_path(PathBuf): Direct path to llama.dll, bypassing the download step.
  • .dll_version(&str): Pins the specific llama.cpp version to download.
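
For example, a standalone agent that pins where its runtime DLLs live might be built like this; the model path, app name, and version string are placeholders.

use std::path::PathBuf;
use llama_cpp_v3_agent_sdk::AgentBuilder;
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("models/assistant.gguf")       // placeholder GGUF path
        .n_gpu_layers(-1)                          // offload all layers
        .app_name("my-agent")                      // names the cache directory
        .cache_dir(PathBuf::from(".llama-cache"))  // where runtime DLLs are stored
        .dll_version("b4500")                      // placeholder llama.cpp version pin
        .build()?;

    println!("{}", agent.chat_simple("Hello")?);
    Ok(())
}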

Shared Engine Mode

  • .engine(Arc<InferenceEngine>): Injects an already loaded model so agents share the same weights.
  • .scheduler(Arc<Scheduler>): Enables context pooling; required for high-density agent deployments.

Inference & Sampling

  • .n_ctx(u32): Context size; the maximum number of tokens for history plus generation.
  • .temperature(f32): Sampling temperature; higher is more random, lower is more deterministic.
  • .max_tokens_per_completion(usize): Hard cap on tokens generated in a single model call.
  • .top_k(i32): Top-K sampling; limits the vocabulary to the K most likely tokens.
  • .min_p(f32): Min-P sampling; filters out tokens below a probability threshold.
  • .repeat_penalty(f32): Repetition control; penalizes already generated tokens.
  • .stop_sequence(&str): Adds a string that terminates generation when produced.
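
As an illustration, a conservative sampling setup could look like the fragment below (all values are arbitrary); it would sit inside a builder chain like the Quick Start's, in a function returning anyhow::Result.

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .n_ctx(8192)                         // history + generation budget
    .max_tokens_per_completion(1024)     // hard cap per model call
    .temperature(0.2)                    // mostly deterministic output
    .top_k(40)
    .min_p(0.05)
    .repeat_penalty(1.1)
    .stop_sequence("</answer>")          // illustrative stop marker
    .build()?;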

Agent Behavior & Loop

  • .system_prompt(&str): Defines the agent's persona and mission.
  • .max_iterations(usize): Caps the tool-use loop to prevent runaway iterations.
  • .auto_approve(): Disables the permission system entirely ("YOLO mode").
  • .permission_callback(F): Human-in-the-loop callback for manual tool approval.
  • .tool(Box<dyn Tool>): Registers a custom tool implementation.
  • .skip_builtin_tools(): Minimal mode; disables bash, read, write, and the other built-in tools.
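
A sketch of a restricted agent using these options; the permission callback's parameters (a tool name and its JSON arguments) and its bool return value are an assumed signature, shown only to illustrate the approval flow.

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .system_prompt("You inspect the repository but never modify files.")
    .max_iterations(8)                   // cap the tool-use loop
    .permission_callback(|tool, args| {
        // Assumed signature: return true to approve the call, false to deny it.
        println!("approve {tool}? args: {args}");
        tool != "bash"
    })
    .build()?;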

Skills & Discovery

  • .skills_path(PathBuf): Adds a directory to search during skill discovery.
  • .activate_skill(&str): Explicitly loads a named skill.
  • .no_skills(): Disables scanning for skill folders.
  • .no_agents_md(): Disables scanning for AGENTS.md files.
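
For example (the directory and skill name are placeholders):

let mut agent = AgentBuilder::new()
    .backend(Backend::Vulkan)
    .model_path("model.gguf")
    .skills_path(PathBuf::from("./skills"))   // extra directory to scan for skills
    .activate_skill("code-review")            // placeholder skill name
    .no_agents_md()                           // ignore AGENTS.md registries
    .build()?;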

Documentation & Tutorials


Quick Start (Standalone Agent)

use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder};
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("model.gguf")
        .n_gpu_layers(-1) // Offload all to GPU
        .system_prompt("You are a system administrator.")
        .auto_approve()
        .build()?;

    let response = agent.chat_simple("What is the disk usage in the current directory?")?;
    println!("Agent says: {}", response);
    Ok(())
}

License

MIT