§llama-cpp-v3-agent-sdk
Agentic tool-use loop and multi-agent orchestration on top of llama-cpp-v3. Built for developers who need high-performance, local LLM agents with predictable state management and VRAM efficiency.
§Features
- 🤖 Agentic Loop: Generate → parse tool calls → execute → repeat.
- 🔧 Native Tooling: High-fidelity `bash`, `read`, `write`, `edit`, and `glob` implementations.
- 🔒 Granular Permissions: Callback-based system for human-in-the-loop tool approval.
- 🏗️ Memory-Mapped Inference: Shared model weights across multiple agents via `InferenceEngine`.
- 🌊 Workflow Orchestration: Define complex multi-agent pipelines in JSON with dynamic input mapping and conditional execution.
- 📡 Reactive Events: Stream text deltas, tool lifecycle events, and pipeline progress.
§Technical Architecture
§1. VRAM Management & Context Pooling
Unlike traditional wrappers that initialize a full context per agent, llama-cpp-v3-agent-sdk uses an InferenceScheduler to manage physical memory.
- Shared Engine: Multiple agents can share a single `InferenceEngine` instance, loading the model weights into VRAM only once.
- Context Pooling: Agents do not own a `LlamaContext` by default. Instead, they request a temporary permit from a pool.
- Serialization: Use `InferenceScheduler::new(1)` to ensure only one agent runs inference at a time, allowing multiple complex agents to run on consumer hardware with limited VRAM (see the sketch below).
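A minimal sketch of this pattern, using the `.engine(...)` and `.scheduler(...)` builder methods from the configuration reference below. The `InferenceEngine::new("model.gguf")` constructor is an assumption for illustration; the real loading API lives in the `inference` module.

```rust
use std::sync::Arc;

use llama_cpp_v3_agent_sdk::{AgentBuilder, InferenceEngine, InferenceScheduler};

fn main() -> anyhow::Result<()> {
    // Load the model weights into VRAM once. NOTE: this constructor is an
    // assumption for illustration; see the `inference` module for the real API.
    let engine = Arc::new(InferenceEngine::new("model.gguf")?);

    // A permit pool of size 1: only one agent's context is live at a time.
    let scheduler = Arc::new(InferenceScheduler::new(1));

    let mut planner = AgentBuilder::new()
        .engine(Arc::clone(&engine))
        .scheduler(Arc::clone(&scheduler))
        .system_prompt("You break tasks into numbered steps.")
        .build()?;

    let mut reviewer = AgentBuilder::new()
        .engine(engine)
        .scheduler(scheduler)
        .system_prompt("You review plans for missing steps.")
        .build()?;

    let plan = planner.chat_simple("Plan a zero-downtime database migration.")?;
    let review = reviewer.chat_simple(&format!("Review this plan:\n{plan}"))?;
    println!("{review}");
    Ok(())
}
```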
§2. Multi-Agent Workflow Engine
The WorkflowEngine is a stateless runner designed to orchestrate sequences of agents. It decouples the execution logic from state persistence.
- Input Resolution (`input_mapping`): Steps define data dependencies. The engine resolves keys from the initial context and overrides them with artifacts from named previous steps. JSON artifacts are automatically parsed into structured objects for prompts.
- Conditional Execution: Supports dynamic skipping based on previous outputs, using dot-notation (e.g., `step_name.field_name`) to look into dependency JSON results.
- Workflow Storage: To maintain state, implement the `WorkflowStorage` trait. This enables Workflow Resumption, skipping already-completed steps if a pipeline is restarted with the same `session_id`.
§3. Conversation & Message Compaction
The Conversation struct manages message history. When the context limit (n_ctx) is reached, the agent employs a compaction strategy to summarize or prune older messages while preserving the system prompt and critical context.
§4. Tool-Use Loop Mechanics
The agent follows a rigorous cycle:
1. Generate: The LLM generates text until it produces a `<tool_call>` block or hits a stop sequence.
2. Parse: The engine parses the block using a JSON schema for the requested tool.
3. Execute: The tool's `execute` method is called (subject to permissions).
4. Repeat: The tool output is appended as a `tool` message, and the cycle repeats until a final answer is produced.
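Custom tools slot into this loop via the `Tool` trait, registered with `.tool(Box<dyn Tool>)`. The exact trait signature is documented in the Custom Tools Guide; the method shapes below (`name`, a JSON-schema descriptor, and the `execute` method mentioned in step 3) are assumptions for illustration only.

```rust
use llama_cpp_v3_agent_sdk::{Tool, ToolResult};
use serde_json::{json, Value};

struct WordCount;

// ASSUMED trait shape for illustration; the real `Tool` trait lives in the
// `tool` module and its method signatures may differ.
impl Tool for WordCount {
    fn name(&self) -> &str {
        "word_count"
    }

    fn schema(&self) -> Value {
        // The JSON schema the engine uses to parse the <tool_call> block.
        json!({
            "type": "object",
            "properties": { "text": { "type": "string" } },
            "required": ["text"]
        })
    }

    fn execute(&self, args: Value) -> ToolResult {
        let count = args["text"].as_str().map_or(0, |t| t.split_whitespace().count());
        // ASSUMED constructor; the real `ToolResult` API may differ.
        ToolResult::ok(count.to_string())
    }
}
```

Whatever `execute` returns is what gets appended as the `tool` message in step 4.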
§Configuration Reference
§AgentBuilder API
§Model Loading (Standalone Mode)
| Method | Description | Technical Detail |
|---|---|---|
| `.backend(Backend)` | Set compute target | Supports CUDA, Vulkan, SYCL, and CPU. |
| `.model_path(&str)` | GGUF model path | Path to the weights file. |
| `.n_gpu_layers(i32)` | GPU offloading | `-1` for all, `0` for none. |
| `.chat_template(&str)` | Custom template | Jinja template for prompt formatting. |
| `.app_name(&str)` | App identifier | Used for cache directory naming. |
| `.cache_dir(PathBuf)` | DLL cache | Where runtime DLLs are downloaded/stored. |
| `.explicit_dll_path(PathBuf)` | Bypass download | Direct path to llama.dll. |
| `.dll_version(&str)` | Version pinning | Specific llama.cpp version to download. |
§Shared Engine Mode
| Method | Description | Technical Detail |
|---|---|---|
| `.engine(Arc<InferenceEngine>)` | Shared weights | Injects an already loaded model. |
| `.scheduler(Arc<InferenceScheduler>)` | Context pooling | Required for high-density agent deployments. |
§Inference & Sampling
| Method | Description | Technical Detail |
|---|---|---|
| `.n_ctx(u32)` | Context size | Maximum tokens (history + generation). |
| `.temperature(f32)` | Creativity | Higher = more random, lower = more deterministic. |
| `.max_tokens_per_completion(usize)` | Token limit | Hard cap on a single model call. |
| `.top_k(i32)` | Top-K sampling | Limits vocabulary to the K most likely tokens. |
| `.min_p(f32)` | Min-P sampling | Filters tokens below a probability threshold. |
| `.repeat_penalty(f32)` | Repetition control | Penalizes already-generated tokens. |
| `.stop_sequence(&str)` | Stop markers | Adds strings that terminate generation. |
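Taken together on the builder (a sketch; the values are illustrative, not recommended defaults):

```rust
use llama_cpp_v3_agent_sdk::AgentBuilder;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .model_path("model.gguf")
        .n_ctx(8192)                      // max tokens: history + generation
        .temperature(0.2)                 // low = mostly deterministic
        .top_k(40)                        // keep only the 40 most likely tokens
        .min_p(0.05)                      // drop tokens below the probability threshold
        .repeat_penalty(1.1)              // discourage verbatim repetition
        .max_tokens_per_completion(1024)  // hard cap on a single model call
        .stop_sequence("# END")           // terminate generation on this marker
        .build()?;

    let answer = agent.chat_simple("Summarize the Council of Elrond in one line.")?;
    println!("{answer}");
    Ok(())
}
```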
§Agent Behavior & Loop
| Method | Description | Technical Detail |
|---|---|---|
| `.system_prompt(&str)` | Instructions | Defines the agent's persona and mission. |
| `.max_iterations(usize)` | Iteration cap | Prevents runaway tool-use loops. |
| `.auto_approve()` | YOLO mode | Disables the permission system. |
| `.permission_callback(F)` | Human-in-the-loop | Callback for manual tool approval. |
| `.tool(Box<dyn Tool>)` | Custom tools | Register your own tool implementations. |
| `.skip_builtin_tools()` | Minimal mode | Disables `bash`, `read`, `write`, etc. |
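A hedged sketch of `.permission_callback(F)`. The crate re-exports `PermissionRequest` and `PermissionDecision`, but the callback signature, the `tool_name` field, and the `Allow`/`Deny` variants used here are assumptions for illustration; consult the `permission` module for the real shapes.

```rust
use llama_cpp_v3_agent_sdk::{AgentBuilder, PermissionDecision, PermissionRequest};

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .model_path("model.gguf")
        // ASSUMED callback shape, field, and variants; see the `permission` module.
        .permission_callback(|req: &PermissionRequest| {
            // Auto-approve read-only tools; block everything else.
            match req.tool_name.as_str() {
                "read" | "glob" => PermissionDecision::Allow,
                _ => PermissionDecision::Deny,
            }
        })
        .build()?;

    let answer = agent.chat_simple("List the Rust files in this project.")?;
    println!("{answer}");
    Ok(())
}
```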
§Skills & Discovery
| Method | Description | Technical Detail |
|---|---|---|
| `.skills_path(PathBuf)` | Search paths | Adds directories for skill discovery. |
| `.activate_skill(&str)` | Specific skill | Explicitly loads a named skill. |
| `.no_skills()` | Disable skills | Prevents scanning for skill folders. |
| `.no_agents_md()` | Disable registry | Prevents scanning for AGENTS.md files. |
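For example, pointing discovery at a project-local directory and pinning the skill built in the walkthrough below (methods from the table above; paths illustrative):

```rust
use std::path::PathBuf;

use llama_cpp_v3_agent_sdk::AgentBuilder;

fn main() -> anyhow::Result<()> {
    let _agent = AgentBuilder::new()
        .model_path("model.gguf")
        .skills_path(PathBuf::from("./skills")) // directory scanned for skill folders
        .activate_skill("lotr-scene-generator") // explicitly load one named skill
        .no_agents_md()                         // skip AGENTS.md discovery
        .build()?;
    Ok(())
}
```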
§Documentation & Tutorials
- Case Study: Drama Scene Generation - A complete guide to building a multi-agent storytelling pipeline.
- Custom Tools Guide - Learn how to implement the `Tool` trait.
- Skill Development - How to package prompts and workflows into reusable “Skills”.
§Quick Start (Standalone Agent)
```rust
use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder};
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("model.gguf")
        .n_gpu_layers(-1) // Offload all layers to the GPU
        .system_prompt("You are a system administrator.")
        .auto_approve()
        .build()?;

    let response = agent.chat_simple("What is the disk usage in the current directory?")?;
    println!("Agent says: {}", response);
    Ok(())
}
```
§License
MIT
§Multi-Agent Workflow Guide (In-Depth Walkthrough)
Below is the complete technical walkthrough for building complex multi-agent pipelines.
§Case Study: Lord of the Rings - The Council of Elrond
This guide provides a comprehensive technical walkthrough for building a state-of-the-art narrative generation pipeline using the Skill-based Workflow system in llama-cpp-v3-agent-sdk. We will orchestrate a specialized “Pipeline of Experts” to generate high-stakes drama from The Lord of the Rings.
§1. Core Philosophy: The Pipeline of Experts
Traditional “Mega-Prompts” often suffer from Instruction Bleed—a phenomenon where the LLM, overwhelmed by too many constraints, begins to ignore formatting rules, skips character nuances, or forgets specific lore requirements.
llama-cpp-v3-agent-sdk solves this through Agentic Isolation. By splitting a complex task into multiple specialized steps, we ensure:
- Focused Context: Each agent only sees the information it needs for its specific task.
- Strict Formatting: Smaller prompts allow for 100% adherence to complex JSON schemas or Markdown structures.
- VRAM Efficiency: Only one agent’s context is active at a time, allowing high-fidelity generation on consumer hardware.
§2. Anatomy of a “Skill”
A Skill is a self-contained module that encapsulates a specific workflow. This modularity allows developers to swap out “Writer” or “Critic” prompts without changing the application code.
§Skill Directory Structure
```text
skills/lotr-scene-generator/
├── SKILL.md               # Metadata & Requirements
├── workflow.json          # Orchestration logic
├── prompts/               # Specialized agent instructions
│   ├── planner.md
│   ├── writer.md
│   ├── critic.md
│   └── rewrite.md
├── schemas/               # JSON validation schemas
│   └── review_report.json
└── references/            # Lore & Context
    └── lore_reference.md
```
§2.1 Skill Metadata (SKILL.md)
This file serves as the documentation for the skill, outlining its purpose and required input context.
```markdown
---
name: lotr-scene-generator
description: Generate high-fidelity dramatic scenes set in Middle-earth. It is specifically tuned for the 'Council of Elrond' style of debate.
---

# Lord of the Rings: Council Scene Generator
This skill generates high-fidelity dramatic scenes set in Middle-earth. It is specifically tuned for the 'Council of Elrond' style of debate.

## Required Input Context
- `outline`: (String) A brief summary of the confrontation.
- `characters`: (Array) List of Tolkien characters to include.
- `lore_strictness`: (Number 0-1) How strictly to adhere to canon.
```
§2.2 The Planner Prompt (prompts/planner.md)
The Planner is responsible for structure. It must output clean JSON.
```markdown
# Role: Middle-earth Scene Architect
You are an expert at narrative structure and Tolkien's storytelling patterns.

# Task
Deconstruct the provided scene outline into a detailed beat sheet.

# Requirements
1. Define 3-5 distinct emotional shifts.
2. Specify the "Lore Anchor" for this scene (e.g., the history of Isildur).
3. Identify the core conflict for each character.

# Output Format
You MUST output a valid JSON object following this structure:
{
  "beats": [{"description": "string", "emotion": "string"}],
  "lore_anchor": "string",
  "character_goals": {"character_name": "string"}
}
```
§2.3 The Writer Prompt (prompts/writer.md)
The Writer focuses on dialogue and prose. It receives the Planner’s JSON output.
```markdown
# Role: Epic Fantasy Dramatist
You are a master of dialogue, subtext, and the specific voices of Middle-earth.

# Input Specification
You will receive a `beats` object from the Architect.

# Voice Guidelines
- Elrond: Ancient, weary but hopeful, authoritative.
- Boromir: Proud, desperate, uses military metaphors.
- Aragorn: Quietly noble, guarded, uses archaic but simple speech.

# Task
Write the full screenplay for the scene. Use the provided beats to drive the tension.
```
§2.4 The Critic Schema (schemas/review_report.json)
By providing a schema, you ensure the Critic’s feedback is actionable by the engine.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "overall_score": { "type": "number", "minimum": 0, "maximum": 1 },
    "lore_errors_found": { "type": "boolean" },
    "voice_consistency": { "type": "string" },
    "must_fix_notes": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["overall_score", "lore_errors_found", "must_fix_notes"]
}
```
§2.5 Lore Reference (references/lore_reference.md)
Static files included in the skill to ground the agents in your specific world.
```markdown
# The Nature of the One Ring
- It cannot be used for good, no matter the intent.
- It corrupts through the user's desire to do good (e.g., Boromir's desire to save Gondor).
- Only in the fires of Mount Doom can it be unmade.
```
§3. The Orchestration Logic (workflow.json)
The workflow.json file is the declarative manifest of your pipeline. It manages data dependencies and control flow.
```json
{
  "name": "Middle-earth Narrative Pipeline",
  "steps": [
    {
      "name": "planner",
      "description": "Constructing scene architecture for Rivendell...",
      "agent_prompt": "prompts/planner.md",
      "temperature": 0.2,
      "output_type": "json"
    },
    {
      "name": "writer",
      "description": "Drafting the Council of Elrond screenplay...",
      "agent_prompt": "prompts/writer.md",
      "temperature": 0.8,
      "stop_sequences": ["# END OF SCENE"],
      "input_mapping": {
        "beats": "planner"
      }
    },
    {
      "name": "critic",
      "description": "Validating Ring-lore and character voices...",
      "agent_prompt": "prompts/critic.md",
      "temperature": 0.1,
      "output_type": "json",
      "input_mapping": {
        "initial_outline": "outline",
        "draft": "writer"
      }
    },
    {
      "name": "rewrite",
      "description": "Refining the dialogue based on Critic's lore check...",
      "agent_prompt": "prompts/rewrite.md",
      "conditional": "critic.lore_errors_found",
      "input_mapping": {
        "original_draft": "writer",
        "lore_report": "critic"
      }
    }
  ]
}
```
§In-Depth Feature Explanations
§A. Input Mapping (input_mapping)
This is the “wiring” of your pipeline. By default, every step receives the initial global context. However, input_mapping allows you to inject results from previous steps into specific keys.
- Example: The `writer` step above will receive a JSON object with a key `"beats"` containing the result of the `planner` step. The engine handles the lookup and merging automatically.
§B. Conditional Execution (conditional)
The engine uses dot-notation (e.g., critic.lore_errors_found) to determine if a step should run.
1. The engine fetches the `"critic"` result.
2. It attempts to parse it as JSON.
3. It checks the value of `"lore_errors_found"`. If `false`, the `rewrite` step is entirely bypassed.
§C. Output Types
Setting output_type: "json" triggers an internal “Sanitization Pass.” The engine extracts the {...} block from the agent’s output, cleans control characters, and parses it. This ensures that downstream agents receive clean data objects rather than raw strings with conversational filler.
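Roughly what such a pass does, as an illustrative sketch only (this is not the SDK's actual implementation, which is internal to the engine):

```rust
// Illustrative only: keep the outermost {...} span of a model response and
// drop stray control characters, as described above.
fn extract_json_block(raw: &str) -> Option<String> {
    let start = raw.find('{')?;
    let end = raw.rfind('}')?;
    if end < start {
        return None;
    }
    let cleaned: String = raw[start..=end]
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();
    Some(cleaned)
}

fn main() {
    let raw = "Sure! Here is the report:\n{\"overall_score\": 0.9}\nHope that helps!";
    assert_eq!(extract_json_block(raw).as_deref(), Some("{\"overall_score\": 0.9}"));
}
```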
§4. State Persistence: Implementing WorkflowStorage
For a professional system, losing progress due to a network error or crash is unacceptable. The WorkflowEngine uses a Stateless Engine + Stateful Storage pattern.
By implementing WorkflowStorage (e.g., using SQLite), you enable Stateful Resumption.
§Implementation Example (SQLite)
```rust
use std::collections::HashMap;

use llama_cpp_v3_agent_sdk::workflow::{Result, WorkflowStorage};
use rusqlite::{params, Connection};

struct SqliteArchive {
    conn: Connection,
}

impl WorkflowStorage for SqliteArchive {
    fn insert_artifact(&self, session_id: &str, artifact_type: &str, content: &str, is_json: bool) -> Result<()> {
        // Record the artifact; 'artifact_type' maps to 'step.name' in workflow.json.
        // (Assumes the SDK's error type converts from rusqlite::Error; otherwise map it.)
        self.conn.execute(
            "INSERT INTO artifacts (session_id, step_name, content, is_json) VALUES (?1, ?2, ?3, ?4)",
            params![session_id, artifact_type, content, is_json],
        )?;
        Ok(())
    }

    fn get_latest_artifacts(&self, session_id: &str) -> Result<HashMap<String, String>> {
        // Load the most recent artifact per step for this session.
        // The engine uses this to populate context for resumed runs.
        let mut results = HashMap::new();
        // ... fetch from DB ...
        Ok(results)
    }
}
```
§5. Execution & Lifecycle Management
The WorkflowEngine manages the VRAM lifecycle of agents. To keep memory usage low, it follows a Build-Run-Drop pattern:
1. Build: Constructs an `Agent` using the shared `InferenceEngine` and `InferenceScheduler`.
2. Run: Executes the inference and streams tokens.
3. Drop: The agent and its associated `LlamaContext` are dropped immediately after completion, freeing VRAM for the next step.
§Running the Council of Elrond Pipeline
```rust
use std::io::{self, Write};

use llama_cpp_v3_agent_sdk::workflow::{PipelineEvent, WorkflowEngine};
use serde_json::json;

let engine = WorkflowEngine::new(inference_engine, scheduler, my_storage, skills_path);

// 1. Define the scene requirements
let context = json!({
    "outline": "Boromir demands the Ring for the defense of Gondor. Aragorn reveals himself as the Heir of Isildur.",
    "tone": "Grandiose and Tense"
});

// 2. Start the orchestrated run (inside an async context)
let results = engine.run(
    "lotr-scene-generator",
    "council-session-001",
    context,
    None,    // resume_from_step: used to restart from a specific failed point
    vec![],  // force_regenerate: used to ignore cache and redo specific steps
    |event| match event {
        PipelineEvent::StepStarted { name, .. } => {
            println!("\n[PHASE: {}]", name.to_uppercase());
        },
        PipelineEvent::Token { token, .. } => {
            print!("{}", token);
            io::stdout().flush().unwrap();
        },
        PipelineEvent::Processing { message, .. } => {
            println!("\n[ENGINE]: {}", message);
        },
        _ => {}
    },
).await?;
```
§6. Advanced: Output Post-Processing
Sometimes agents generate syntax that is slightly off (e.g., they might use (Action) instead of the required [ACTION] tag). You can define Post-Processing Rules in your skill:
§skills/lotr-scene-generator/schemas/post_process.json
```json
{
  "rules": [
    {
      "pattern": "\\((.*?)\\)",
      "replacement": "[ACTION: $1]"
    }
  ]
}
```
The engine automatically applies these regex-based rules to the writer and rewrite steps before persisting the results. This ensures your final data is always compliant with your application's requirements.
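The effect of that rule, reproduced here with the `regex` crate purely for illustration (the SDK applies the rules internally; this only demonstrates the pattern/replacement semantics):

```rust
use regex::Regex;

fn main() {
    // Same pattern and replacement as post_process.json above.
    let rule = Regex::new(r"\((.*?)\)").unwrap();
    let draft = "BOROMIR (rises, voice trembling) It is a gift!";
    let fixed = rule.replace_all(draft, "[ACTION: $1]");
    assert_eq!(fixed, "BOROMIR [ACTION: rises, voice trembling] It is a gift!");
}
```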
§7. Developer Best Practices
- Low Temperature for Logic: Set `temperature` to 0.1 or 0.2 for `planner` and `critic` steps to ensure deterministic and logical results.
- High Temperature for Prose: Set `temperature` to 0.8 for the `writer` to allow for varied and creative dialogue.
- Schema Validation: Always use `output_type: "json"` for steps that drive logic (like the `critic`), as this allows for robust conditional branching.
- Session Isolation: Use a unique `session_id` for every generation request to prevent data corruption in the storage layer.
§Re-exports
```rust
pub use agent::Agent;
pub use agent::AgentBuilder;
pub use agent_loop::AgentEvent;
pub use agent_loop::AgentLoopConfig;
pub use agent_loop::CompletionReason;
pub use agent_loop::KvCacheState;
pub use agents_md::AgentsMdFile;
pub use agents_md::AgentsMdRegistry;
pub use conversation::Conversation;
pub use conversation::Message;
pub use conversation::Role;
pub use error::AgentError;
pub use inference::templates;
pub use inference::InferenceConfig;
pub use inference::InferenceEngine;
pub use inference::InferenceScheduler;
pub use permission::PermissionDecision;
pub use permission::PermissionMode;
pub use permission::PermissionRequest;
pub use permission::PermissionTracker;
pub use skills::Skill;
pub use skills::SkillMeta;
pub use skills::SkillRegistry;
pub use tool::Tool;
pub use tool::ToolCall;
pub use tool::ToolRegistry;
pub use tool::ToolResult;
pub use workflow::*;
```
§Modules
- `agent`
- `agent_loop`
- `agents_md`: AGENTS.md discovery and loading.
- `conversation`
- `error`
- `inference`: Shared inference engine — load a model once, share it across agents.
- `permission`
- `skills`: Skill discovery and loading.
- `tool`
- `tools`
- `workflow`