Crate llama_cpp_v3_agent_sdk

§llama-cpp-v3-agent-sdk

Agentic tool-use loop and multi-agent orchestration on top of llama-cpp-v3. Built for developers who need high-performance, local LLM agents with predictable state management and VRAM efficiency.

§Features

  • 🤖 Agentic Loop: Generate → parse tool calls → execute → repeat.
  • 🔧 Native Tooling: High-fidelity bash, read, write, edit, and glob implementations.
  • 🔒 Granular Permissions: Callback-based system for human-in-the-loop tool approval.
  • 🏗️ Memory-Mapped Inference: Shared model weights across multiple agents via InferenceEngine.
  • 🌊 Workflow Orchestration: Define complex multi-agent pipelines in JSON with dynamic input mapping and conditional execution.
  • 📡 Reactive Events: Stream text deltas, tool lifecycle events, and pipeline progress.

§Technical Architecture

§1. VRAM Management & Context Pooling

Unlike traditional wrappers that initialize a full context per agent, llama-cpp-v3-agent-sdk uses an InferenceScheduler to manage physical memory.

  • Shared Engine: Multiple agents can share a single InferenceEngine instance, loading the model weights into VRAM only once.
  • Context Pooling: Agents do not own a LlamaContext by default. Instead, they request a temporary permit from a pool.
  • Serialization: Use InferenceScheduler::new(1) to ensure only one agent runs inference at a time, allowing multiple complex agents to run on consumer hardware with limited VRAM (a wiring sketch follows this list).
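
A minimal sketch of this wiring (engine construction is elided; method and type names follow the configuration reference below, so exact signatures may differ slightly):

use std::sync::Arc;
use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder, InferenceEngine, InferenceScheduler};

// Two agents share one engine (weights loaded once) and a single-permit
// scheduler, so only one of them runs inference at any moment.
fn build_agents(engine: Arc<InferenceEngine>) -> anyhow::Result<(Agent, Agent)> {
    let scheduler = Arc::new(InferenceScheduler::new(1));
    let planner = AgentBuilder::new()
        .engine(engine.clone())
        .scheduler(scheduler.clone())
        .system_prompt("You break problems into steps.")
        .build()?;
    let executor = AgentBuilder::new()
        .engine(engine)
        .scheduler(scheduler)
        .system_prompt("You carry out individual steps.")
        .build()?;
    Ok((planner, executor))
}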

§2. Multi-Agent Workflow Engine

The WorkflowEngine is a stateless runner designed to orchestrate sequences of agents. It decouples the execution logic from state persistence.

  • Input Resolution (input_mapping): Steps define data dependencies. The engine resolves keys from the initial context and overrides them with artifacts from named previous steps. JSON artifacts are automatically parsed into structured objects for prompts.
  • Conditional Execution: Supports dynamically skipping steps based on previous outputs, using dot-notation (e.g., step_name.field_name) to inspect fields inside a dependency’s JSON result.
  • Workflow Storage: To maintain state, implement the WorkflowStorage trait. This enables Workflow Resumption, skipping already-completed steps if a pipeline is restarted with the same session_id.

§3. Conversation & Message Compaction

The Conversation struct manages message history. When the context limit (n_ctx) is reached, the agent employs a compaction strategy to summarize or prune older messages while preserving the system prompt and critical context.
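
The crate’s exact compaction strategy is internal, but the idea can be pictured with a simplified, self-contained pruning pass (illustrative only; Msg, compact, and the token counter below are hypothetical, not this crate’s API):

struct Msg { role: &'static str, content: String }

// Keep the system prompt (index 0) and the newest turns; drop the oldest
// user/assistant messages until the history fits within n_ctx.
fn compact(history: &mut Vec<Msg>, n_ctx: usize, count_tokens: impl Fn(&Msg) -> usize) {
    let mut total: usize = history.iter().map(&count_tokens).sum();
    while total > n_ctx && history.len() > 2 {
        let removed = history.remove(1); // oldest non-system message
        total -= count_tokens(&removed);
    }
}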

§4. Tool-Use Loop Mechanics

The agent follows a rigorous cycle:

  1. Generate: The LLM generates text until it produces a <tool_call> block or hits a stop sequence.
  2. Parse: The engine parses the block using a JSON schema for the requested tool (a hand-rolled parsing sketch follows this list).
  3. Execute: The tool’s execute method is called (subject to permissions).
  4. Repeat: The tool output is appended as a tool message, and the cycle repeats until a final answer is produced.
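
For illustration, a hand-rolled version of step 2 might look like the sketch below (the </tool_call> closing tag and the helper itself are assumptions; the crate’s actual parser is internal):

use serde_json::Value;

// Hypothetical sketch: pull the JSON body out of a <tool_call> block.
fn parse_tool_call(output: &str) -> Option<Value> {
    let start = output.find("<tool_call>")? + "<tool_call>".len();
    let end = output[start..].find("</tool_call>")? + start;
    serde_json::from_str(output[start..end].trim()).ok()
}

fn main() {
    let raw = r#"<tool_call>{"name": "bash", "arguments": {"command": "df -h ."}}</tool_call>"#;
    println!("{:?}", parse_tool_call(raw));
}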

§Configuration Reference

§AgentBuilder API

§Model Loading (Standalone Mode)
| Method | Description | Technical Detail |
|---|---|---|
| `.backend(Backend)` | Set compute target | Supports CUDA, Vulkan, SYCL, and CPU. |
| `.model_path(&str)` | GGUF model path | Path to the weights file. |
| `.n_gpu_layers(i32)` | GPU offloading | `-1` for all, `0` for none. |
| `.chat_template(&str)` | Custom template | Jinja template for prompt formatting. |
| `.app_name(&str)` | App identifier | Used for cache directory naming. |
| `.cache_dir(PathBuf)` | DLL cache | Where runtime DLLs are downloaded/stored. |
| `.explicit_dll_path(PathBuf)` | Bypass download | Direct path to `llama.dll`. |
| `.dll_version(&str)` | Version pinning | Specific llama.cpp version to download. |
§Shared Engine Mode
| Method | Description | Technical Detail |
|---|---|---|
| `.engine(Arc<InferenceEngine>)` | Shared weights | Injects an already-loaded model. |
| `.scheduler(Arc<Scheduler>)` | Context pooling | Required for high-density agent deployments. |
§Inference & Sampling
| Method | Description | Technical Detail |
|---|---|---|
| `.n_ctx(u32)` | Context size | Maximum tokens (history + generation). |
| `.temperature(f32)` | Creativity | Higher = more random; lower = more deterministic. |
| `.max_tokens_per_completion(usize)` | Token limit | Hard cap on a single model call. |
| `.top_k(i32)` | Top-K sampling | Limits vocabulary to the K most likely tokens. |
| `.min_p(f32)` | Min-P sampling | Filters tokens below a probability threshold. |
| `.repeat_penalty(f32)` | Repetition control | Penalizes already-generated tokens. |
| `.stop_sequence(&str)` | Stop markers | Adds strings that terminate generation. |
§Agent Behavior & Loop
| Method | Description | Technical Detail |
|---|---|---|
| `.system_prompt(&str)` | Instructions | Defines the agent’s persona and mission. |
| `.max_iterations(usize)` | Iteration cap | Prevents runaway tool-use loops. |
| `.auto_approve()` | YOLO mode | Disables the permission system. |
| `.permission_callback(F)` | Human-in-the-loop | Callback for manual tool approval. |
| `.tool(Box<dyn Tool>)` | Custom tools | Registers your own tool implementations. |
| `.skip_builtin_tools()` | Minimal mode | Disables bash, read, write, etc. |
§Skills & Discovery
| Method | Description | Technical Detail |
|---|---|---|
| `.skills_path(PathBuf)` | Search paths | Adds directories for skill discovery. |
| `.activate_skill(&str)` | Specific skill | Explicitly loads a named skill. |
| `.no_skills()` | Disable skills | Prevents scanning for skill folders. |
| `.no_agents_md()` | Disable registry | Prevents scanning for AGENTS.md files. |
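
A minimal sketch combining the skill options above (builder methods as listed; the model-loading options from the standalone table are elided for brevity):

use std::path::PathBuf;
use llama_cpp_v3_agent_sdk::AgentBuilder;

// Discover skills under ./skills and activate one by name.
let agent = AgentBuilder::new()
    // ... backend/model options as in the Quick Start ...
    .skills_path(PathBuf::from("./skills"))
    .activate_skill("lotr-scene-generator")
    .build()?;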

§Documentation & Tutorials


§Quick Start (Standalone Agent)

use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder};
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("model.gguf")
        .n_gpu_layers(-1) // Offload all to GPU
        .system_prompt("You are a system administrator.")
        .auto_approve()
        .build()?;

    let response = agent.chat_simple("What is the disk usage in the current directory?")?;
    println!("Agent says: {}", response);
    Ok(())
}

§License

MIT


§Multi-Agent Workflow Guide (In-Depth Walkthrough)

Below is the complete technical walkthrough for building complex multi-agent pipelines.

§Case Study: Lord of the Rings - The Council of Elrond

This guide provides a comprehensive technical walkthrough for building a state-of-the-art narrative generation pipeline using the Skill-based Workflow system in llama-cpp-v3-agent-sdk. We will orchestrate a specialized “Pipeline of Experts” to generate high-stakes drama from The Lord of the Rings.


§1. Core Philosophy: The Pipeline of Experts

Traditional “Mega-Prompts” often suffer from Instruction Bleed—a phenomenon where the LLM, overwhelmed by too many constraints, begins to ignore formatting rules, skips character nuances, or forgets specific lore requirements.

llama-cpp-v3-agent-sdk solves this through Agentic Isolation. By splitting a complex task into multiple specialized steps, we ensure:

  • Focused Context: Each agent only sees the information it needs for its specific task.
  • Strict Formatting: Smaller, focused prompts yield far more reliable adherence to complex JSON schemas or Markdown structures.
  • VRAM Efficiency: Only one agent’s context is active at a time, allowing high-fidelity generation on consumer hardware.

§2. Anatomy of a “Skill”

A Skill is a self-contained module that encapsulates a specific workflow. This modularity allows developers to swap out “Writer” or “Critic” prompts without changing the application code.

§Skill Directory Structure

skills/lotr-scene-generator/
├── SKILL.md             # Metadata & Requirements
├── workflow.json        # Orchestration logic
├── prompts/             # Specialized agent instructions
│   ├── planner.md
│   ├── writer.md
│   └── critic.md
├── schemas/             # JSON validation schemas
│   └── review_report.json
└── references/          # Lore & Context
    └── lore_reference.md

§2.1 Skill Metadata (SKILL.md)

This file serves as the documentation for the skill, outlining its purpose and required input context.

---
name: lotr-scene-generator
description: Generates high-fidelity dramatic scenes set in Middle-earth, specifically tuned for the 'Council of Elrond' style of debate.
---

# Lord of the Rings: Council Scene Generator

This skill generates high-fidelity dramatic scenes set in Middle-earth. It is specifically tuned for the 'Council of Elrond' style of debate.

## Required Input Context
- `outline`: (String) A brief summary of the confrontation.
- `characters`: (Array) List of Tolkien characters to include.
- `lore_strictness`: (Number 0-1) How strictly to adhere to canon.

§2.2 The Planner Prompt (prompts/planner.md)

The Planner is responsible for structure. It must output clean JSON.

# Role: Middle-earth Scene Architect
You are an expert at narrative structure and Tolkien's storytelling patterns.

# Task
Deconstruct the provided scene outline into a detailed beat sheet.

# Requirements
1. Define 3-5 distinct emotional shifts.
2. Specify the "Lore Anchor" for this scene (e.g., the history of Isildur).
3. Identify the core conflict for each character.

# Output Format
You MUST output a valid JSON object following this structure:
{
  "beats": [{"description": "string", "emotion": "string"}],
  "lore_anchor": "string",
  "character_goals": {"character_name": "string"}
}

§2.3 The Writer Prompt (prompts/writer.md)

The Writer focuses on dialogue and prose. It receives the Planner’s JSON output.

# Role: Epic Fantasy Dramatist
You are a master of dialogue, subtext, and the specific voices of Middle-earth.

# Input Specification
You will receive a `beats` object from the Architect.

# Voice Guidelines
- Elrond: Ancient, weary but hopeful, authoritative.
- Boromir: Proud, desperate, uses military metaphors.
- Aragorn: Quietly noble, guarded, uses archaic but simple speech.

# Task
Write the full screenplay for the scene. Use the provided beats to drive the tension.

§2.4 The Critic Schema (schemas/review_report.json)

By providing a schema, you ensure the Critic’s feedback is actionable by the engine.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "overall_score": { "type": "number", "minimum": 0, "maximum": 1 },
    "lore_errors_found": { "type": "boolean" },
    "voice_consistency": { "type": "string" },
    "must_fix_notes": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["overall_score", "lore_errors_found", "must_fix_notes"]
}
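
Schema enforcement happens inside the engine; if you want to sanity-check a Critic report in your own code, a minimal hand-rolled check mirroring the schema’s required fields could look like this (illustrative, not an SDK API):

use serde_json::Value;

// Checks the `required` fields and their basic types (sketch only).
fn is_valid_report(report: &Value) -> bool {
    report.get("overall_score").map_or(false, Value::is_number)
        && report.get("lore_errors_found").map_or(false, Value::is_boolean)
        && report.get("must_fix_notes").map_or(false, Value::is_array)
}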

§2.5 Lore Reference (references/lore_reference.md)

Static files included in the skill to ground the agents in your specific world.

# The Nature of the One Ring
- It cannot be used for good, no matter the intent.
- It corrupts through the user's desire to do good (e.g., Boromir's desire to save Gondor).
- Only in the fires of Mount Doom can it be unmade.

§3. The Orchestration Logic (workflow.json)

The workflow.json file is the declarative manifest of your pipeline. It manages data dependencies and control flow.

{
  "name": "Middle-earth Narrative Pipeline",
  "steps": [
    {
      "name": "planner",
      "description": "Constructing scene architecture for Rivendell...",
      "agent_prompt": "prompts/planner.md",
      "temperature": 0.2,
      "output_type": "json"
    },
    {
      "name": "writer",
      "description": "Drafting the Council of Elrond screenplay...",
      "agent_prompt": "prompts/writer.md",
      "temperature": 0.8,
      "stop_sequences": ["# END OF SCENE"],
      "input_mapping": {
        "beats": "planner"
      }
    },
    {
      "name": "critic",
      "description": "Validating Ring-lore and character voices...",
      "agent_prompt": "prompts/critic.md",
      "temperature": 0.1,
      "output_type": "json",
      "input_mapping": {
        "initial_outline": "outline",
        "draft": "writer"
      }
    },
    {
      "name": "rewrite",
      "description": "Refining the dialogue based on Critic's lore check...",
      "agent_prompt": "prompts/rewrite.md",
      "conditional": "critic.lore_errors_found",
      "input_mapping": {
        "original_draft": "writer",
        "lore_report": "critic"
      }
    }
  ]
}

§In-Depth Feature Explanations

§A. Input Mapping (input_mapping)

This is the “wiring” of your pipeline. By default, every step receives the initial global context. However, input_mapping allows you to inject results from previous steps into specific keys.

  • Example: The writer step above will receive a JSON object with a key "beats" containing the result of the planner step. The engine handles the lookup and merging automatically (a sketch of this resolution follows below).
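
The resolution behavior can be pictured roughly as follows (a sketch, not the engine’s actual code):

use serde_json::{Map, Value};
use std::collections::HashMap;

// Start from the global context, then overlay artifacts from mapped steps.
fn resolve_inputs(
    global: &Map<String, Value>,
    artifacts: &HashMap<String, Value>,
    mapping: &[(&str, &str)], // (input key, source step or context key)
) -> Map<String, Value> {
    let mut resolved = global.clone();
    for (key, source) in mapping {
        if let Some(artifact) = artifacts.get(*source) {
            resolved.insert((*key).to_string(), artifact.clone());
        } else if let Some(value) = global.get(*source) {
            resolved.insert((*key).to_string(), value.clone());
        }
    }
    resolved
}
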
§B. Conditional Execution (conditional)

The engine uses dot-notation (e.g., critic.lore_errors_found) to determine if a step should run.

  1. The engine fetches the "critic" result.
  2. It attempts to parse it as JSON.
  3. It checks the value of "lore_errors_found". If false, the rewrite step is entirely bypassed (see the lookup sketch below).
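
A rough equivalent of this lookup in serde_json terms (a sketch; the engine’s handling of missing keys or non-boolean values may differ):

use serde_json::Value;
use std::collections::HashMap;

// Evaluate a "step.field" path against stored step artifacts.
fn condition_holds(artifacts: &HashMap<String, Value>, path: &str) -> bool {
    let mut parts = path.split('.');
    let step = match parts.next() { Some(s) => s, None => return false };
    let mut node = match artifacts.get(step) { Some(v) => v, None => return false };
    for key in parts {
        node = match node.get(key) { Some(v) => v, None => return false };
    }
    node.as_bool().unwrap_or(false)
}
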
§C. Output Types

Setting output_type: "json" triggers an internal “Sanitization Pass.” The engine extracts the {...} block from the agent’s output, cleans control characters, and parses it. This ensures that downstream agents receive clean data objects rather than raw strings with conversational filler.
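
Approximately, the pass behaves like the following sketch (an approximation of the described behavior, not the engine’s exact code):

use serde_json::Value;

// Extract the outermost {...} span, strip stray control characters, parse.
fn sanitize_json(raw: &str) -> Option<Value> {
    let start = raw.find('{')?;
    let end = raw.rfind('}')?;
    if end < start {
        return None;
    }
    let cleaned: String = raw[start..=end]
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();
    serde_json::from_str(&cleaned).ok()
}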


§4. State Persistence: Implementing WorkflowStorage

For a professional system, losing progress due to a network error or crash is unacceptable. The WorkflowEngine uses a Stateless Engine + Stateful Storage pattern.

By implementing WorkflowStorage (e.g., using SQLite), you enable Stateful Resumption.

§Implementation Example (SQLite)

use llama_cpp_v3_agent_sdk::workflow::{WorkflowStorage, Result};
use rusqlite::{params, Connection};
use std::collections::HashMap;

// A storage backend built on `rusqlite` (add it to your Cargo.toml).
struct SqliteArchive {
    conn: Connection,
}

impl WorkflowStorage for SqliteArchive {
    fn insert_artifact(&self, session_id: &str, artifact_type: &str, content: &str, is_json: bool) -> Result<()> {
        // Record the artifact; 'artifact_type' maps to the 'step.name'
        // in workflow.json. (Conversion from rusqlite::Error into this
        // Result type is assumed to exist.)
        self.conn.execute(
            "INSERT INTO artifacts (session_id, step_name, content) VALUES (?1, ?2, ?3)",
            params![session_id, artifact_type, content]
        )?;
        Ok(())
    }

    fn get_latest_artifacts(&self, session_id: &str) -> Result<HashMap<String, String>> {
        // Load the most recent artifacts for this session.
        // The engine uses this to populate context for resumed runs.
        let mut results = HashMap::new();
        // ... fetch from DB ...
        Ok(results)
    }
}

§5. Execution & Lifecycle Management

The WorkflowEngine manages the VRAM lifecycle of agents. To keep memory usage low, it follows a Build-Run-Drop pattern:

  1. Build: Constructs an Agent using the shared InferenceEngine and InferenceScheduler.
  2. Run: Executes the inference and streams tokens.
  3. Drop: The agent and its associated LlamaContext are dropped immediately after completion, freeing VRAM for the next step.

§Running the Council of Elrond Pipeline

use llama_cpp_v3_agent_sdk::workflow::{WorkflowEngine, PipelineEvent};
use serde_json::json;
use std::io::{self, Write};

let engine = WorkflowEngine::new(inference_engine, scheduler, my_storage, skills_path);

// 1. Define the scene requirements
let context = json!({
    "outline": "Boromir demands the Ring for the defense of Gondor. Aragorn reveals himself as the Heir of Isildur.",
    "tone": "Grandiose and Tense"
});

// 2. Start the orchestrated run
let results = engine.run(
    "lotr-generator", 
    "council-session-001", 
    context, 
    None, // resume_from_step: used to restart from a specific failed point
    vec![], // force_regenerate: used to ignore cache and redo specific steps
    |event| match event {
        PipelineEvent::StepStarted { name, .. } => {
            println!("\n[PHASE: {}]", name.to_uppercase());
        },
        PipelineEvent::Token { token, .. } => {
            print!("{}", token);
            io::stdout().flush().unwrap();
        },
        PipelineEvent::Processing { message, .. } => {
            println!("\n[ENGINE]: {}", message);
        },
        _ => {}
    }
).await?;

§6. Advanced: Output Post-Processing

Sometimes agents generate syntax that is slightly off (e.g., they might use (Action) instead of the required [ACTION] tag). You can define Post-Processing Rules in your skill:

§skills/lotr-scene-generator/schemas/post_process.json

{
  "rules": [
    {
      "pattern": "\\((.*?)\\)",
      "replacement": "[ACTION: $1]"
    }
  ]
}

The engine automatically applies these regex-based rules to the writer and rewrite steps before persisting the results. This ensures your final data is always compliant with your application’s requirements.
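
For reference, such a rule behaves like an ordinary regex substitution, e.g. with the regex crate (the engine’s internal implementation is not exposed):

use regex::Regex;

fn main() {
    // Same pattern/replacement as the rule above.
    let re = Regex::new(r"\((.*?)\)").unwrap();
    let fixed = re.replace_all("(draws his sword) Boromir rises.", "[ACTION: $1]");
    assert_eq!(fixed, "[ACTION: draws his sword] Boromir rises.");
}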


§7. Developer Best Practices

  1. Low Temperature for Logic: Set temperature: 0.1 or 0.2 for planner and critic steps to ensure deterministic and logical results.
  2. High Temperature for Prose: Set temperature: 0.8 for the writer to allow for varied and creative dialogue.
  3. Schema Validation: Always use output_type: "json" for steps that drive logic (like the critic), as this allows for robust conditional branching.
  4. Session Isolation: Use unique session_ids for every generation request to prevent data corruption in the storage layer.

§Re-exports

pub use agent::Agent;
pub use agent::AgentBuilder;
pub use agent_loop::AgentEvent;
pub use agent_loop::AgentLoopConfig;
pub use agent_loop::CompletionReason;
pub use agent_loop::KvCacheState;
pub use agents_md::AgentsMdFile;
pub use agents_md::AgentsMdRegistry;
pub use conversation::Conversation;
pub use conversation::Message;
pub use conversation::Role;
pub use error::AgentError;
pub use inference::templates;
pub use inference::InferenceConfig;
pub use inference::InferenceEngine;
pub use inference::InferenceScheduler;
pub use permission::PermissionDecision;
pub use permission::PermissionMode;
pub use permission::PermissionRequest;
pub use permission::PermissionTracker;
pub use skills::Skill;
pub use skills::SkillMeta;
pub use skills::SkillRegistry;
pub use tool::Tool;
pub use tool::ToolCall;
pub use tool::ToolRegistry;
pub use tool::ToolResult;
pub use workflow::*;

§Modules

agent
agent_loop
agents_md
AGENTS.md discovery and loading.
conversation
error
inference
Shared inference engine — load a model once, share it across agents.
permission
skills
Skill discovery and loading.
tool
tools
workflow