# llama-cpp-v3-agent-sdk (v0.1.7)

Agentic tool-use loop and multi-agent orchestration on top of [llama-cpp-v3](../llama-cpp-v3/). Built for developers who need high-performance, local LLM agents with predictable state management and VRAM efficiency.

## Features

- 🤖 **Agentic Loop**: Generate → parse tool calls → execute → repeat.
- 🔧 **Native Tooling**: High-fidelity `bash`, `read`, `write`, `edit`, and `glob` implementations.
- 🔒 **Granular Permissions**: Callback-based system for human-in-the-loop tool approval.
- 🏗️ **Memory-Mapped Inference**: Shared model weights across multiple agents via `InferenceEngine`.
- 🌊 **Workflow Orchestration**: Define complex multi-agent pipelines in JSON with dynamic input mapping and conditional execution.
- 📡 **Reactive Events**: Stream text deltas, tool lifecycle events, and pipeline progress.

---

## Technical Architecture

### 1. VRAM Management & Context Pooling
Unlike traditional wrappers that initialize a full context per agent, `llama-cpp-v3-agent-sdk` uses an `InferenceScheduler` to manage physical memory. 
- **Shared Engine**: Multiple agents can share a single `InferenceEngine` instance, loading the model weights into VRAM only once.
- **Context Pooling**: Agents do not own a `LlamaContext` by default. Instead, they request a temporary permit from a pool.
- **Serialization**: Use `InferenceScheduler::new(1)` to ensure only one agent runs inference at a time, allowing multiple complex agents to run on consumer hardware with limited VRAM.
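Under these assumptions, sharing one engine across two agents might look like the following sketch. Only `.engine(...)`, `.scheduler(...)`, and `InferenceScheduler::new(1)` appear elsewhere in this README; the loading call (`InferenceEngine::load`) and the rest of the wiring are illustrative, not a confirmed API.

```rust
use std::sync::Arc;
use llama_cpp_v3_agent_sdk::{AgentBuilder, InferenceEngine, InferenceScheduler};

fn main() -> anyhow::Result<()> {
    // Load weights into VRAM once. `InferenceEngine::load` is an assumed name.
    let engine = Arc::new(InferenceEngine::load("model.gguf")?);
    // One permit: only one agent may run inference at any moment.
    let scheduler = Arc::new(InferenceScheduler::new(1));

    let mut researcher = AgentBuilder::new()
        .engine(Arc::clone(&engine))
        .scheduler(Arc::clone(&scheduler))
        .system_prompt("You research topics.")
        .build()?;

    let mut writer = AgentBuilder::new()
        .engine(engine)
        .scheduler(scheduler)
        .system_prompt("You write summaries.")
        .build()?;

    // Both agents now share a single copy of the model weights in VRAM.
    Ok(())
}
```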

### 2. Multi-Agent Workflow Engine
The `WorkflowEngine` is a stateless runner designed to orchestrate sequences of agents. It decouples the **execution logic** from **state persistence**.

- **Input Resolution (`input_mapping`)**: Steps define data dependencies. The engine resolves keys from the initial context and overrides them with artifacts from named previous steps. JSON artifacts are automatically parsed into structured objects for prompts.
- **Conditional Execution**: Steps can be skipped dynamically based on previous outputs, using dot-notation (e.g., `step_name.field_name`) to read fields from a dependency's JSON result.
- **Workflow Storage**: To maintain state, implement the `WorkflowStorage` trait. This enables **Workflow Resumption**, skipping already-completed steps if a pipeline is restarted with the same `session_id`.
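The exact workflow schema is not specified in this README; as a rough sketch, a two-step pipeline using `input_mapping` and a dot-notation condition might be shaped like this (every field name other than `input_mapping` is illustrative):

```json
{
  "steps": [
    {
      "name": "outline",
      "agent": "planner",
      "input_mapping": { "topic": "context.topic" }
    },
    {
      "name": "draft",
      "agent": "writer",
      "input_mapping": { "outline": "outline" },
      "condition": "outline.approved"
    }
  ]
}
```

Here the second step reads the `outline` artifact produced by the first, and runs only if the parsed JSON result of `outline` contains a truthy `approved` field.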

### 3. Conversation & Message Compaction
The `Conversation` struct manages message history. When the context limit (`n_ctx`) is reached, the agent employs a compaction strategy to summarize or prune older messages while preserving the system prompt and critical context.
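The SDK's actual compaction strategy is not shown here, but the core idea can be sketched with a stand-in message type: estimate token usage, then drop the oldest turns while pinning the system prompt.

```rust
// Illustrative compaction sketch. `Message` is a stand-in type,
// not the SDK's actual struct.
#[derive(Clone, Debug, PartialEq)]
struct Message {
    role: String,    // "system", "user", "assistant", or "tool"
    content: String,
}

/// Prune the oldest non-system messages until the estimated token
/// count fits the budget. Rough estimate: ~4 characters per token.
fn compact(history: &mut Vec<Message>, token_budget: usize) {
    let estimate = |m: &Message| m.content.len() / 4 + 1;
    let mut total: usize = history.iter().map(estimate).sum();
    // Index 0 is assumed to hold the system prompt; never remove it.
    while total > token_budget && history.len() > 2 {
        let removed = history.remove(1);
        total -= estimate(&removed);
    }
}

fn main() {
    let mut history = vec![
        Message { role: "system".into(), content: "You are helpful.".into() },
        Message { role: "user".into(), content: "x".repeat(400) },
        Message { role: "user".into(), content: "recent question".into() },
    ];
    compact(&mut history, 30);
    assert_eq!(history[0].role, "system"); // system prompt preserved
    assert_eq!(history.len(), 2);          // oldest user turn pruned
    println!("{} messages kept", history.len());
}
```

A real implementation would likely summarize the pruned turns into a synthetic message rather than discard them outright, but the invariant is the same: the system prompt and the most recent context survive.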

### 4. Tool-Use Loop Mechanics
The agent follows a rigorous cycle:
1. **Generate**: The LLM generates text until it produces a `<tool_call>` block or hits a stop sequence.
2. **Parse**: The engine parses the block using a JSON schema for the requested tool.
3. **Execute**: The tool's `execute` method is called (subject to permissions).
4. **Repeat**: The tool output is appended as a `tool` message, and the cycle repeats until a final answer is produced.
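The parse step (2) can be illustrated with a minimal, self-contained extractor that pulls the payload out of a `<tool_call>` block; the real engine additionally validates the payload against the requested tool's JSON schema.

```rust
// Minimal sketch of the parse step: extract the text between
// <tool_call> and </tool_call> from raw model output.
fn extract_tool_call(output: &str) -> Option<&str> {
    let start = output.find("<tool_call>")? + "<tool_call>".len();
    let end = output[start..].find("</tool_call>")? + start;
    Some(output[start..end].trim())
}

fn main() {
    let raw = r#"Let me check.
<tool_call>{"name": "bash", "arguments": {"command": "du -sh ."}}</tool_call>"#;
    let call = extract_tool_call(raw).expect("no tool call found");
    assert!(call.starts_with('{') && call.contains("\"bash\""));
    println!("parsed: {call}");
}
```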

---

## Configuration Reference

### `AgentBuilder` API

#### Model Loading (Standalone Mode)
| Method | Description | Technical Detail |
|:---|:---|:---|
| `.backend(Backend)` | Set compute target | Supports CUDA, Vulkan, SYCL, and CPU. |
| `.model_path(&str)` | GGUF model path | Path to the weights file. |
| `.n_gpu_layers(i32)` | GPU offloading | -1 for all, 0 for none. |
| `.chat_template(&str)` | Custom template | Jinja template for prompt formatting. |
| `.app_name(&str)` | App identifier | Used for cache directory naming. |
| `.cache_dir(PathBuf)` | DLL cache | Where runtime DLLs are downloaded/stored. |
| `.explicit_dll_path(PathBuf)`| Bypass download | Direct path to `llama.dll`. |
| `.dll_version(&str)` | Version pinning | Specific llama.cpp version to download. |

#### Shared Engine Mode
| Method | Description | Technical Detail |
|:---|:---|:---|
| `.engine(Arc<InferenceEngine>)` | Shared weights | Injects an already loaded model. |
| `.scheduler(Arc<Scheduler>)` | Context pooling | Required for high-density agent deployments. |

#### Inference & Sampling
| Method | Description | Technical Detail |
|:---|:---|:---|
| `.n_ctx(u32)` | Context size | Maximum tokens (history + generation). |
| `.temperature(f32)` | Creativity | Higher = more random; lower = more deterministic. |
| `.max_tokens_per_completion(usize)` | Token limit | Hard cap on a single model call. |
| `.top_k(i32)` | Top-K sampling | Limits vocabulary to K most likely tokens. |
| `.min_p(f32)` | Min-P sampling | Filters tokens based on probability threshold. |
| `.repeat_penalty(f32)`| Repetition control | Penalizes already generated tokens. |
| `.stop_sequence(&str)`| Stop markers | Add strings that terminate generation. |

#### Agent Behavior & Loop
| Method | Description | Technical Detail |
|:---|:---|:---|
| `.system_prompt(&str)`| Instructions | Defines the agent's persona and mission. |
| `.max_iterations(usize)`| Iteration cap | Prevents runaway tool-use loops. |
| `.auto_approve()` | YOLO mode | Disables the permission system. |
| `.permission_callback(F)`| Human-in-the-loop | Callback for manual tool approval. |
| `.tool(Box<dyn Tool>)`| Custom tools | Register your own tool implementations. |
| `.skip_builtin_tools()`| Minimal mode | Disables bash, read, write, etc. |
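A sketch of human-in-the-loop approval with `.permission_callback(F)`. The callback's exact signature is not documented here; this fragment assumes it receives the tool name and its serialized arguments and returns whether to allow the call.

```rust
let mut agent = AgentBuilder::new()
    .model_path("model.gguf")
    // Auto-approve read-only tools; deny everything else.
    // Signature is an assumption, not the confirmed API.
    .permission_callback(|tool: &str, _args: &str| {
        matches!(tool, "read" | "glob")
    })
    .build()?;
```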

#### Skills & Discovery
| Method | Description | Technical Detail |
|:---|:---|:---|
| `.skills_path(PathBuf)`| Search paths | Adds directories for skill discovery. |
| `.activate_skill(&str)`| Specific skill | Explicitly loads a named skill. |
| `.no_skills()` | Disable skills | Prevents scanning for skill folders. |
| `.no_agents_md()` | Disable registry | Prevents scanning for `AGENTS.md` files. |

---

## Documentation & Tutorials

- [Case Study: Drama Scene Generation](./DOCS_DRAMA_WORKFLOW.md) - A complete guide to building a multi-agent storytelling pipeline.
- [Custom Tools Guide](./src/tool.rs) - Learn how to implement the `Tool` trait.
- [Skill Development](./src/skills.rs) - How to package prompts and workflows into reusable "Skills".

---

## Quick Start (Standalone Agent)

```rust
use llama_cpp_v3_agent_sdk::{Agent, AgentBuilder};
use llama_cpp_v3::backend::Backend;

fn main() -> anyhow::Result<()> {
    let mut agent = AgentBuilder::new()
        .backend(Backend::Vulkan)
        .model_path("model.gguf")
        .n_gpu_layers(-1) // Offload all to GPU
        .system_prompt("You are a system administrator.")
        .auto_approve()
        .build()?;

    let response = agent.chat_simple("What is the disk usage in the current directory?")?;
    println!("Agent says: {}", response);
    Ok(())
}
```

## License
MIT