llama-cpp-v3-agent-sdk 0.1.7

Agentic tool-use loop on top of llama-cpp-v3 — local LLM agents with built-in tools
# Case Study: Lord of the Rings - The Council of Elrond

This guide provides a comprehensive technical walkthrough for building a state-of-the-art narrative generation pipeline using the **Skill-based Workflow** system in `llama-cpp-v3-agent-sdk`. We will orchestrate a specialized "Pipeline of Experts" to generate high-stakes drama from *The Lord of the Rings*.

---

## 1. Core Philosophy: The Pipeline of Experts

Traditional "Mega-Prompts" often suffer from **Instruction Bleed**—a phenomenon where the LLM, overwhelmed by too many constraints, begins to ignore formatting rules, skips character nuances, or forgets specific lore requirements. 

`llama-cpp-v3-agent-sdk` solves this through **Agentic Isolation**. By splitting a complex task into multiple specialized steps, we ensure:
- **Focused Context**: Each agent only sees the information it needs for its specific task.
- **Strict Formatting**: Smaller, focused prompts yield far more reliable adherence to complex JSON schemas or Markdown structures.
- **VRAM Efficiency**: Only one agent's context is active at a time, allowing high-fidelity generation on consumer hardware.

---

## 2. Anatomy of a "Skill"

A **Skill** is a self-contained module that encapsulates a specific workflow. This modularity allows developers to swap out "Writer" or "Critic" prompts without changing the application code.

### Skill Directory Structure
```text
skills/lotr-scene-generator/
├── SKILL.md             # Metadata & Requirements
├── workflow.json        # Orchestration logic
├── prompts/             # Specialized agent instructions
│   ├── planner.md
│   ├── writer.md
│   ├── critic.md
│   └── rewrite.md
├── schemas/             # JSON validation schemas
│   ├── review_report.json
│   └── post_process.json
└── references/          # Lore & Context
    └── lore_reference.md
```

### 2.1 Skill Metadata (`SKILL.md`)
This file serves as the documentation for the skill, outlining its purpose and required input context.

```markdown
---
name: lotr-scene-generator
description: Generates high-fidelity dramatic scenes set in Middle-earth, tuned for 'Council of Elrond'-style debates.
---

# Lord of the Rings: Council Scene Generator

This skill generates high-fidelity dramatic scenes set in Middle-earth. It is specifically tuned for the 'Council of Elrond' style of debate.

## Required Input Context
- `outline`: (String) A brief summary of the confrontation.
- `characters`: (Array) List of Tolkien characters to include.
- `lore_strictness`: (Number 0-1) How strictly to adhere to canon.
```
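A caller satisfies this contract with a context object such as the following (the values are illustrative, not canonical defaults):

```json
{
  "outline": "Boromir demands the Ring for the defense of Gondor.",
  "characters": ["Elrond", "Boromir", "Aragorn", "Gandalf"],
  "lore_strictness": 0.9
}
```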

### 2.2 The Planner Prompt (`prompts/planner.md`)
The Planner is responsible for structure. It must output clean JSON.

```markdown
# Role: Middle-earth Scene Architect
You are an expert at narrative structure and Tolkien's storytelling patterns.

# Task
Deconstruct the provided scene outline into a detailed beat sheet.

# Requirements
1. Define 3-5 distinct emotional shifts.
2. Specify the "Lore Anchor" for this scene (e.g., the history of Isildur).
3. Identify the core conflict for each character.

# Output Format
You MUST output a valid JSON object following this structure:
{
  "beats": [{"description": "string", "emotion": "string"}],
  "lore_anchor": "string",
  "character_goals": {"character_name": "string"}
}
```
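For illustration, here is a hypothetical beat sheet the Planner might emit for the Council confrontation. The field values are invented, but the shape matches the structure the prompt mandates:

```json
{
  "beats": [
    {"description": "Boromir argues the Ring should arm Gondor", "emotion": "desperate pride"},
    {"description": "Elrond recounts Isildur's failure at Mount Doom", "emotion": "grave warning"},
    {"description": "Aragorn reveals his lineage and claim", "emotion": "quiet resolve"}
  ],
  "lore_anchor": "The history of Isildur and the Ring's corruption of good intent",
  "character_goals": {
    "Boromir": "Secure the Ring as a weapon for Gondor",
    "Elrond": "Convince the Council the Ring must be destroyed",
    "Aragorn": "Assert his legitimacy without fracturing the alliance"
  }
}
```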

### 2.3 The Writer Prompt (`prompts/writer.md`)
The Writer focuses on dialogue and prose. It receives the Planner's JSON output.

```markdown
# Role: Epic Fantasy Dramatist
You are a master of dialogue, subtext, and the specific voices of Middle-earth.

# Input Specification
You will receive a `beats` object from the Architect.

# Voice Guidelines
- Elrond: Ancient, weary but hopeful, authoritative.
- Boromir: Proud, desperate, uses military metaphors.
- Aragorn: Quietly noble, guarded, uses archaic but simple speech.

# Task
Write the full screenplay for the scene. Use the provided beats to drive the tension.
```

### 2.4 The Critic Schema (`schemas/review_report.json`)
By providing a schema, you ensure the Critic's feedback is actionable by the engine.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "overall_score": { "type": "number", "minimum": 0, "maximum": 1 },
    "lore_errors_found": { "type": "boolean" },
    "voice_consistency": { "type": "string" },
    "must_fix_notes": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["overall_score", "lore_errors_found", "must_fix_notes"]
}
```
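A conforming report might look like this (values invented for illustration). Note that `lore_errors_found` is the field the workflow's conditional `rewrite` step keys on:

```json
{
  "overall_score": 0.62,
  "lore_errors_found": true,
  "voice_consistency": "Boromir drifts into modern idiom in the second exchange.",
  "must_fix_notes": [
    "Boromir proposes wielding the Ring 'for good' without pushback; the lore reference forbids this framing.",
    "Aragorn's claim should invoke Isildur, not Elendil alone."
  ]
}
```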

### 2.5 Lore Reference (`references/lore_reference.md`)
These static files, bundled with the skill, ground the agents in your specific world.

```markdown
# The Nature of the One Ring
- It cannot be used for good, no matter the intent.
- It corrupts through the user's desire to do good (e.g., Boromir's desire to save Gondor).
- Only in the fires of Mount Doom can it be unmade.
```

---

## 3. The Orchestration Logic (`workflow.json`)

The `workflow.json` file is the declarative manifest of your pipeline. It manages data dependencies and control flow.

```json
{
  "name": "Middle-earth Narrative Pipeline",
  "steps": [
    {
      "name": "planner",
      "description": "Constructing scene architecture for Rivendell...",
      "agent_prompt": "prompts/planner.md",
      "temperature": 0.2,
      "output_type": "json"
    },
    {
      "name": "writer",
      "description": "Drafting the Council of Elrond screenplay...",
      "agent_prompt": "prompts/writer.md",
      "temperature": 0.8,
      "stop_sequences": ["# END OF SCENE"],
      "input_mapping": {
        "beats": "planner"
      }
    },
    {
      "name": "critic",
      "description": "Validating Ring-lore and character voices...",
      "agent_prompt": "prompts/critic.md",
      "temperature": 0.1,
      "output_type": "json",
      "input_mapping": {
        "initial_outline": "outline",
        "draft": "writer"
      }
    },
    {
      "name": "rewrite",
      "description": "Refining the dialogue based on Critic's lore check...",
      "agent_prompt": "prompts/rewrite.md",
      "conditional": "critic.lore_errors_found",
      "input_mapping": {
        "original_draft": "writer",
        "lore_report": "critic"
      }
    }
  ]
}
```

### In-Depth Feature Explanations

#### A. Input Mapping (`input_mapping`)
This is the "wiring" of your pipeline. By default, every step receives the initial global context. However, `input_mapping` allows you to inject results from previous steps into specific keys.
- **Example**: The `writer` step above will receive a JSON object with a key `"beats"` containing the result of the `planner` step. The engine handles the lookup and merging automatically.
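Conceptually, if the global context contains `outline` and `tone`, the payload assembled for the `writer` step has this shape (field contents elided for brevity):

```json
{
  "outline": "Boromir demands the Ring for the defense of Gondor.",
  "tone": "Grandiose and Tense",
  "beats": {
    "beats": [{"description": "…", "emotion": "…"}],
    "lore_anchor": "…",
    "character_goals": {"Boromir": "…"}
  }
}
```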

#### B. Conditional Execution (`conditional`)
The engine uses dot-notation (e.g., `critic.lore_errors_found`) to determine if a step should run.
1. The engine fetches the `"critic"` result.
2. It attempts to parse it as JSON.
3. It checks the value of `"lore_errors_found"`. If `false`, the `rewrite` step is entirely bypassed.
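The lookup can be sketched in plain Rust. This toy version scans the stored raw output for a literal boolean field rather than parsing JSON properly (a real engine would use a JSON parser), and it treats a missing step or field as `false`:

```rust
use std::collections::HashMap;

/// Toy evaluation of a dot-notation conditional such as "critic.lore_errors_found".
/// SIMPLIFICATION: instead of parsing the stored JSON, we scan the raw step
/// output for the literal text `"field": true`, which is whitespace-sensitive.
/// A missing step or field evaluates to false, so the dependent step is skipped.
fn should_run(conditional: &str, results: &HashMap<String, String>) -> bool {
    let Some((step, field)) = conditional.split_once('.') else {
        return false;
    };
    results
        .get(step)
        .map(|json| json.contains(&format!("\"{}\": true", field)))
        .unwrap_or(false)
}

fn main() {
    let mut results = HashMap::new();
    results.insert(
        "critic".to_string(),
        r#"{"overall_score": 0.7, "lore_errors_found": true, "must_fix_notes": []}"#.to_string(),
    );
    // The rewrite step runs only when the critic flagged lore errors.
    println!("run rewrite: {}", should_run("critic.lore_errors_found", &results));
}
```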

#### C. Output Types
Setting `output_type: "json"` triggers an internal "Sanitization Pass." The engine extracts the `{...}` block from the agent's output, cleans control characters, and parses it. This ensures that downstream agents receive clean data objects rather than raw strings with conversational filler.
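The extraction half of that pass can be approximated in a few lines: take everything from the first `{` to the last `}` (the engine's real pass also strips control characters, which this sketch omits):

```rust
/// Extract the outermost {...} block from a raw model reply, dropping any
/// conversational filler before or after it. Returns None when no complete
/// object is present.
fn extract_json_block(raw: &str) -> Option<&str> {
    let start = raw.find('{')?;
    let end = raw.rfind('}')?;
    if end < start {
        return None;
    }
    Some(&raw[start..=end])
}

fn main() {
    let reply = "Certainly! Here is the plan:\n{\"beats\": []}\nHope that helps.";
    println!("{:?}", extract_json_block(reply));
}
```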

---

## 4. State Persistence: Implementing `WorkflowStorage`

In a production system, losing progress to a network error or crash is unacceptable. The `WorkflowEngine` uses a **Stateless Engine + Stateful Storage** pattern.

By implementing `WorkflowStorage` (e.g., using SQLite), you enable **Stateful Resumption**.

### Implementation Example (SQLite)
```rust
use std::collections::HashMap;

use llama_cpp_v3_agent_sdk::workflow::{Result, WorkflowStorage};
use rusqlite::{params, Connection};

/// SQLite-backed archive. Assumes an `artifacts` table with columns
/// (id INTEGER PRIMARY KEY, session_id TEXT, step_name TEXT, content TEXT).
pub struct SqliteArchive {
    conn: Connection,
}

impl WorkflowStorage for SqliteArchive {
    fn insert_artifact(&self, session_id: &str, artifact_type: &str, content: &str, is_json: bool) -> Result<()> {
        // Record the artifact. 'artifact_type' maps to 'step.name' in
        // workflow.json; 'is_json' flags structured outputs (unused here).
        self.conn.execute(
            "INSERT INTO artifacts (session_id, step_name, content) VALUES (?1, ?2, ?3)",
            params![session_id, artifact_type, content],
        )?;
        Ok(())
    }

    fn get_latest_artifacts(&self, session_id: &str) -> Result<HashMap<String, String>> {
        // Load the most recent artifact per step for this session.
        // The engine uses this to populate context for resumed runs.
        let mut stmt = self.conn.prepare(
            "SELECT step_name, content FROM artifacts WHERE session_id = ?1 ORDER BY id",
        )?;
        let rows = stmt.query_map(params![session_id], |row| {
            Ok((row.get::<_, String>(0)?, row.get::<_, String>(1)?))
        })?;
        let mut results = HashMap::new();
        for row in rows {
            let (step, content) = row?;
            // Later rows overwrite earlier ones, leaving the latest per step.
            results.insert(step, content);
        }
        Ok(results)
    }
}
```

---

## 5. Execution & Lifecycle Management

The `WorkflowEngine` manages the VRAM lifecycle of agents. To keep memory usage low, it follows a **Build-Run-Drop** pattern:
1. **Build**: Constructs an `Agent` using the shared `InferenceEngine` and `InferenceScheduler`.
2. **Run**: Executes the inference and streams tokens.
3. **Drop**: The agent and its associated `LlamaContext` are dropped immediately after completion, freeing VRAM for the next step.

### Running the Council of Elrond Pipeline
```rust
use std::io::{self, Write};

use llama_cpp_v3_agent_sdk::workflow::{PipelineEvent, WorkflowEngine};
use serde_json::json;

let engine = WorkflowEngine::new(inference_engine, scheduler, my_storage, skills_path);

// 1. Define the scene requirements
let context = json!({
    "outline": "Boromir demands the Ring for the defense of Gondor. Aragorn reveals himself as the Heir of Isildur.",
    "tone": "Grandiose and Tense"
});

// 2. Start the orchestrated run
let results = engine.run(
    "lotr-scene-generator",
    "council-session-001", 
    context, 
    None, // resume_from_step: used to restart from a specific failed point
    vec![], // force_regenerate: used to ignore cache and redo specific steps
    |event| match event {
        PipelineEvent::StepStarted { name, .. } => {
            println!("\n[PHASE: {}]", name.to_uppercase());
        },
        PipelineEvent::Token { token, .. } => {
            print!("{}", token);
            io::stdout().flush().unwrap();
        },
        PipelineEvent::Processing { message, .. } => {
            println!("\n[ENGINE]: {}", message);
        },
        _ => {}
    }
).await?;
```

---

## 6. Advanced: Output Post-Processing

Sometimes agents generate syntax that is slightly off (e.g., they might use `(Action)` instead of the required `[ACTION]` tag). You can define **Post-Processing Rules** in your skill:

### `skills/lotr-scene-generator/schemas/post_process.json`
```json
{
  "rules": [
    {
      "pattern": "\\((.*?)\\)",
      "replacement": "[ACTION: $1]"
    }
  ]
}
```
The engine automatically applies these regex-based rules to the `writer` and `rewrite` steps before persisting the results. This ensures your final data is always compliant with your application's requirements.
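The single rule above can be mimicked in plain Rust. This hand-rolled sketch handles only flat, non-nested `(...)` spans; the engine itself compiles the `pattern`/`replacement` pair from the JSON generically with a regex library:

```rust
/// Hand-rolled stand-in for the "(...)" -> "[ACTION: ...]" rule above.
/// SIMPLIFICATION: only flat, non-nested parentheses are handled; an
/// unmatched '(' leaves the remainder of the text untouched.
fn apply_action_rule(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(open) = rest.find('(') {
        match rest[open + 1..].find(')') {
            Some(rel) => {
                let close = open + 1 + rel;
                out.push_str(&rest[..open]);          // text before the parenthetical
                out.push_str("[ACTION: ");
                out.push_str(&rest[open + 1..close]); // the captured group
                out.push(']');
                rest = &rest[close + 1..];
            }
            None => break, // unmatched '(': stop rewriting
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let line = "BOROMIR (rising) It is a gift!";
    println!("{}", apply_action_rule(line));
}
```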

---

## 7. Developer Best Practices

1. **Low Temperature for Logic**: Set `temperature: 0.1` or `0.2` for `planner` and `critic` steps to keep results consistent and near-deterministic.
2. **High Temperature for Prose**: Set `temperature: 0.8` for the `writer` to allow for varied and creative dialogue.
3. **Schema Validation**: Always use `output_type: "json"` for steps that drive logic (like the `critic`), as this allows for robust conditional branching.
4. **Session Isolation**: Use unique `session_id`s for every generation request to prevent data corruption in the storage layer.