orion-core 0.5.0

Backend-agnostic agent harness for local LLM inference
Documentation

orion-core

Agent harness for local LLM inference. Backend-agnostic — bring your own model runtime (llama.cpp, MLX, cloud APIs, anything).

orion-core handles the conversation loop so you don't have to: context management, token budgets, streaming events, chat formatting, and an automatic tool-execution loop (the agent parses tool calls, runs your tools, feeds the results back, and repeats until the model gives a final answer — see tools).

How It Works

User sends "Hello"
   Agent.prompt("Hello")
     Context pipeline (prune old messages to fit token budget)
       Format prompt (ChatML template + tool definitions)
         LlmBackend.generate() (streams tokens one by one)
           AgentEvent stream (your UI subscribes here)
             If the model called tools: run them, append results, loop back
               Done (model returns a tool-free answer)

You implement one trait (LlmBackend) for your inference engine. orion-core handles everything above it.

Quick Start

use std::sync::Arc;
use orion_core::{Agent, AgentConfig, AgentEvent, LlmBackend};
use tokio::sync::mpsc;

// 1. Implement the backend trait for your engine (see `backend` below)
let backend: Arc<dyn LlmBackend> = Arc::new(MyBackend::new());

// 2. Create an agent
let mut agent = Agent::new(AgentConfig {
    system_prompt: "You are a helpful assistant.".into(),
    ..Default::default()
});

// 3. You supply the event channel; the agent streams events into it
//    while generation runs, then returns when the turn is done.
let (tx, mut rx) = mpsc::unbounded_channel::<AgentEvent>();

// Consume events concurrently — forward them to your UI.
let consumer = tokio::spawn(async move {
    while let Some(event) = rx.recv().await {
        match event {
            AgentEvent::MessageDelta { delta, .. } => print!("{delta}"),
            AgentEvent::MessageEnd { message } => {
                println!("\n\nDone: {} tokens", message.token_count.unwrap_or(0));
            }
            AgentEvent::ContextBudget { used_tokens, max_tokens, .. } => {
                println!("Context: {used_tokens}/{max_tokens} tokens");
            }
            AgentEvent::Error { message } => eprintln!("Error: {message}"),
            _ => {}
        }
    }
});

agent.prompt("What is Rust?", backend, tx).await?;
consumer.await?;

A complete, runnable version lives in examples/mock_backend.rs — try it with cargo run --example mock_backend.

Don't want to manage the channel yourself? agent.prompt_stream(text, backend) creates it for you and hands back (receiver, future) — drive the future (e.g. with tokio::join!) while you drain the receiver.

For a real over-the-wire backend, examples/openai_backend.rs implements LlmBackend against a streaming OpenAI-compatible /v1/completions endpoint (OpenAI, llama.cpp server, vLLM, LM Studio, Ollama). Run it with cargo run --example openai_backend --features openai-example.

Modules

agent — The Orchestrator

The Agent struct is the main entry point. It owns the conversation state and drives the prompt → LLM → response loop.

use orion_core::{Agent, AgentConfig, InferenceParams, ContextConfig};

let mut agent = Agent::new(AgentConfig {
    system_prompt: "You are a coding assistant.".into(),
    inference_params: InferenceParams {
        max_tokens: 4096,
        temperature: 0.4,
        context_size: 8192,
        n_threads: 6,
    },
    context_config: ContextConfig {
        max_context_tokens: 8192,
        max_response_tokens: 4096,
        ..Default::default()
    },
    ..Default::default()
});

// Change settings on the fly
agent.set_system_prompt("You are a pirate.");
agent.set_inference_params(InferenceParams { temperature: 1.2, ..Default::default() });

// Conversation management
agent.clear();                              // Reset conversation
agent.replace_messages(saved_messages);     // Restore a saved conversation

// Abort a running generation
agent.abort();

backend — Bring Your Own LLM

Implement the LlmBackend trait to plug in any inference engine:

use orion_core::{LlmBackend, InferenceParams, GenerationResult, TokenCallback, CoreResult};
use std::sync::atomic::AtomicBool;
use std::sync::Arc;

struct LlamaCppBackend {
    // your engine state
}

impl LlmBackend for LlamaCppBackend {
    fn generate(
        &self,
        prompt: &str,               // Fully formatted (chat template applied)
        params: &InferenceParams,    // max_tokens, temperature, context_size
        abort: Arc<AtomicBool>,      // Check this each token to support cancellation
        on_token: TokenCallback,     // Call with (token_text, count, tokens_per_sec)
    ) -> CoreResult<GenerationResult> {
        // Feed prompt, sample tokens, call on_token for each one.
        // Return final stats when done.
        todo!()
    }

    fn tokenize_count(&self, text: &str) -> CoreResult<u32> {
        // Count tokens without running inference.
        // Used by the context pipeline for budget management.
        todo!()
    }

    fn is_ready(&self) -> bool {
        // Whether a model is loaded and ready for inference.
        todo!()
    }
}

The backend runs on a blocking thread — no async required. orion-core handles the async orchestration.

messages — Conversation Data

Messages support five roles covering the full agent lifecycle:

use orion_core::Message;

// Standard conversation
let sys = Message::system("msg-1", "You are helpful.");
let user = Message::user("msg-2", "Hello!");
let asst = Message::assistant("msg-3", "Hi there!");

// Tool interaction
let result = Message::tool_result(
    "msg-4",            // message id
    "call-1",           // tool_call_id (links to the assistant's request)
    "read_file",        // tool name
    "file contents...", // result content
    false,              // is_error
);

Roles: System, User, Assistant, ToolCall, ToolResult

Every message has an id, timestamp, and optional token_count (populated after tokenization). Assistant messages can carry tool_calls; tool result messages carry a tool_result.

events — Real-Time UI Updates

The agent emits events as it processes. Subscribe to these for building reactive UIs.

Event sequence for a simple prompt:

AgentStart
TurnStart
MessageStart    { user message }
MessageEnd      { user message }
ContextBudget   { used: 120, max: 4096, included: 5, pruned: 0 }
MessageDelta    { delta: "Hello", tokens: 1, tps: 45.2 }
MessageDelta    { delta: " there", tokens: 2, tps: 46.1 }
MessageDelta    { delta: "!", tokens: 3, tps: 44.8 }
MessageEnd      { assistant message }
TurnEnd         { message, tool_results: [] }
AgentEnd        { messages: [...] }

Event sequence with tool calls:

AgentStart
MessageStart    { user message }
MessageEnd      { user message }
TurnStart
ContextBudget   ...
MessageDelta    ...
MessageEnd      { assistant message with tool_calls }
ToolExecStart   { tool_call_id, tool_name, args }
ToolExecUpdate  { partial progress }
ToolExecEnd     { result }
MessageStart    { tool_result message }
MessageEnd      { tool_result message }
TurnEnd         { message, tool_results: [...] }
TurnStart                            ← new turn: LLM responds to tool result
MessageDelta    ...
MessageEnd      { final assistant message }
TurnEnd         { message, tool_results: [] }
AgentEnd        { messages: [...] }

All event types:

Event When Key Data
AgentStart Processing begins
AgentEnd All done All new messages
TurnStart New LLM call begins
TurnEnd LLM call + tools done Assistant message, tool results
MessageStart Any message added Full message
MessageDelta Each streamed token delta, tokens_generated, tokens_per_sec
MessageEnd Message complete Full message
ToolExecStart Tool begins running Tool name, args
ToolExecUpdate Tool streams progress Partial output
ToolExecEnd Tool finished Result, is_error
ContextBudget After context prep Tokens used/max, messages included/pruned
Warning Non-fatal issue Warning text
Error Fatal error Error text

context — Token Budget Management

Handles the hard problem of fitting a conversation into a fixed-size context window.

What it does:

  1. Prunes old messages when the conversation exceeds the token budget (sliding window — keeps system prompt + most recent messages)
  2. Formats the surviving messages into a ChatML prompt string
  3. Reports how many tokens are used and how many messages were pruned
use orion_core::ChatMLTemplate;
use orion_core::context::{prepare_context, ContextConfig};

let token_counter = |text: &str| -> u32 { /* your tokenizer */ 0 };

// Prune to fit the budget *and* format with a chat template in one step.
let prepared = prepare_context(
    &ChatMLTemplate,           // any `ChatTemplate` impl
    "You are helpful.",        // system prompt
    &messages,                 // full conversation history
    &tool_schemas,             // tool schemas to inject (may be empty)
    &ContextConfig::default(),
    &token_counter,
)?;

// `prepared.prompt` is the formatted string to feed your backend.
println!(
    "{} tokens, {} kept, {} pruned",
    prepared.token_count, prepared.messages_included, prepared.messages_pruned,
);

The agent calls this automatically before each LLM call. You don't need to call it directly unless you want custom control.

Prune strategies (ContextConfig::prune_strategy):

  • SlidingWindow (default) — drop the oldest turns first to fit the budget.
  • Summarize — before pruning, the agent folds the oldest overflowing turns into a single pinned summary message (one extra backend call), so their gist survives instead of being dropped. Prior summaries are consolidated, so exactly one accumulates. Best-effort: if summarization fails it falls back to the sliding window.

Pinned messages. Any Message with pinned == true always survives pruning, regardless of budget or strategy. Build one with Message::user(id, text).pinned(), or toggle an existing message via agent.set_pinned(message_id, true). Pruning is turn-aware, so a pinned message keeps its whole turn (no orphaned pairs).

template — Chat Prompt Formats

Each model family wants its prompt wrapped a certain way. orion-core ships a ChatTemplate for the common ones and picks the right one automatically.

Supported families: ChatML (default), Llama 3, Llama 2, Mistral / Mixtral, Gemma / Gemma 2, Phi-3, DeepSeek (LLM chat), Command-R / Command-R+, Alpaca, and Vicuna.

  • detect_template(gguf_template) — inspects a GGUF metadata template string and returns the matching impl (falling back to ChatML when nothing matches).
  • template_from_name(name) — resolves a manual-override name (with common aliases, e.g. llama-2, phi-3, cohere) to a template, or None for an unimplemented family so the caller can fall back to auto-detection.
  • Agent::with_template(config, template) / agent.set_template(template) — set or swap the template at runtime.

Every template also implements the per-message and per-system formatting hooks the context pipeline needs for accurate token-budget accounting, and advertises tools through the same tool_call convention (see below).

tools — Give the Model Abilities

Agent::prompt drives the full cycle automatically: it injects your tool schemas into the system prompt, parses the model's tool calls out of its reply, runs the matching tool, appends the result to the conversation, and loops back to the model until it returns a tool-free answer (bounded by AgentConfig::max_tool_iterations, default 8). Each step emits ToolExecStart / ToolExecUpdate / ToolExecEnd events.

Tool-call convention. Templates advertise — and [parse_tool_calls] reads — a fenced JSON block:

```tool_call
{"name": "read_file", "arguments": {"path": "Cargo.toml"}}
```

A JSON array invokes several tools in one turn. Parsing is lenient: a ```json fence or a whole-message bare JSON object carrying both name and arguments also count, so smaller models still trigger tools when they drift from the exact format. Register tools with agent.set_tools(vec![Box::new(MyTool)]); with no tools registered, parsing is skipped entirely and replies pass through verbatim.

Define tools the model can call. Each tool has a name, description, JSON Schema for parameters, and an async execute function.

use orion_core::{Tool, ToolOutput, ToolSchema, CoreResult};
use orion_core::tools::ToolUpdateCallback;
use async_trait::async_trait;

struct ReadFileTool;

#[async_trait]
impl Tool for ReadFileTool {
    fn name(&self) -> &str { "read_file" }
    fn label(&self) -> &str { "Read File" }
    fn description(&self) -> &str { "Read a file from disk" }

    fn parameters_schema(&self) -> serde_json::Value {
        serde_json::json!({
            "type": "object",
            "properties": {
                "path": { "type": "string", "description": "File path to read" }
            },
            "required": ["path"]
        })
    }

    async fn execute(
        &self,
        _tool_call_id: &str,
        args: serde_json::Value,
        _on_update: Option<ToolUpdateCallback>,
    ) -> CoreResult<ToolOutput> {
        let path = args["path"].as_str().unwrap_or("");
        let content = std::fs::read_to_string(path)
            .map_err(|e| orion_core::CoreError::Tool(e.to_string()))?;
        Ok(ToolOutput {
            content,
            details: serde_json::json!({"path": path}),
        })
    }
}

// Register with the agent
agent.set_tools(vec![Box::new(ReadFileTool)]);

Tool schemas are automatically injected into the system prompt when formatting context, and Agent::prompt runs the full tool call → execute → feed result → LLM responds cycle for you (see the section intro above).

Opting out. The Tool trait and the execution loop live behind the default tools feature, which pulls in async-trait. Minimal consumers that only need plain chat can drop it:

orion-core = { version = "0.2", default-features = false }

Tool-call parsing (parse_tool_calls, ParsedToolCall) and ToolSchema stay available either way — only the Tool trait, ToolOutput, ToolUpdateCallback, and Agent::set_tools require the feature.

The code snippets in this README are mirrored by compile-checked doctests on the corresponding API items, so they can't silently drift from the real signatures. Run them with cargo test --doc.

error — Error Types

use orion_core::{CoreError, CoreResult};

// Error variants
CoreError::Backend("No model loaded".into())    // LLM backend issues
CoreError::Context("Token limit exceeded".into()) // Context pipeline issues
CoreError::Tool("File not found".into())          // Tool execution issues
CoreError::Agent("Empty message".into())          // Agent logic issues
CoreError::Aborted                                // User cancelled

All errors are serializable (implements Serialize) for easy transport over IPC.

Architecture

┌──────────────────────────────────────────────────┐
  Your Application (OrionPod, CLI, server, etc.)  
├──────────────────────────────────────────────────┤
  Agent                                           │
  ├── prompt("Hello")                             
  ├── Conversation state (Vec<Message>)           
  ├── System prompt                               │
  ├── Registered tools                            │
  └── AgentConfig (inference params, context cfg) 
├──────────────────────────────────────────────────┤
  Context Pipeline                                │
  └── prepare_context()  prune + template format │
├──────────────────────────────────────────────────┤
  LlmBackend (trait)  you implement this         │
  ├── generate()  run inference, stream tokens   │
  ├── tokenize_count()  count tokens             │
  └── is_ready()  check model status             │
├──────────────────────────────────────────────────┤
  Your inference engine                           │
  (llama.cpp, MLX, ONNX, cloud API, etc.)         
└──────────────────────────────────────────────────┘

Events flow upward through an unbounded channel (tokio::sync::mpsc). Your UI or application layer subscribes to AgentEvents and reacts in real time.

Stability

orion-core follows SemVer. While 0.x, a minor bump may carry breaking changes and a patch bump is additive/fixes only. CoreError and AgentEvent are #[non_exhaustive] — match them with a wildcard arm so new variants don't break your build. The MSRV is Rust 1.85 (raised only in a minor release), and that guarantee covers the default feature set; optional example features like openai-example may need a newer toolchain. See CONTRIBUTING.md for the full policy.

License

MIT © Kumar Anirudha. See LICENSE.