slm_inference 0.1.1

Backend-agnostic trait layer for running Small Language Model (SLM) inference in Rust.

Idea

This crate defines a set of composable traits that abstract over the full inference pipeline — from loading a GGUF model file to producing structured chat — without being tied to any specific backend (llama.cpp, ik_llama.cpp, etc.).

SlmModelConfig  →  load_gguf()  →  SlmModel
                                        ↓
                               SlmContextBuilder  →  SlmContext
                                                          ↓
                                                    SlmInference  +  SlmFormatter
                                                          ↓
                                                    SlmSimpleOracle  (implements SlmOracle)

Core Traits

  • SlmModelConfig — knows how to load a GGUF file and produce a SlmModel.
  • SlmModel — owns the loaded weights and creates a SlmContextBuilder.
  • SlmContextBuilder — configures sampling (temperature, top-k, top-p) and builds a SlmContext.
  • SlmContext — the stateful session: tokenizes input, runs batched decode, and samples tokens.
  • SlmBatch / SlmToken — low-level primitives for feeding tokens to the context.
  • SlmInference — higher-level prefill/generate loop over a SlmContext; includes save/rollback for KV-cache branching.
  • SlmHfModel — thin helper that downloads (or returns a cached) GGUF file from Hugging Face Hub.

Concrete backends (e.g. slm_llama, slm_ikllama) implement these traits against their own FFI layers.
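To illustrate what "backend-agnostic" means here, the following is a minimal sketch of generic code that depends only on traits while a stand-in backend supplies the behaviour. Every name in it (ToyModel, ToyContext, EchoModel, run) is hypothetical and much simpler than the crate's real traits, whose signatures are not shown in this README.

```rust
// Hypothetical, simplified traits: names and signatures are illustrative
// only and are NOT the crate's real API.
trait ToyModel {
    type Context: ToyContext;
    fn new_context(&self, temperature: f32) -> Self::Context;
}

trait ToyContext {
    /// Feed prompt text into the session state.
    fn prefill(&mut self, text: &str);
    /// Produce the next token, or None when generation is done.
    fn next_token(&mut self) -> Option<String>;
}

/// Generic driver: it only knows the traits, never a concrete backend.
fn run<M: ToyModel>(model: &M, prompt: &str, max_tokens: usize) -> String {
    let mut ctx = model.new_context(0.7);
    ctx.prefill(prompt);
    let mut out = String::new();
    for _ in 0..max_tokens {
        match ctx.next_token() {
            Some(tok) => out.push_str(&tok),
            None => break,
        }
    }
    out
}

/// Stand-in backend that echoes the prompt back word by word.
struct EchoModel;

struct EchoContext {
    pending: Vec<String>,
}

impl ToyModel for EchoModel {
    type Context = EchoContext;
    fn new_context(&self, _temperature: f32) -> EchoContext {
        EchoContext { pending: Vec::new() }
    }
}

impl ToyContext for EchoContext {
    fn prefill(&mut self, text: &str) {
        // Store words in reverse so that pop() yields them in original order.
        self.pending = text
            .split_whitespace()
            .rev()
            .map(|w| format!("{w} "))
            .collect();
    }

    fn next_token(&mut self) -> Option<String> {
        self.pending.pop()
    }
}
```

Swapping EchoModel for another type that implements the same traits would leave run untouched, which is the property the trait layer is meant to provide.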

Oracle Layer

SlmSimpleOracle<I, F> wraps an SlmInference and an SlmFormatter to provide a turn-aware conversational interface. Each ask/think call saves the KV-cache beforehand and rolls it back after generation, so the context (system prompt + injected history) is never contaminated by the answer.

let context = /* build SlmContext from backend */;
let formatter = SlmDynamicFormatter::try_from("gemma4")?;
let mut oracle = SlmSimpleOracle::new(context, formatter)?;

oracle.system("You are a precise QA tool.")?;
oracle.user("Some background text...")?;    // inject context without generating

let answer = oracle.ask("What is X?", None)?;   // plain generation
let answer = oracle.think("Reason about X", None)?; // chain-of-thought

println!("{}", answer);                     // final answer text
println!("{:?}", answer.thought());         // Option<&str> — reasoning trace
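The save/rollback behaviour described above can be pictured with a toy model. This is a sketch under strong assumptions: ToyCache is a hypothetical stand-in in which the "KV cache" is just a list of whitespace-separated tokens and a checkpoint is its length; the real cache lives inside the backend.

```rust
/// Toy stand-in for a KV cache: the list of token strings seen so far.
/// (Hypothetical type, not part of the crate.)
struct ToyCache {
    tokens: Vec<String>,
}

impl ToyCache {
    fn new() -> Self {
        ToyCache { tokens: Vec::new() }
    }

    /// Append text to the cache, one whitespace-separated "token" at a time.
    fn prefill(&mut self, text: &str) {
        self.tokens
            .extend(text.split_whitespace().map(str::to_owned));
    }

    /// A checkpoint is simply the current length.
    fn save(&self) -> usize {
        self.tokens.len()
    }

    /// Drop everything appended after the checkpoint.
    fn rollback(&mut self, checkpoint: usize) {
        self.tokens.truncate(checkpoint);
    }

    /// Answer a question without leaving it (or the reply) in the cache.
    fn ask(&mut self, question: &str) -> String {
        let checkpoint = self.save();
        self.prefill(question);
        // Pretend generation: report how many tokens were visible.
        let reply = format!("saw {} tokens", self.tokens.len());
        self.prefill(&reply);
        self.rollback(checkpoint);
        reply
    }
}
```

After ask returns, the cache holds exactly what it held before the call, so a second question sees the same system prompt and history as the first.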

SlmOracle methods

Method                              Description
system(text)                        Prefill a system turn
user(text)                          Prefill a user turn without generating
assistant(text)                     Prefill an assistant turn (history injection)
tool(name, text)                    Prefill a tool-response turn without generating
ask(text, brake)                    Generate a reply to text; context rolls back after
think(text, brake)                  Like ask, but injects the reasoning trigger so the model produces chain-of-thought
generate(role, text, think, brake)  Low-level entry point for the above
clear()                             Reset context and turn state

Formatter Layer

SlmFormatter abstracts chat-template formatting per model family.

pub trait SlmFormatter {
    fn bos(&self) -> Option<&str>;
    fn turn_start(&self, role: &SlmRole) -> String;
    fn turn_end(&self, role: &SlmRole) -> String;
    fn reasoning_bounds(&self) -> Option<(&str, &str)>;  // e.g. Some(("<think>", "</think>"))
    fn reasoning_trigger(&self) -> Option<&str>;          // prefix injected to activate CoT
    fn wrap_reasoning(&self, content: &str) -> String;
    fn tool_style(&self) -> SlmToolStyle;
    fn format_tool_call(&self, name: &str, arguments_json: &str) -> String;
    fn format_tool_response(&self, tool_name: &str, content: &str) -> String;
    fn strip_tags(&self, text: &str) -> String;
    fn clean(&self, text: &str) -> String;        // strips reasoning blocks + tags
    fn strip_thought(&self, text: &str) -> (String, Option<String>); // separates answer from CoT
}
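As an illustration of what strip_thought is described as doing, here is a minimal free-function sketch that separates an answer from its reasoning block. It is an assumption-laden example rather than the crate's implementation: the delimiters are passed in explicitly (standing in for reasoning_bounds()), only the first reasoning block is handled, and nesting is ignored.

```rust
/// Split `text` into (answer, thought) using the given reasoning delimiters.
/// Sketch only: handles the first reasoning block and does not support nesting.
fn strip_thought(text: &str, open: &str, close: &str) -> (String, Option<String>) {
    // No opening delimiter: the whole text is the answer.
    let Some(start) = text.find(open) else {
        return (text.trim().to_string(), None);
    };
    let body_start = start + open.len();

    // Unterminated block: treat everything after the opener as thought.
    let Some(rel_end) = text[body_start..].find(close) else {
        return (
            text[..start].trim().to_string(),
            Some(text[body_start..].trim().to_string()),
        );
    };
    let body_end = body_start + rel_end;

    let thought = text[body_start..body_end].trim().to_string();

    // The answer is whatever surrounds the reasoning block.
    let mut answer = String::new();
    answer.push_str(&text[..start]);
    answer.push_str(&text[body_end + close.len()..]);
    (answer.trim().to_string(), Some(thought))
}
```

Treating an unterminated block as pure thought is a design choice of this sketch; it keeps half-finished reasoning out of the answer when generation is cut short.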

Tool styles

  • SlmToolStyle::Inline — tool calls and responses are embedded inside the assistant turn (e.g. Gemma 4).
  • SlmToolStyle::SeparateTurn — tool responses occupy a dedicated turn with their own turn_start/turn_end (e.g. Llama 3).

Built-in formatters (slm_inference::models)

Key               Type                              Thinking  Tool style
"llama3"          Llama3Formatter                             SeparateTurn
"gemma4"          GemmaFormatter                              Inline
"gemma4-google"   GemmaFormatter (Google template)            Inline
"gemma4-unsloth"  GemmaFormatter (unsloth fixed)              Inline
"mistral"         MistralFormatter (v3 Tekken)                SeparateTurn
"mistral-legacy"  MistralFormatter (legacy)                   SeparateTurn
"qwen25"          Qwen25Formatter                             SeparateTurn
"phi4"            Phi4Formatter                               SeparateTurn

Use SlmDynamicFormatter::try_from("gemma4") to select at runtime by name.
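Selection by name through TryFrom<&str> could look roughly like the sketch below. ToyFormatter is a hypothetical enum, not SlmDynamicFormatter itself; the keys come from the table above, and collapsing the Gemma and Mistral template variants into one variant each is a simplification made only for this example.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a runtime-selected formatter.
#[derive(Debug, PartialEq)]
enum ToyFormatter {
    Llama3,
    Gemma,
    Mistral,
    Qwen25,
    Phi4,
}

impl TryFrom<&str> for ToyFormatter {
    type Error = String;

    fn try_from(key: &str) -> Result<Self, Self::Error> {
        match key {
            "llama3" => Ok(ToyFormatter::Llama3),
            // Simplification: the template variants share one enum variant here.
            "gemma4" | "gemma4-google" | "gemma4-unsloth" => Ok(ToyFormatter::Gemma),
            "mistral" | "mistral-legacy" => Ok(ToyFormatter::Mistral),
            "qwen25" => Ok(ToyFormatter::Qwen25),
            "phi4" => Ok(ToyFormatter::Phi4),
            other => Err(format!("unknown formatter key: {other}")),
        }
    }
}
```

Returning an error for unknown keys is what makes the trailing ? in the oracle example meaningful: a typo in the model family name surfaces at construction time rather than as malformed prompts later.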

Roles

pub enum SlmRole {
    System,
    User,
    Assistant,
    Tool(String),   // carries the tool name
}

Helpers: the constructor SlmRole::tool("calculator") and the accessors role.tool_name() and role.is_tool().
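A plausible shape for these helpers is sketched below. The enum mirrors the definition above, but the method bodies and return types (in particular tool_name returning Option<&str>) are assumptions, since the README does not show them.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum SlmRole {
    System,
    User,
    Assistant,
    Tool(String), // carries the tool name
}

impl SlmRole {
    /// Build a tool role from anything convertible to a String.
    pub fn tool(name: impl Into<String>) -> Self {
        SlmRole::Tool(name.into())
    }

    /// The tool name for Tool roles, None otherwise (assumed return type).
    pub fn tool_name(&self) -> Option<&str> {
        match self {
            SlmRole::Tool(name) => Some(name.as_str()),
            _ => None,
        }
    }

    /// True only for the Tool variant.
    pub fn is_tool(&self) -> bool {
        matches!(self, SlmRole::Tool(_))
    }
}
```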

Generation Control

SlmBrake controls when generation stops. Brake functions have the signature:

FnMut(answer: &str, last_token: &str, n_tokens: usize, fork_id: usize) -> SlmBrake

Variant   Effect
Continue  Keep generating
Finish    Stop and return SlmAnswer::Complete
Stop      Stop and return SlmAnswer::Incomplete
Delay     Emit current token as SlmAnswer::Partial, pause
Next      Defers decision to the next brake in the chain

Built-in factory:

SlmBrake::token_limit(512)   // stop after N tokens
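The chaining implied by the Next variant can be sketched as follows. This is an illustration under assumptions, not the crate's code: Brake, BrakeFn, decide, and stop_at_sentence_end are hypothetical names, the token-limit brake is assumed to return Stop, and falling back to Continue when every brake defers is a choice made for this example.

```rust
/// Local copy of the verdicts listed above (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Brake {
    Continue,
    Finish,
    Stop,
    Delay,
    Next,
}

/// Boxed brake function: (answer, last_token, n_tokens, fork_id) -> verdict.
type BrakeFn = Box<dyn FnMut(&str, &str, usize, usize) -> Brake>;

/// Ask each brake in order; the first verdict other than `Next` wins.
/// If every brake defers, keep generating (an assumption of this sketch).
fn decide(chain: &mut [BrakeFn], answer: &str, last: &str, n: usize, fork: usize) -> Brake {
    for brake in chain.iter_mut() {
        match (*brake)(answer, last, n, fork) {
            Brake::Next => continue,
            verdict => return verdict,
        }
    }
    Brake::Continue
}

/// Stop once `limit` tokens have been produced, otherwise defer.
fn token_limit(limit: usize) -> BrakeFn {
    Box::new(move |_: &str, _: &str, n: usize, _: usize| {
        if n >= limit {
            Brake::Stop
        } else {
            Brake::Next
        }
    })
}

/// Finish when the last token ends a sentence, otherwise defer.
fn stop_at_sentence_end() -> BrakeFn {
    Box::new(|_: &str, last: &str, _: usize, _: usize| {
        if last.ends_with('.') {
            Brake::Finish
        } else {
            Brake::Next
        }
    })
}
```

Ordering matters in such a chain: placing the sentence-end brake first lets a clean finish take priority over the hard token limit when both would fire on the same token.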

Answer

SlmAnswer wraps the generated text with its completion state and optional reasoning trace:

pub enum SlmAnswer {
    Complete(String, usize, Option<String>),  // answer, fork_id, CoT thought
    Partial(String, usize),                   // text so far, fork_id
    Incomplete(String, usize),                // text so far, fork_id
}

Method                   Returns
answer.as_str() / Deref  Final answer text
answer.thought()         Option<&str>: chain-of-thought content
answer.is_complete()     true if generation finished normally
answer.fork_id()         Sequence ID in the KV cache
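The accessors above could be implemented along these lines. This is a sketch, assuming fork_id is a usize (as in the brake signature) and that thought() is only populated for Complete answers; the method bodies are not taken from the crate.

```rust
use std::ops::Deref;

#[derive(Debug, Clone, PartialEq)]
pub enum SlmAnswer {
    Complete(String, usize, Option<String>), // answer, fork_id, CoT thought
    Partial(String, usize),
    Incomplete(String, usize),
}

impl SlmAnswer {
    /// The generated text, whatever the completion state.
    pub fn as_str(&self) -> &str {
        match self {
            SlmAnswer::Complete(text, _, _)
            | SlmAnswer::Partial(text, _)
            | SlmAnswer::Incomplete(text, _) => text,
        }
    }

    /// Reasoning trace, available only on complete answers in this sketch.
    pub fn thought(&self) -> Option<&str> {
        match self {
            SlmAnswer::Complete(_, _, thought) => thought.as_deref(),
            _ => None,
        }
    }

    pub fn is_complete(&self) -> bool {
        matches!(self, SlmAnswer::Complete(..))
    }

    pub fn fork_id(&self) -> usize {
        match self {
            SlmAnswer::Complete(_, id, _)
            | SlmAnswer::Partial(_, id)
            | SlmAnswer::Incomplete(_, id) => *id,
        }
    }
}

/// Deref to str so string methods work directly on an answer.
impl Deref for SlmAnswer {
    type Target = str;
    fn deref(&self) -> &str {
        self.as_str()
    }
}
```

With Deref in place, calls such as answer.len() or answer.contains("X") go straight to the answer text without an explicit as_str().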