# Oxide-rs

Fast AI Inference Library & CLI in Rust — a lightweight, CPU-based LLM inference engine inspired by llama.cpp.

## Features
- **GGUF Model Support** — Load quantized models in GGUF format
- **Full Tokenizer Compatibility** — Supports all llama.cpp tokenizer types via shimmytok (SPM, BPE, WPM, UGM, RWKV)
- **Automatic Chat Templates** — Uses Jinja templates embedded in GGUF files via minijinja
- **Streaming Output** — Real-time token generation with tokens-per-second metrics
- **Multiple Sampling Strategies** — Temperature, top-k, top-p, and argmax sampling
- **Repeat Penalty** — Prevents repetitive output with a configurable penalty window
- **Interactive REPL** — Full conversation mode with session history
- **One-Shot Mode** — Non-interactive generation for scripting/pipelines
- **Beautiful CLI** — Animated loading, syntax-highlighted output, Rust-themed
- **Smart Defaults** — Default system prompt reduces hallucinations; temperature tuned for accuracy
- **Model Warmup** — Pre-compiles compute kernels on startup for faster first-token generation
- **Memory-Mapped Loading** — OS-managed paging for instant load times and lower memory usage
- **Auto Thread Tuning** — Automatically detects the optimal thread count for your CPU
- **Pre-allocated Buffers** — Buffers are allocated up front, avoiding per-token allocations during generation
- **Tokenizer Caching** — Caches the tokenizer to disk for faster subsequent loads
- **Page Prefetching** — Preloads hot model pages into memory for faster first-token latency
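To make the sampling strategies above concrete, here is a small stand-alone sketch (illustrative only, not Oxide's actual implementation): with temperature `0.0` sampling reduces to greedy argmax, and otherwise the logits are divided by the temperature before a softmax turns them into a probability distribution to draw from.

```rust
// Illustrative sketch of temperature-scaled sampling (not Oxide's actual code).
// With temperature == 0.0 the sampler degenerates to greedy argmax.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .expect("non-empty logits")
}

/// Softmax over temperature-scaled logits; returns a probability distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    // Subtract the max for numerical stability before exponentiating.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Lower temperatures sharpen the distribution toward the most likely token; higher temperatures flatten it, trading accuracy for variety.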
## Installation

### Prerequisites
- Rust 1.70+ (2021 edition)
- A GGUF quantized model file with embedded chat template
### Build from Source

```sh
# Clone the repository
# Build release binary
# Or using cargo directly
```
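The elided commands likely look like the following sketch. The repository URL and the `make` target name are assumptions; `cargo build --release` is the standard Cargo fallback:

```shell
# Clone the repository (URL is a placeholder — substitute the real one)
git clone <repository-url>
cd Oxide-rs

# Build release binary (Makefile target name assumed)
make build

# Or using cargo directly
cargo build --release
```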
### Install Locally

```sh
# Installs to ~/.local/bin/oxide-rs
```
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL` | `~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf` | Path to GGUF model for `make run` |
```sh
# Set custom model path for make run
export MODEL=/path/to/model.gguf

# Or inline
MODEL=/Models/phi-3.Q4_K_M.gguf make run
```
## Quick Start

```sh
# Interactive chat mode (uses default helpful system prompt)
# With custom system prompt
# One-shot generation
# With custom sampling parameters
```
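Hedged examples of what the elided invocations likely look like. The binary name (`oxide-rs`, per Install Locally) and the flags are taken from this README; the model path is the documented default — adjust to your setup:

```shell
# Interactive chat mode (uses default helpful system prompt)
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf

# With custom system prompt
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf -s "Answer in one sentence."

# One-shot generation
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf --once -p "Explain ownership in Rust"

# With custom sampling parameters
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf --temperature 0.7 --top-k 40 --top-p 0.9
```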
## CLI Reference

| Flag | Default | Description |
|---|---|---|
| `-m, --model` | required | Path to GGUF model file |
| `-t, --tokenizer` | auto | Path to `tokenizer.json` (extracted from GGUF if omitted) |
| `-s, --system` | auto | System prompt (defaults to helpful assistant prompt) |
| `--max-tokens` | `512` | Maximum tokens to generate |
| `--temperature` | `0.3` | Sampling temperature (`0.0` = greedy/argmax) |
| `--top-k` | none | Top-k sampling threshold |
| `--top-p` | none | Nucleus sampling threshold |
| `--repeat-penalty` | `1.1` | Penalty for repeated tokens |
| `--repeat-last-n` | `64` | Context window for repeat penalty |
| `--seed` | `299792458` | Random seed for reproducibility |
| `--threads` | auto | Number of inference threads (auto-detects optimal count) |
| `-p, --prompt` | none | Input prompt (for one-shot mode) |
| `-o, --once` | `false` | Run in non-interactive mode |
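As one concrete example of the knobs above, a llama.cpp-style repeat penalty — which `--repeat-penalty` and `--repeat-last-n` presumably mirror (an assumption, not confirmed by this README) — can be sketched as:

```rust
// Sketch of a llama.cpp-style repeat penalty (assumed semantics):
// every token id seen in the last `repeat_last_n` positions has its logit
// pushed toward "less likely" — divided by the penalty when positive,
// multiplied when negative.
fn apply_repeat_penalty(logits: &mut [f32], recent_tokens: &[usize], penalty: f32) {
    for &tok in recent_tokens {
        if let Some(l) = logits.get_mut(tok) {
            if *l >= 0.0 {
                *l /= penalty;
            } else {
                *l *= penalty;
            }
        }
    }
}
```

With the default penalty of `1.1`, recently generated tokens become modestly less likely to repeat; `1.0` disables the effect.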
## Interactive Commands

| Command | Description |
|---|---|
| `/clear` | Clear conversation history for the current session |
| `/exit` or `/quit` | Exit the program |
| `/help` | Show available commands |
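A minimal sketch of how dispatch for the commands above can look — illustrative only; the function name and return convention are assumptions, not Oxide's actual code:

```rust
// Illustrative REPL command dispatch for the commands above (not Oxide's code).
// Returns Some(response) when the input was a recognized command, None when the
// line should be treated as a normal prompt.
fn handle_command(line: &str, history: &mut Vec<String>) -> Option<&'static str> {
    match line.trim() {
        "/clear" => {
            history.clear();
            Some("history cleared")
        }
        "/exit" | "/quit" => Some("bye"),
        "/help" => Some("/clear, /exit, /quit, /help"),
        _ => None,
    }
}
```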
## Chat Templates

Oxide automatically uses the chat template embedded in GGUF files:

```mermaid
graph LR
    A[GGUF File] --> B[tokenizer.chat_template]
    B --> C[minijinja]
    C --> D[Rendered Prompt]
```
### How It Works

- **Extraction** — Reads `tokenizer.chat_template` from GGUF metadata
- **Rendering** — Uses minijinja to render Jinja2 templates
- **Multi-turn** — Maintains conversation history within the session
### Example Template (ChatML)

```jinja
{% for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
<|im_start|>assistant
```
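For illustration, rendering this template with a system and a user message would produce output along these lines (exact whitespace is an assumption — Jinja block tags can leave extra blank lines depending on trimming settings):

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

The trailing `<|im_start|>assistant` cues the model to generate the assistant's reply.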
## Supported Models
Any GGUF model with an embedded tokenizer.chat_template will work automatically:
| Model Family | Template Source |
|---|---|
| LLaMA 3.x | Embedded in GGUF |
| Mistral | Embedded in GGUF |
| Qwen | Embedded in GGUF |
| Gemma | Embedded in GGUF |
| Phi-3 | Embedded in GGUF |
| SmolLM | Embedded in GGUF |
| LFM | Embedded in GGUF |
Note: If your GGUF file lacks a chat template, Oxide will error and ask you to use a model with an embedded template.
### Model Architectures
Oxide uses Candle for inference:
- LLaMA — LLaMA 2, LLaMA 3, Mistral, Qwen, Phi, etc.
- LFM2 — Liquid Foundation Models
Note: Any GGUF model with LLaMA-compatible or LFM2 architecture should work.
### Tokenizer Support
Oxide uses shimmytok for tokenizer support, providing 100% llama.cpp compatibility:
| Tokenizer Type | Description |
|---|---|
| SPM | SentencePiece (LLaMA, Mistral, etc.) |
| BPE | Byte-Pair Encoding (GPT-2 style) |
| WPM | WordPiece Model (BERT style) |
| UGM | Unigram Model |
| RWKV | RWKV tokenizers |
## Architecture

```mermaid
graph TB
    subgraph CLI["CLI Layer (clap)"]
        args[Argument Parser]
    end
    subgraph TUI["Terminal UI (crossterm)"]
        banner[Animated Banner]
        loader[Ferris Loader]
        stream[Streaming Output]
        theme[Color Theme]
    end
    subgraph Inference["Inference Layer"]
        generator[Generator]
        template[Chat Template]
        sampler[Logits Processor]
        callback[StreamEvent Callback]
    end
    subgraph Model["Model Layer"]
        model[Model Weights]
        tokenizer[Tokenizer Wrapper]
    end
    subgraph Core["Core Dependencies"]
        candle[Candle Transformers]
        shimmytok[shimmytok]
        minijinja[minijinja]
    end

    args --> generator
    generator --> template
    generator --> sampler
    generator --> model
    generator --> tokenizer
    generator --> callback
    callback --> stream
    template --> minijinja
    model --> candle
    tokenizer --> shimmytok
    banner --> theme
    loader --> theme
    stream --> theme
```
### Data Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Generator
    participant Template
    participant Model
    participant Tokenizer
    participant Stream

    User->>CLI: Enter prompt
    CLI->>Generator: generate(prompt, config)
    Generator->>Template: apply(messages)
    Template-->>Generator: formatted prompt
    Generator->>Tokenizer: encode(prompt)
    Tokenizer-->>Generator: tokens[]
    Generator->>Model: forward(tokens)
    Model-->>Generator: logits
    loop For each token
        Generator->>Generator: sample(logits)
        Generator->>Tokenizer: decode_next(token)
        Tokenizer-->>Generator: text
        Generator->>Stream: StreamEvent::Token(text)
        Stream->>User: Display token
    end
    Generator->>Stream: StreamEvent::Done
    Stream->>User: Show stats (tok/s)
```
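The loop above can be sketched as a callback-driven API. Everything here is illustrative — `StreamEvent` appears in the diagram, but its exact shape and the `generate` signature are assumptions, and the real sample/decode loop is replaced by a whitespace tokenizer:

```rust
use std::time::Instant;

// Illustrative streaming events matching the diagram (shapes are assumptions).
#[derive(Debug, PartialEq)]
enum StreamEvent {
    Token(String),
    Done { tokens: usize, tok_per_sec: f64 },
}

// Stand-in for the real sample/decode loop: emits each whitespace-separated
// "token" through the callback, then a Done event carrying the tok/s stat.
fn generate(prompt: &str, mut on_event: impl FnMut(StreamEvent)) {
    let start = Instant::now();
    let mut count = 0;
    for word in prompt.split_whitespace() {
        on_event(StreamEvent::Token(word.to_string()));
        count += 1;
    }
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    on_event(StreamEvent::Done { tokens: count, tok_per_sec: count as f64 / secs });
}
```

Pushing events through a callback keeps the inference layer free of any terminal concerns; the TUI layer decides how to display tokens and stats.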
## Development

```sh
# Development build (faster compile)
# Run with model (uses MODEL env var)
# Or override MODEL inline
# Format code
# Run linter
# Clean build artifacts
```
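These likely expand to the standard Cargo commands plus the Makefile's `run` target (the `make run` target and `MODEL` variable are documented above; the rest are standard Cargo subcommands assumed to apply here):

```shell
# Development build (faster compile)
cargo build

# Run with model (uses MODEL env var)
make run

# Or override MODEL inline
MODEL=/Models/phi-3.Q4_K_M.gguf make run

# Format code
cargo fmt

# Run linter
cargo clippy

# Clean build artifacts
cargo clean
```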
## Dependencies

| Crate | Purpose |
|---|---|
| `candle-core` | Tensor operations, ML primitives |
| `candle-nn` | Neural network layers |
| `candle-transformers` | Pre-built model architectures (LLaMA, LFM2) |
| `shimmytok` | GGUF tokenizer (100% llama.cpp compatible) |
| `minijinja` | Jinja2 template engine for chat templates |
| `clap` | CLI argument parsing with derive macros |
| `crossterm` | Cross-platform terminal control |
| `anyhow` | Ergonomic error handling |
| `serde` | Serialization for messages |
| `tracing` | Structured logging |
## Performance

- **CPU-only inference** — No GPU dependencies, portable binaries
- **Quantized models** — Q4_K_M provides a good quality/speed tradeoff; other quantizations supported
- **Streaming decode** — Tokens displayed as generated for responsive UX
- **Context caching** — Efficient multi-turn conversations with token history management
- **Model warmup** — Pre-compiles compute kernels on startup for faster first-token generation
- **Smart defaults** — Temperature 0.3 for factual accuracy; default system prompt reduces hallucinations
## Roadmap
- Multi-modal support
- OpenAI-compatible API server
- Model download/management
## License
MIT License — see LICENSE for details.