# Oxide-rs

Fast AI Inference Library & CLI in Rust — a lightweight, CPU-based LLM inference engine inspired by llama.cpp.

## Features
- **GGUF Model Support** — Load quantized models in GGUF format
- **Full Tokenizer Compatibility** — Supports all llama.cpp tokenizer types via shimmytok (SPM, BPE, WPM, UGM, RWKV)
- **Automatic Chat Templates** — Uses Jinja templates embedded in GGUF files via minijinja
- **Streaming Output** — Real-time token generation with tokens-per-second metrics
- **Multiple Sampling Strategies** — Temperature, top-k, top-p, and argmax sampling
- **Repeat Penalty** — Prevents repetitive output with a configurable penalty window
- **Interactive REPL** — Full conversation mode with session history
- **One-Shot Mode** — Non-interactive generation for scripting/pipelines
- **Beautiful CLI** — Animated loading, syntax-highlighted output, Rust-themed
- **Smart Defaults** — Default system prompt reduces hallucinations; temperature tuned for accuracy
- **Model Warmup** — Pre-compiles compute kernels on startup for faster first-token generation
- **Memory-Mapped Loading** — OS-managed paging for instant load times and lower memory usage
- **Auto Thread Tuning** — Automatically detects the optimal thread count for your CPU
- **Pre-allocated Buffers** — Buffers are allocated up front, avoiding per-token allocations during generation
- **Tokenizer Caching** — Caches the tokenizer to disk for faster subsequent loads
- **Page Prefetching** — Preloads hot model pages into memory for faster first-token latency
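To make the sampling strategies above concrete, here is a small stand-alone sketch (illustrative only, not Oxide's actual implementation): with temperature `0.0` sampling reduces to greedy argmax, and otherwise the logits are divided by the temperature before a softmax turns them into a probability distribution to draw from.

```rust
// Illustrative sketch of temperature-scaled sampling (not Oxide's actual code).
// With temperature == 0.0 the sampler degenerates to greedy argmax.
fn argmax(logits: &[f32]) -> usize {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .expect("non-empty logits")
}

/// Softmax over temperature-scaled logits; returns a probability distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    // Subtract the max for numerical stability before exponentiating.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Lower temperatures sharpen the distribution toward the most likely token; higher temperatures flatten it, trading accuracy for variety.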
## Installation

### Prerequisites
- Rust 1.70+ (2021 edition)
- A GGUF quantized model file with embedded chat template
### Build from Source

```sh
# Clone the repository
# Build release binary
# Or using cargo directly
```
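The elided commands likely look like the following sketch. The repository URL and the `make` target name are assumptions; `cargo build --release` is the standard Cargo fallback:

```shell
# Clone the repository (URL is a placeholder — substitute the real one)
git clone <repository-url>
cd Oxide-rs

# Build release binary (Makefile target name assumed)
make build

# Or using cargo directly
cargo build --release
```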
### Install Locally

```sh
# Installs to ~/.local/bin/oxide-rs
```
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL` | `~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf` | Path to GGUF model for `make run` |
```sh
# Set custom model path for make run
export MODEL=/path/to/model.gguf

# Or inline
MODEL=/Models/phi-3.Q4_K_M.gguf make run
```
## Quick Start

```sh
# Interactive chat mode (uses default helpful system prompt)
# With custom system prompt
# One-shot generation
# With custom sampling parameters
```
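Hedged examples of what the elided invocations likely look like. The binary name (`oxide-rs`, per Install Locally) and the flags are taken from this README; the model path is the documented default — adjust to your setup:

```shell
# Interactive chat mode (uses default helpful system prompt)
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf

# With custom system prompt
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf -s "Answer in one sentence."

# One-shot generation
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf --once -p "Explain ownership in Rust"

# With custom sampling parameters
oxide-rs -m ~/Models/LFM2.5-1.2B-Instruct-Q4_K_M.gguf --temperature 0.7 --top-k 40 --top-p 0.9
```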
## CLI Reference

| Flag | Default | Description |
|---|---|---|
| `-m, --model` | required | Path to GGUF model file |
| `-t, --tokenizer` | auto | Path to `tokenizer.json` (extracted from GGUF if omitted) |
| `-s, --system` | auto | System prompt (defaults to helpful assistant prompt) |
| `--max-tokens` | `512` | Maximum tokens to generate |
| `--temperature` | `0.3` | Sampling temperature (`0.0` = greedy/argmax) |
| `--top-k` | none | Top-k sampling threshold |
| `--top-p` | none | Nucleus sampling threshold |
| `--repeat-penalty` | `1.1` | Penalty for repeated tokens |
| `--repeat-last-n` | `64` | Context window for repeat penalty |
| `--seed` | `299792458` | Random seed for reproducibility |
| `--threads` | auto | Number of inference threads (auto-detects optimal count) |
| `-p, --prompt` | none | Input prompt (for one-shot mode) |
| `-o, --once` | `false` | Run in non-interactive mode |
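As one concrete example of the knobs above, a llama.cpp-style repeat penalty — which `--repeat-penalty` and `--repeat-last-n` presumably mirror (an assumption, not confirmed by this README) — can be sketched as:

```rust
// Sketch of a llama.cpp-style repeat penalty (assumed semantics):
// every token id seen in the last `repeat_last_n` positions has its logit
// pushed toward "less likely" — divided by the penalty when positive,
// multiplied when negative.
fn apply_repeat_penalty(logits: &mut [f32], recent_tokens: &[usize], penalty: f32) {
    for &tok in recent_tokens {
        if let Some(l) = logits.get_mut(tok) {
            if *l >= 0.0 {
                *l /= penalty;
            } else {
                *l *= penalty;
            }
        }
    }
}
```

With the default penalty of `1.1`, recently generated tokens become modestly less likely to repeat; `1.0` disables the effect.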
## Interactive Commands

| Command | Description |
|---|---|
| `/clear` | Clear conversation history for the current session |
| `/exit` or `/quit` | Exit the program |
| `/help` | Show available commands |
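A minimal sketch of how dispatch for the commands above can look — illustrative only; the function name and return convention are assumptions, not Oxide's actual code:

```rust
// Illustrative REPL command dispatch for the commands above (not Oxide's code).
// Returns Some(response) when the input was a recognized command, None when the
// line should be treated as a normal prompt.
fn handle_command(line: &str, history: &mut Vec<String>) -> Option<&'static str> {
    match line.trim() {
        "/clear" => {
            history.clear();
            Some("history cleared")
        }
        "/exit" | "/quit" => Some("bye"),
        "/help" => Some("/clear, /exit, /quit, /help"),
        _ => None,
    }
}
```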
## Chat Templates

Oxide automatically uses the chat template embedded in GGUF files:

```mermaid
graph LR
    A[GGUF File] --> B[tokenizer.chat_template]
    B --> C[minijinja]
    C --> D[Rendered Prompt]
```
### How It Works

- **Extraction** — Reads `tokenizer.chat_template` from GGUF metadata
- **Rendering** — Uses minijinja to render Jinja2 templates
- **Multi-turn** — Maintains conversation history within the session
### Example Template (ChatML)

```jinja
{% for message in messages %}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{% endfor %}
<|im_start|>assistant
```
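For illustration, rendering this template with a system and a user message would produce output along these lines (exact whitespace is an assumption — Jinja block tags can leave extra blank lines depending on trimming settings):

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

The trailing `<|im_start|>assistant` cues the model to generate the assistant's reply.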
## Supported Models
Any GGUF model with an embedded tokenizer.chat_template will work automatically:
| Model Family | Template Source |
|---|---|
| LLaMA 3.x | Embedded in GGUF |
| Mistral | Embedded in GGUF |
| Qwen | Embedded in GGUF |
| Gemma | Embedded in GGUF |
| Phi-3 | Embedded in GGUF |
| SmolLM | Embedded in GGUF |
| LFM | Embedded in GGUF |
Note: If your GGUF file lacks a chat template, Oxide will error and ask you to use a model with an embedded template.
### Model Architectures
Oxide uses Candle for inference:
- LLaMA — LLaMA 2, LLaMA 3, Mistral, Qwen, Phi, etc.
- LFM2 — Liquid Foundation Models
Note: Any GGUF model with LLaMA-compatible or LFM2 architecture should work.
### Tokenizer Support
Oxide uses shimmytok for tokenizer support, providing 100% llama.cpp compatibility:
| Tokenizer Type | Description |
|---|---|
| SPM | SentencePiece (LLaMA, Mistral, etc.) |
| BPE | Byte-Pair Encoding (GPT-2 style) |
| WPM | WordPiece Model (BERT style) |
| UGM | Unigram Model |
| RWKV | RWKV tokenizers |
## Architecture

```mermaid
graph TB
    subgraph CLI["CLI Layer (clap)"]
        args[Argument Parser]
    end
    subgraph TUI["Terminal UI (crossterm)"]
        banner[Animated Banner]
        loader[Ferris Loader]
        stream[Streaming Output]
        theme[Color Theme]
    end
    subgraph Inference["Inference Layer"]
        generator[Generator]
        template[Chat Template]
        sampler[Logits Processor]
        callback[StreamEvent Callback]
    end
    subgraph Model["Model Layer"]
        model[Model Weights]
        tokenizer[Tokenizer Wrapper]
    end
    subgraph Core["Core Dependencies"]
        candle[Candle Transformers]
        shimmytok[shimmytok]
        minijinja[minijinja]
    end

    args --> generator
    generator --> template
    generator --> sampler
    generator --> model
    generator --> tokenizer
    generator --> callback
    callback --> stream
    template --> minijinja
    model --> candle
    tokenizer --> shimmytok
    banner --> theme
    loader --> theme
    stream --> theme
```
### Data Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Generator
    participant Template
    participant Model
    participant Tokenizer
    participant Stream

    User->>CLI: Enter prompt
    CLI->>Generator: generate(prompt, config)
    Generator->>Template: apply(messages)
    Template-->>Generator: formatted prompt
    Generator->>Tokenizer: encode(prompt)
    Tokenizer-->>Generator: tokens[]
    Generator->>Model: forward(tokens)
    Model-->>Generator: logits
    loop For each token
        Generator->>Generator: sample(logits)
        Generator->>Tokenizer: decode_next(token)
        Tokenizer-->>Generator: text
        Generator->>Stream: StreamEvent::Token(text)
        Stream->>User: Display token
    end
    Generator->>Stream: StreamEvent::Done
    Stream->>User: Show stats (tok/s)
```
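The loop above can be sketched as a callback-driven API. Everything here is illustrative — `StreamEvent` appears in the diagram, but its exact shape and the `generate` signature are assumptions, and the real sample/decode loop is replaced by a whitespace tokenizer:

```rust
use std::time::Instant;

// Illustrative streaming events matching the diagram (shapes are assumptions).
#[derive(Debug, PartialEq)]
enum StreamEvent {
    Token(String),
    Done { tokens: usize, tok_per_sec: f64 },
}

// Stand-in for the real sample/decode loop: emits each whitespace-separated
// "token" through the callback, then a Done event carrying the tok/s stat.
fn generate(prompt: &str, mut on_event: impl FnMut(StreamEvent)) {
    let start = Instant::now();
    let mut count = 0;
    for word in prompt.split_whitespace() {
        on_event(StreamEvent::Token(word.to_string()));
        count += 1;
    }
    let secs = start.elapsed().as_secs_f64().max(1e-9);
    on_event(StreamEvent::Done { tokens: count, tok_per_sec: count as f64 / secs });
}
```

Pushing events through a callback keeps the inference layer free of any terminal concerns; the TUI layer decides how to display tokens and stats.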
## Development

```sh
# Development build (faster compile)
# Run with model (uses MODEL env var)
# Or override MODEL inline
# Format code
# Run linter
# Clean build artifacts
```
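These likely expand to the standard Cargo commands plus the Makefile's `run` target (the `make run` target and `MODEL` variable are documented above; the rest are standard Cargo subcommands assumed to apply here):

```shell
# Development build (faster compile)
cargo build

# Run with model (uses MODEL env var)
make run

# Or override MODEL inline
MODEL=/Models/phi-3.Q4_K_M.gguf make run

# Format code
cargo fmt

# Run linter
cargo clippy

# Clean build artifacts
cargo clean
```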
## Dependencies

| Crate | Purpose |
|---|---|
| `candle-core` | Tensor operations, ML primitives |
| `candle-nn` | Neural network layers |
| `candle-transformers` | Pre-built model architectures (LLaMA, LFM2) |
| `shimmytok` | GGUF tokenizer (100% llama.cpp compatible) |
| `minijinja` | Jinja2 template engine for chat templates |
| `clap` | CLI argument parsing with derive macros |
| `crossterm` | Cross-platform terminal control |
| `anyhow` | Ergonomic error handling |
| `serde` | Serialization for messages |
| `tracing` | Structured logging |
## Performance

- **CPU-only inference** — No GPU dependencies, portable binaries
- **Quantized models** — Q4_K_M provides a good quality/speed tradeoff; other quantizations supported
- **Streaming decode** — Tokens displayed as generated for responsive UX
- **Context caching** — Efficient multi-turn conversations with token history management
- **Model warmup** — Pre-compiles compute kernels on startup for faster first-token generation
- **Smart defaults** — Temperature 0.3 for factual accuracy; default system prompt reduces hallucinations
## Roadmap
- Multi-modal support
- OpenAI-compatible API server
- Model download/management
## License
MIT License — see LICENSE for details.