<div align="center">
<img src="assets/shimmy-logo.png" alt="Shimmy Logo" width="300" height="auto" />
# The Lightweight OpenAI API Server
### Local Inference Without Dependencies
**Languages:** [简体中文](docs/zh-CN/README.md) · [繁體中文](docs/zh-TW/README.md)
</div>
**Shimmy will be free forever.** No asterisks. No "free for now." No pivot to paid.
### Support Shimmy's Growth
**If Shimmy helps you, consider [sponsoring](https://github.com/sponsors/Michael-A-Kuykendall). 100% of support goes to keeping it free forever.**
- **$5/month**: Coffee tier - Eternal gratitude + sponsor badge
- **$25/month**: Bug prioritizer - Priority support + name in [SPONSORS.md](SPONSORS.md)
- **$100/month**: Corporate backer - Logo placement + monthly office hours
- **$500/month**: Infrastructure partner - Direct support + roadmap input
---
## Table of Contents
- [What Is Shimmy?](#drop-in-openai-api-replacement-for-local-llms)
- [Airframe Engine (v2.0)](#airframe-engine)
- [Supported Models](#supported-models)
- [Migrating from v1.x](#migrating-from-v1x)
- [Quick Start (30 seconds)](#quick-start-30-seconds)
- [OpenAI SDK Compatibility](#compatible-with-openai-sdks-and-tools)
- [Extended Context](#extended-context)
- [Download & Install](#download--install)
- [Integration Examples](#integration-examples)
- [API Reference](#api-reference)
- [FAQ](#faq)
- [Technical Architecture](#technical-architecture)
- [Documentation Hub](#documentation-hub)
- [Community & Support](#community--support)
- [Performance](#performance-comparison)
- [License](#license--philosophy)
---
## Drop-in OpenAI API Replacement for Local LLMs
Shimmy is a **single binary** that provides **100% OpenAI-compatible endpoints** for GGUF models. Point your existing AI tools to Shimmy and they just work: locally, privately, and free.
**NEW in v2.0.0**: Shimmy now runs on [Airframe](#airframe-engine), a pure-Rust WGSL GPU engine. No C++ toolchain, no backend flags, no compilation required.
## Airframe Engine
Starting in v2.0.0, Shimmy's default inference engine is **Airframe**, a pure-Rust WebGPU (WGSL) transformer runtime built from scratch.
**Why this matters:**
- No C++ toolchain required: Rust only, top to bottom
- F32 precision throughout for deterministic, high-quality output
- WGSL compute shaders work on any GPU via WebGPU (NVIDIA, AMD, Intel, integrated)
- Model spec auto-derived from GGUF metadata, with no hardcoded per-model constants
- YaRN RoPE scaling for extended context via `SHIMMY_MAX_CTX`: the engine allocates the KV cache and sets the RoPE scale automatically (see [Extended Context](#extended-context) below)
**Quick start with Airframe (v2.0.0+):**
```bash
# Default: 2048-token context
SHIMMY_BASE_GGUF=/path/to/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf ./shimmy serve
# Extended context (4096 tokens; YaRN RoPE enabled automatically, KV cache resized)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 ./shimmy serve
```
## Supported Models
Airframe v2.0 ships with GPU-verified support across **7 model architectures** and **5 quantization types**, covering the models most commonly run on consumer hardware. Context window is read directly from each model's GGUF metadata, with no hardcoded limits.
### Locally Validated (GPU Math Verified)

| Model | Architecture | Quantization | File Size | Native Context | Est. VRAM |
|---|---|---|---|---|---|
| [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) | Llama | Q4_0 | 638 MB | 2048 | ~800 MB |
| [Llama-3.2-1B-Instruct](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) | Llama | Q4_K_M | 770 MB | 131072* | ~1 GB |
| [Llama-3.2-3B-Instruct](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) | Llama | Q4_K_M | 1.9 GB | 131072* | ~2.5 GB |
| [phi-2](https://huggingface.co/TheBloke/phi-2-GGUF) | Phi-2 | Q4_K_M | 1.7 GB | 2048 | ~2.2 GB |
| [gemma-2-2b-it](https://huggingface.co/bartowski/gemma-2-2b-it-GGUF) | Gemma-2 | Q4_K_M | 1.6 GB | 8192 | ~2 GB |
| [starcoder2-3b](https://huggingface.co/second-state/StarCoder2-3B-GGUF) | StarCoder2 | Q4_K_M | 1.8 GB | 16384 | ~2.3 GB |
| [gpt2](https://huggingface.co/ggerganov/ggml/blob/main/gpt-2-117M-q4_0.bin) | GPT-2 | Q4_K_M | 107 MB | 1024 | ~200 MB |
> \* Llama-3.2's native context is 131072 tokens. Airframe reads this from GGUF and allocates KV cache accordingly. Use `SHIMMY_MAX_CTX=8192` for a practical 8K window on consumer hardware (~512 MB of F32 KV cache for the 1B model, four times the ~128 MB needed at 2048 tokens).
**GPU Math Verified** means the Airframe GPU dequantization shader produces results matching the CPU reference implementation, independently confirmed for every tensor type in each model. This is done via `quant_verify`, which tests 512 elements per quantization type per model.
### Roadmap: Larger Models (Require ≥16 GB VRAM)

| Model | Architecture | Quantization | File Size | Status |
|---|---|---|---|---|
| deepseek-coder-6.7b-instruct | Llama | Q4_K_M | 3.9 GB | Pending remote GPU validation |
| deepseek-llm-7b-chat | Llama | Q4_K_M | 4.0 GB | Pending remote GPU validation |
| qwen2-7b-instruct | Qwen2 | Q4_K_M | 4.5 GB | Pending remote GPU validation |
| Phi-3.5-mini-instruct | Phi-3 | Q4_K_M | 2.3 GB | Requires fused QKV support (planned) |
### Supported Quantization Types

| Type | GGML Type ID | Description |
|---|---|---|
| `F32` | 0 | Raw floats, maximum precision |
| `F16` | 1 | Half-precision floats |
| `Q4_0` | 2 | 4-bit, 32-element blocks |
| `Q8_0` | 8 | 8-bit, 32-element blocks |
| `Q4_K` | 12 | 4-bit K-quant superblocks (256 elements), used in Q4_K_M GGUFs |
| `Q5_K` | 13 | 5-bit K-quant superblocks, used alongside Q4_K in mixed-precision models |
| `Q6_K` | 14 | 6-bit K-quant superblocks, typically used for output/embedding layers |

All types are implemented in both the GPU inference shader and a CPU reference implementation. GPU vs CPU agreement is validated for every type.
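
To make the idea of a CPU reference concrete, here is a minimal Python sketch of dequantizing one `Q4_0` block, assuming the standard GGML layout (a little-endian float16 scale followed by 16 bytes holding 32 packed 4-bit values). It is an illustration only, not Airframe's code or the `quant_verify` tool; the function names are hypothetical.

```python
import struct

QK4_0 = 32  # elements per Q4_0 block


def dequantize_q4_0_block(block: bytes) -> list:
    """Decode one 18-byte Q4_0 block into 32 floats (assumed GGML layout)."""
    if len(block) != 2 + QK4_0 // 2:
        raise ValueError("a Q4_0 block is 2 scale bytes + 16 data bytes")
    (scale,) = struct.unpack("<e", block[:2])  # little-endian float16 scale
    out = [0.0] * QK4_0
    for i, byte in enumerate(block[2:]):
        # Low nibble holds element i, high nibble holds element i + 16.
        out[i] = ((byte & 0x0F) - 8) * scale
        out[i + QK4_0 // 2] = ((byte >> 4) - 8) * scale
    return out


def max_abs_diff(a, b):
    """Largest element-wise difference, e.g. between GPU and CPU results."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A verification pass in this style would decode the same blocks on both paths and check that `max_abs_diff` stays within a chosen tolerance.
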
**Auto-discovery is enabled.** If Shimmy finds GGUF models in your HuggingFace cache, Ollama directory, LM Studio cache (`~/.cache/lm-studio/models`), or local `./models/` folder, it will register and serve them automatically. See [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md) for the full compatibility matrix.
## Migrating from v1.x
The llama.cpp backend is **removed in v2.0.0**. The Airframe engine is the only inference path.
See [docs/MIGRATION_v2.md](docs/MIGRATION_v2.md) for the step-by-step migration guide.
## Developer Tools
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
### Try it in 30 seconds
```bash
# 1) Download pre-built binary
# Windows (Git Bash shown; in cmd.exe run `set "SHIMMY_BASE_GGUF=C:\path\to\model.gguf"` first):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
SHIMMY_BASE_GGUF="C:/path/to/model.gguf" ./shimmy.exe serve --bind 127.0.0.1:11435 &
# Linux x86_64 (on macOS, download shimmy-macos-arm64 or shimmy-macos-intel instead):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
SHIMMY_BASE_GGUF=/path/to/model.gguf ./shimmy serve --bind 127.0.0.1:11435 &
# 2) See registered models
./shimmy list
# 3) Smoke test the OpenAI API (replace the model name with one printed by `shimmy list`)
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"tinyllama-1.1b",
        "messages":[{"role":"user","content":"Say hi in 5 words."}],
        "max_tokens":32
      }' | jq -r '.choices[0].message.content'
```
## Compatible with OpenAI SDKs and Tools
**No code changes needed** - just change the API endpoint:
- **Any OpenAI client**: Python, Node.js, curl, etc.
- **Development applications**: Compatible with standard SDKs
- **VSCode Extensions**: Point to `http://localhost:11435`
- **Cursor Editor**: Built-in OpenAI compatibility
- **Continue.dev**: Drop-in model provider
### Use with OpenAI SDKs
- Node.js (openai v4)
```ts
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
```
- Python (openai>=1.0.0)
```python
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")
resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```
## Zero Configuration Required
- **Automatically finds models** from Hugging Face cache, Ollama, LM Studio (`~/.cache/lm-studio/models`), and local dirs
- **Auto-allocates ports** to avoid conflicts
- **Auto-detects LoRA adapters** for specialized models
- **Just works** - no config files, no setup wizards
## Advanced MoE (Mixture of Experts) Support
> **Note**: MoE (Mixture of Experts) CPU offloading is on the Airframe roadmap. See [docs/AIRFRAME_MOE_ROADMAP.md](docs/AIRFRAME_MOE_ROADMAP.md) for the implementation plan.

**Run 70B+ models on consumer hardware**: coming to the Airframe engine. Track progress on the [roadmap](docs/ROADMAP.md).
**Perfect for**: Large models (70B+), limited VRAM systems, cost-effective inference
## Perfect for Local Development
- **Privacy**: Your code never leaves your machine
- **Cost**: No API keys, no per-token billing
- **Speed**: Local inference, sub-second responses
- **Reliability**: No rate limits, no downtime
## Quick Start (30 seconds)
### Installation
**v2.0.0**: Download pre-built binaries with Airframe WebGPU engine included!
#### **Pre-Built Binaries (Recommended, Zero Dependencies)**
Pick your platform and download. No compilation needed, GPU acceleration included:
```bash
# Windows x64 (Airframe WebGPU engine)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
# Linux x86_64 (Airframe WebGPU engine)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
# macOS ARM64 (Airframe with Metal backend via wgpu)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
# macOS Intel
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy
# Linux ARM64 (huggingface engine; Airframe cross-compilation not yet supported)
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
```
**That's it!** The Airframe WebGPU adapter is selected automatically at runtime.
#### **Build from Source / cargo install**
```bash
# Install from crates.io (huggingface engine; works without GPU)
cargo install shimmy
# Build from source with Airframe GPU engine (requires airframe submodule)
git clone https://github.com/Michael-A-Kuykendall/shimmy --recurse-submodules
cd shimmy
cargo build --release --features airframe,huggingface
```
> **Note**: The GitHub Releases binaries already include the Airframe engine. Building from source with `--features airframe` is for contributors or custom builds.
### GPU Acceleration
**v2.0.0**: Airframe uses **WebGPU (wgpu)** for GPU acceleration. No backend flags, no driver installation beyond standard OS graphics drivers.
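#### **Download Pre-Built Binaries (Recommended)**
Release binaries include the Airframe engine with WebGPU support compiled in, except Linux ARM64, which ships the huggingface engine only:

| Platform | Binary | GPU Backend | Hardware |
|---|---|---|---|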
| **Windows x64** | [shimmy-windows-x86_64.exe](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe) | WebGPU (wgpu) | NVIDIA, AMD, Intel |
| **Linux x86_64** | [shimmy-linux-x86_64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64) | WebGPU (wgpu) | NVIDIA, AMD, Intel |
| **macOS ARM64** | [shimmy-macos-arm64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64) | Metal (via wgpu) | Apple Silicon |
| **macOS Intel** | [shimmy-macos-intel](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel) | Metal (via wgpu) | Intel Mac |
| **Linux ARM64** | [shimmy-linux-aarch64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64) | huggingface only | ARM cross-build |
#### **How GPU Selection Works**
Airframe uses wgpu's adapter enumeration. On first launch it selects the best available GPU adapter for your system: discrete GPU preferred over integrated, integrated over CPU fallback. No configuration needed.
```bash
# Check selected adapter
shimmy gpu-info
# Start serving (GPU adapter auto-selected)
shimmy serve
```
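
The preference order described above can be sketched in a few lines of Python. This is only an illustration of the ranking rule (discrete, then integrated, then CPU); the `Adapter` type, the `pick_adapter` function, and the adapter names are hypothetical and are not part of wgpu or Shimmy.

```python
from dataclasses import dataclass

# Lower rank means more preferred: discrete > integrated > CPU fallback.
_PREFERENCE = {"discrete": 0, "integrated": 1, "cpu": 2}


@dataclass(frozen=True)
class Adapter:
    name: str
    kind: str  # "discrete", "integrated", or "cpu"


def pick_adapter(adapters):
    """Return the most preferred adapter, or None if nothing usable is listed."""
    known = [a for a in adapters if a.kind in _PREFERENCE]
    if not known:
        return None
    # min() keeps the first adapter among equals, so enumeration order breaks ties.
    return min(known, key=lambda a: _PREFERENCE[a.kind])
```

For example, given an integrated GPU, a CPU fallback, and a discrete GPU, `pick_adapter` returns the discrete one regardless of the order in which they were enumerated.
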
#### **Extended Context**
`SHIMMY_MAX_CTX` overrides the context window at the engine level. When set above the model's native window, Airframe automatically engages YaRN RoPE scaling and resizes the KV cache accordingly.
```bash
# 4096-token context with YaRN (2x native window for TinyLlama)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 shimmy serve
# 8192 tokens (4x native, higher RoPE compression)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=8192 shimmy serve
```
> **Note:** Extended context beyond 4096 is functional but not yet as deeply validated as the native 2048-token window. The accepted range is 512 to 131072. Values outside that range are silently ignored and 2048 is used.
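
The rules in this section can be summarized as a small Python sketch. It assumes that the scale factor is simply the requested context divided by the native context (matching the "2x" and "4x" comments above), that an unset variable means the GGUF value is used, and that an unparsable value is treated like an out-of-range one. The function name is hypothetical and the real engine may differ.

```python
MIN_CTX, MAX_CTX, FALLBACK_CTX = 512, 131072, 2048


def resolve_context(raw, native_ctx):
    """Return (context_length, scale) for a SHIMMY_MAX_CTX-style value.

    raw is the environment value as a string, or None when unset.
    """
    if raw is None:
        ctx = native_ctx  # assumption: unset means "use the GGUF value"
    else:
        try:
            ctx = int(raw)
        except ValueError:
            ctx = FALLBACK_CTX  # assumption: treated like an out-of-range value
        if not MIN_CTX <= ctx <= MAX_CTX:
            ctx = FALLBACK_CTX  # outside the accepted range: 2048 is used
    # Scaling only applies when the window exceeds the model's native one.
    scale = ctx / native_ctx if ctx > native_ctx else 1.0
    return ctx, scale
```

With a native window of 2048, `resolve_context("4096", 2048)` gives a scale of 2.0 and `resolve_context("100", 2048)` falls back to 2048 with no scaling.
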
#### **VRAM Sizing Reference**
Airframe allocates VRAM at load time: **weights** + **KV cache**. The KV cache is F32 and scales linearly with context length (`n_layers × n_kv_heads × head_dim × ctx × 2 × 4 bytes`).
**TinyLlama 1.1B Q4_0, the v2.0 validated path:**

| Context (tokens) | KV Cache | Weights | Total | Recommended VRAM |
|---|---|---|---|---|
| 2048 (default) | ~88 MB | ~638 MB | ~726 MB | **~800 MB** |
| 4096 | ~176 MB | ~638 MB | ~814 MB | **~900 MB** |
| 8192 | ~352 MB | ~638 MB | ~990 MB | **~1.1 GB** |
| 16384 | ~704 MB | ~638 MB | ~1.3 GB | **~1.5 GB** |
> Integrated graphics (Intel Iris, Apple M-series unified memory, AMD Vega) running at 2048 context need ~800 MB, comfortably inside the 2 GB allocation most integrated GPUs share with system RAM.

**Scaling up to larger models** (architecture and quant support required; see [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md)):

| Model | Quantization | Weights | KV Cache (2048 ctx) | Est. VRAM |
|---|---|---|---|---|
| Llama 3.2 1B | Q4_0 | ~636 MB | ~128 MB | ~900 MB |
| Llama 3.2 3B | Q4_0 | ~1.9 GB | ~448 MB | ~2.5 GB |
| Mistral 7B | Q4_K_M | ~4.1 GB | ~512 MB | ~5 GB |
| Llama 3 8B | Q4_K_M | ~4.7 GB | ~512 MB | ~5.5 GB |
The KV cache formula for any model: `n_layers × n_kv_heads × head_dim × ctx × 2 × 4 bytes`. To size an extended context, multiply the 2048-token figure by `SHIMMY_MAX_CTX / 2048`.
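
As a worked example of the formula, the Python sketch below computes the KV cache size for several context lengths. The TinyLlama shape used here (22 layers, 4 KV heads, head dimension 64) is an assumption taken from the public model configuration, not from this README; it reproduces the ~88 MB figure for 2048 tokens.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_value=4):
    """KV cache size: K and V tensors (the factor 2) for every layer, F32 by default."""
    return n_layers * n_kv_heads * head_dim * ctx * 2 * bytes_per_value


def to_mib(n_bytes):
    return n_bytes / (1024 * 1024)


# Assumed TinyLlama 1.1B shape; check your model's GGUF metadata for real values.
TINYLLAMA = {"n_layers": 22, "n_kv_heads": 4, "head_dim": 64}

for ctx in (2048, 4096, 8192, 16384):
    size = to_mib(kv_cache_bytes(ctx=ctx, **TINYLLAMA))
    print(f"ctx={ctx:>6}: {size:.0f} MiB")
```

Because the size is linear in `ctx`, doubling the context doubles the allocation, which is why the table rows step from 88 to 176 to 352 to 704.
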
### Get Models
Shimmy auto-discovers models from:
- **Hugging Face cache**: `~/.cache/huggingface/hub/`
- **Ollama models**: `~/.ollama/models/`
- **Local directory**: `./models/`
- **Environment**: `SHIMMY_BASE_GGUF=path/to/model.gguf`
```bash
# Primary validated model for Airframe v2.0
huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
--include "tinyllama-1.1b-chat-v1.0.Q4_0.gguf" --local-dir ./models/
# Alternative 1B โ also fits in the same hardware envelope
huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
--include "*Q4_K_M*" --local-dir ./models/
```
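
For readers curious what directory-based discovery might look like, here is a simplified Python sketch that walks the locations listed above plus any colon-separated entries in `SHIMMY_MODEL_PATHS`. It only matches files ending in `.gguf`, so it is not Shimmy's actual logic: Ollama, for instance, stores model blobs without that extension. Both function names are hypothetical.

```python
import os
from pathlib import Path


def default_search_dirs(env=os.environ):
    """Built-in locations plus extra colon-separated paths from the environment."""
    dirs = ["~/.cache/huggingface/hub", "~/.ollama/models", "./models"]
    extra = env.get("SHIMMY_MODEL_PATHS", "")
    dirs += [p for p in extra.split(":") if p]
    return dirs


def discover_gguf(search_dirs):
    """Return sorted, de-duplicated paths of *.gguf files under the given directories."""
    found = set()
    for directory in search_dirs:
        root = Path(directory).expanduser()
        if not root.is_dir():
            continue  # caches that do not exist are simply skipped
        found.update(p.resolve() for p in root.rglob("*.gguf") if p.is_file())
    return sorted(found)
```

Calling `discover_gguf(default_search_dirs())` after the downloads above should list the files placed under `./models/`.
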
### Start Server
```bash
# Auto-allocates port to avoid conflicts
shimmy serve
# Or use manual port
shimmy serve --bind 127.0.0.1:11435
```
Point your development tools to the displayed port. VSCode Copilot, Cursor, and Continue.dev all work instantly.
## Download & Install
### Package Managers
- **Rust**: [`cargo install shimmy`](https://crates.io/crates/shimmy) *(installs huggingface engine; for Airframe GPU, use GitHub Releases binaries)*
- **VS Code**: [Shimmy Extension](https://marketplace.visualstudio.com/items?itemName=targetedwebresults.shimmy-vscode)
- **npm**: `npm install -g shimmy-js` *(planned)*
- **Python**: `pip install shimmy` *(planned)*
### Direct Downloads
- **GitHub Releases**: [Latest binaries](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest)
- **Docker**: `docker pull shimmy/shimmy:latest` *(coming soon)*
### macOS Support
**Full compatibility confirmed!** Shimmy works on macOS with Metal GPU acceleration via wgpu.
```bash
# Install from crates.io (huggingface engine)
cargo install shimmy
# For Airframe GPU engine, download the macOS binary from GitHub Releases:
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
```
**Verified working:**
- Intel and Apple Silicon Macs
- Metal GPU acceleration via wgpu (automatic on Apple Silicon)
- Xcode 17+ compatibility
## Integration Examples
### VSCode Copilot
```json
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
```
### Continue.dev
```json
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
```
### Cursor IDE
Works out of the box - just point to `http://localhost:11435/v1`
## Why Shimmy Will Always Be Free
I built Shimmy to keep privacy-first control over my AI development and to keep things local and lean.
**This is my commitment**: Shimmy stays MIT licensed, forever. If you want to support development, [sponsor it](https://github.com/sponsors/Michael-A-Kuykendall). If you don't, just build something cool with it.
> **Shimmy saves you time and money. If it's useful, consider [sponsoring for $5/month](https://github.com/sponsors/Michael-A-Kuykendall): less than your Netflix subscription, infinitely more useful for developers.**
## API Reference
### Endpoints
- `GET /health` - Health check
- `POST /v1/chat/completions` - OpenAI-compatible chat (streaming supported)
- `POST /v1/completions` - OpenAI-compatible text completions
- `GET /v1/models` - List available models
- `POST /api/generate` - Shimmy native API
- `GET /ws/generate` - WebSocket streaming
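
If you would rather not install an SDK, the endpoints above can be called with the Python standard library alone. The sketch below builds requests for `/health` and `/v1/chat/completions`; the base URL assumes the server was started with `--bind 127.0.0.1:11435`, and the helper names are illustrative rather than part of any Shimmy client.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11435"  # assumption: adjust to the address you bound


def health_request(base_url=BASE_URL):
    """Build a GET request for the health check endpoint."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def chat_request(model, prompt, max_tokens=32, base_url=BASE_URL):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a request to a running server and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With a server running, `send(chat_request("REPLACE_WITH_MODEL", "Say hi in 5 words."))["choices"][0]["message"]["content"]` returns the reply text.
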
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `SHIMMY_BASE_GGUF` | *(auto-discover)* | Path to GGUF model file loaded as the default model |
| `SHIMMY_PORT` | `8080` | Port to listen on (Airframe server binary) |
| `SHIMMY_BIND_ADDRESS` | `0.0.0.0:8080` | Full bind address (overrides port) |
| `SHIMMY_MAX_CTX` | *(from GGUF)* | Override context window; activates YaRN RoPE scaling when above model native |
| `SHIMMY_MODEL_PATHS` | *(see Zero Config)* | Colon-separated extra model search paths |
| `SHIMMY_ENGINE_BACKEND` | `airframe` | `airframe` (default) or `llama` (legacy path) |
| `SHIMMY_ROPE_SCALE` | *(auto)* | Override computed YaRN scale factor |
| `RUST_BACKTRACE` | *(off)* | Set to `1` to print crash backtraces |
### CLI Commands
```bash
shimmy serve # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080 # Manual port binding
shimmy serve --gpu-backend auto # WebGPU adapter auto-select (default)
shimmy serve --gpu-backend cpu # Force CPU (disable GPU)
shimmy list # Show available models
shimmy discover # Refresh model discovery
shimmy generate --name X --prompt "Hi" # Test generation
shimmy probe model-name # Verify model loads
shimmy gpu-info # Show selected WebGPU adapter
```
## Technical Architecture
- **Rust + Tokio**: Memory-safe, async performance
- **Airframe engine**: Pure-Rust WGSL GPU inference with no C++ toolchain, deterministic output, GGUF-native
- **OpenAI API compatibility**: Drop-in replacement
- **Dynamic port management**: Zero conflicts, auto-allocation
- **Zero-config auto-discovery**: Just works™
### Advanced Features
- **MoE CPU Offloading** *(on the roadmap)*: Hybrid GPU/CPU processing for large models (70B+)
- **Smart Model Filtering**: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
- **6-Gate Release Validation**: Constitutional quality limits ensure reliability
- **Smart Model Preloading**: Background loading with usage tracking for instant model switching
- **Response Caching**: LRU + TTL cache delivering 20-40% performance gains on repeat queries
- **Integration Templates**: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
- **Request Routing**: Multi-instance support with health checking and load balancing
- **Advanced Observability**: Real-time metrics with self-optimization and Prometheus integration
- **RustChain Integration**: Universal workflow transpilation with workflow orchestration
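
To illustrate what an "LRU + TTL" response cache means in practice, here is a generic Python sketch of the data structure: entries expire after a fixed lifetime, and the least recently used entry is evicted when the cache is full. It is not Shimmy's implementation (which is written in Rust); the class name and defaults are hypothetical.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    """Bounded cache with least-recently-used eviction and per-entry expiry."""

    def __init__(self, max_entries=128, ttl_seconds=60.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used
```

A server could key such a cache on a hash of the model name and request body, so that an identical repeat query is answered without running inference again.
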
---
## FAQ
**Does Shimmy work on my GPU?**
Shimmy uses WebGPU (via the Airframe engine), which runs on Vulkan, D3D12, and Metal, covering NVIDIA, AMD, Intel, and Apple Silicon. No CUDA required. See [GPU requirements in TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) if you hit adapter errors.
**What's the difference between Shimmy and llama.cpp / Ollama?**
Shimmy is written in pure Rust with no C++ toolchain dependency. The Airframe engine runs WGSL compute shaders compiled at startup: no pre-built binaries, no driver version pinning. The result is faster startup, lower memory overhead, and deterministic output. See the [GPU Pipeline doc](docs/GPU_PIPELINE.md) for internals.
**Why do I need `SHIMMY_BASE_GGUF` or `LIBSHIMMY_MODEL_PATH`?**
If you don't set these, Shimmy auto-discovers models in standard directories (`~/.cache/huggingface`, `~/.ollama`, `~/lm-studio/models`, `~/.cache/lm-studio/models`, `~/Library/Application Support/LMStudio`). Set `SHIMMY_BASE_GGUF` to override and point directly at a specific GGUF file.
**Can I run multiple models at once?**
Not currently. Shimmy loads one model per server instance. To serve multiple models, run multiple server instances on different ports. Hot-swapping models (reload without restart) is on the roadmap.
**Why does generation stop before `max_tokens`?**
The model reached a natural end-of-sequence token. For chat models this is expected behavior: the model signals it's done. If you want to force longer output, increase `max_tokens` and set `temperature > 0`. If generation stops on the wrong token, the model may be using the wrong chat template; see [CHAT_TEMPLATES.md](docs/CHAT_TEMPLATES.md).
**Is there streaming support?**
Set `"stream": true` in your request. Shimmy returns Server-Sent Events in the standard OpenAI streaming format.
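
As a rough illustration of consuming that stream without an SDK, the Python sketch below extracts text fragments from OpenAI-style Server-Sent Events lines. It assumes the usual chunk shape (`choices[].delta.content`) and the `[DONE]` terminator; the function name is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Passing the line iterator of an open HTTP response to `iter_stream_text` and joining the fragments reconstructs the full reply as it arrives.
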
**Q4_K_M vs Q4_0: which should I use?**
`Q4_K_M` (K-quant) is consistently better quality than `Q4_0` for the same file size. Use `Q4_0` only when you need maximum compatibility or the model isn't available in K-quant. See [QUANTIZATION.md](docs/QUANTIZATION.md) for the full analysis.
**Can I extend the context window beyond what the model was trained on?**
Yes. Set `SHIMMY_MAX_CTX` to a value within the accepted range (512 to 131072). Airframe applies YaRN scaling automatically when the requested context exceeds the model's native context. Quality degrades gradually beyond 2× the native context. See [EXTENDED_CONTEXT.md](docs/EXTENDED_CONTEXT.md).
---
## Documentation Hub
Full documentation lives in [docs/](docs/). Use these tables to find what you need:
### Getting Started

| Document | Description |
|---|---|
| [docs/quickstart.md](docs/quickstart.md) | 5-minute getting started guide |
| [docs/MIGRATION_v2.md](docs/MIGRATION_v2.md) | Migrating from Shimmy v1.x |
| [docs/CONFIGURATION.md](docs/CONFIGURATION.md) | All environment variables and config options |
| [docs/WINDOWS_GPU_BUILD_GUIDE.md](docs/WINDOWS_GPU_BUILD_GUIDE.md) | Windows-specific build instructions |
### API & Integration

| Document | Description |
|---|---|
| [docs/API.md](docs/API.md) | Complete API endpoint reference |
| [docs/OPENAI_COMPAT.md](docs/OPENAI_COMPAT.md) | OpenAI compatibility matrix: what's supported |
| [docs/INTEGRATION.md](docs/INTEGRATION.md) | Integrating with LangChain, OpenAI SDKs, etc. |
| [docs/EXAMPLES.md](docs/EXAMPLES.md) | Runnable code examples |
| [docs/CROSS_COMPILATION.md](docs/CROSS_COMPILATION.md) | Building for other targets (ARM, Linux from Windows) |
### Engine Deep Dives

| Document | Description |
|---|---|
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | System-level architecture and component map |
| [docs/GPU_PIPELINE.md](docs/GPU_PIPELINE.md) | Bindless GPU architecture, WGSL shaders, dispatch patterns |
| [docs/QUANTIZATION.md](docs/QUANTIZATION.md) | Q4_0, Q8_0, K-quant formats: bit-level internals |
| [docs/EXTENDED_CONTEXT.md](docs/EXTENDED_CONTEXT.md) | YaRN RoPE scaling, VRAM math, context extension |
| [docs/CHAT_TEMPLATES.md](docs/CHAT_TEMPLATES.md) | Chat template auto-detection and format reference |
| [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md) | Model onboarding protocol and acceptance gates |
### Troubleshooting & Reference

| Document | Description |
|---|---|
| [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) | Diagnostic guide for GPU errors, model failures, port conflicts |
| [docs/PERFORMANCE.md](docs/PERFORMANCE.md) | Performance tuning and token/sec benchmarks |
| [docs/FEATURES.md](docs/FEATURES.md) | Complete feature list |
### Development & Methodology

| Document | Description |
|---|---|
| [docs/METHODOLOGY.md](docs/METHODOLOGY.md) | Engineering methodology and quality standards |
| [docs/REGRESSION_TESTING.md](docs/REGRESSION_TESTING.md) | Regression testing approach |
| [docs/ppt-invariant-testing.md](docs/ppt-invariant-testing.md) | Property-based and invariant testing details |
| [docs/METRICS.md](docs/METRICS.md) | Observability and metrics reference |
---
## Community & Support
- **Bug Reports**: [GitHub Issues](https://github.com/Michael-A-Kuykendall/shimmy/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Michael-A-Kuykendall/shimmy/discussions)
- **Documentation**: [Full Documentation Hub](#documentation-hub) • [Migration Guide v1→v2](docs/MIGRATION_v2.md) • [Engineering Methodology](docs/METHODOLOGY.md) • [OpenAI Compatibility Matrix](docs/OPENAI_COMPAT.md) • [Architecture](docs/ARCHITECTURE.md) • [GPU Pipeline](docs/GPU_PIPELINE.md) • [Troubleshooting](docs/TROUBLESHOOTING.md)
- **Sponsorship**: [GitHub Sponsors](https://github.com/sponsors/Michael-A-Kuykendall)
### Star History
[Star history chart](https://www.star-history.com/#Michael-A-Kuykendall/shimmy&Timeline)
### Momentum Snapshot
- **Stars climbing fast**
- **<1s startup**
- **100% Rust, no Python**
### As Featured On
[**Hacker News**](https://news.ycombinator.com/item?id=45130322) • [**Front Page Again**](https://news.ycombinator.com/item?id=45199898) • [**IPE Newsletter**](https://ipenewsletter.substack.com/p/the-strange-new-side-hustles-of-openai)
**Companies**: Need invoicing? Email [michaelallenkuykendall@gmail.com](mailto:michaelallenkuykendall@gmail.com)
## Performance Comparison

| Tool | Startup Time | Memory Usage | OpenAI API Compatibility |
|---|---|---|---|
| **Shimmy** | **<100ms** | **50MB** | **100%** |
| Ollama | 5-10s | 200MB+ | Partial |
## Quality & Reliability
Shimmy maintains high code quality through comprehensive testing:
- **Comprehensive test suite** with property-based testing
- **Automated CI/CD pipeline** with quality gates
- **Runtime invariant checking** for critical operations
- **Cross-platform compatibility testing**
### Development Testing
Run the complete test suite:
```bash
# Using cargo aliases
cargo test-quick # Quick development tests
# Using Makefile
make test # Full test suite
make test-quick # Quick development tests
```
See our [testing approach](docs/ppt-invariant-testing.md) for technical details.
---
## License & Philosophy
MIT License - forever and always.
**Philosophy**: Infrastructure should be invisible. Shimmy is infrastructure.
**Testing Philosophy**: Reliability through comprehensive validation and property-based testing.
---
**Forever maintainer**: Michael A. Kuykendall
**Promise**: This will never become a paid product
**Mission**: Making local model inference simple and reliable