The Lightweight OpenAI API Server
๐ Local Inference Without Dependencies ๐
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
๐ Support Shimmy's Growth
๐ If Shimmy helps you, consider sponsoring โ 100% of support goes to keeping it free forever.
- $5/month: Coffee tier โ - Eternal gratitude + sponsor badge
- $25/month: Bug prioritizer ๐ - Priority support + name in SPONSORS.md
- $100/month: Corporate backer ๐ข - Logo placement + monthly office hours
- $500/month: Infrastructure partner ๐ - Direct support + roadmap input
๐ฏ Become a Sponsor | See our amazing sponsors ๐
Drop-in OpenAI API Replacement for Local LLMs
Shimmy is a 4.8MB single-binary that provides 100% OpenAI-compatible endpoints for GGUF models. Point your existing AI tools to Shimmy and they just work โ locally, privately, and free.
Developer Tools
Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.
Try it in 30 seconds
# 1) Install + run
&
# 2) See models and pick one
# 3) Smoke test the OpenAI API
|
๐ Compatible with OpenAI SDKs and Tools
No code changes needed - just change the API endpoint:
- Any OpenAI client: Python, Node.js, curl, etc.
- Development applications: Compatible with standard SDKs
- VSCode Extensions: Point to
http://localhost:11435 - Cursor Editor: Built-in OpenAI compatibility
- Continue.dev: Drop-in model provider
Use with OpenAI SDKs
- Node.js (openai v4)
import OpenAI from "openai";
const openai = new OpenAI({
baseURL: "http://127.0.0.1:11435/v1",
apiKey: "sk-local", // placeholder, Shimmy ignores it
});
const resp = await openai.chat.completions.create({
model: "REPLACE_WITH_MODEL",
messages: [{ role: "user", content: "Say hi in 5 words." }],
max_tokens: 32,
});
console.log(resp.choices[0].message?.content);
- Python (openai>=1.0.0)
=
=
โก Zero Configuration Required
- Automatically finds models from Hugging Face cache, Ollama, local dirs
- Auto-allocates ports to avoid conflicts
- Auto-detects LoRA adapters for specialized models
- Just works - no config files, no setup wizards
๐ง Advanced MOE (Mixture of Experts) Support
Run 70B+ models on consumer hardware with intelligent CPU/GPU hybrid processing:
- ๐ CPU MOE Offloading: Automatically distribute model layers across CPU and GPU
- ๐งฎ Intelligent Layer Placement: Optimizes which layers run where for maximum performance
- ๐พ Memory Efficiency: Fit larger models in limited VRAM by using system RAM strategically
- โก Hybrid Acceleration: Get GPU speed where it matters most, CPU reliability everywhere else
- ๐๏ธ Configurable:
--cpu-moeand--n-cpu-moeflags for fine control
# Enable MOE CPU offloading during installation
# Run with MOE hybrid processing
# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)
Perfect for: Large models (70B+), limited VRAM systems, cost-effective inference
๐ฏ Perfect for Local Development
- Privacy: Your code never leaves your machine
- Cost: No API keys, no per-token billing
- Speed: Local inference, sub-second responses
- Reliability: No rate limits, no downtime
Quick Start (30 seconds)
Installation
๐ช Windows
# RECOMMENDED: Use pre-built binary (no build dependencies required)
# OR: Install from source with MOE support
# First install build dependencies:
# Then install shimmy with MOE:
# For CUDA + MOE hybrid processing:
โ ๏ธ Windows Notes:
- Pre-built binary recommended to avoid build dependency issues
- MSVC compatibility: Uses
shimmy-llama-cpp-2packages for better Windows support- If Windows Defender flags the binary, add an exclusion or use
cargo install- For
cargo install: Install LLVM first to resolvelibclang.dllerrors
๐ macOS / ๐ง Linux
# Install from crates.io
GPU Acceleration
Shimmy supports multiple GPU backends for accelerated inference:
๐ฅ๏ธ Available Backends
| Backend | Hardware | Installation |
|---|---|---|
| CUDA | NVIDIA GPUs | cargo install shimmy --features llama-cuda |
| CUDA + MOE | NVIDIA GPUs + CPU | cargo install shimmy --features llama-cuda,moe |
| Vulkan | Cross-platform GPUs | cargo install shimmy --features llama-vulkan |
| OpenCL | AMD/Intel/Others | cargo install shimmy --features llama-opencl |
| MLX | Apple Silicon | cargo install shimmy --features mlx |
| MOE Hybrid | Any GPU + CPU | cargo install shimmy --features moe |
| All Features | Everything | cargo install shimmy --features gpu,moe |
๐ Check GPU Support
# Show detected GPU backends
โก Usage Notes
- GPU backends are automatically detected at runtime
- Falls back to CPU if GPU is unavailable
- Multiple backends can be compiled in, best one selected automatically
- Use
--gpu-backend <backend>to force specific backend
Get Models
Shimmy auto-discovers models from:
- Hugging Face cache:
~/.cache/huggingface/hub/ - Ollama models:
~/.ollama/models/ - Local directory:
./models/ - Environment:
SHIMMY_BASE_GGUF=path/to/model.gguf
# Download models that work out of the box
Start Server
# Auto-allocates port to avoid conflicts
# Or use manual port
Point your development tools to the displayed port โ VSCode Copilot, Cursor, Continue.dev all work instantly.
๐ฆ Download & Install
Package Managers
- Rust:
cargo install shimmy --features moe(recommended) - Rust (basic):
cargo install shimmy - VS Code: Shimmy Extension
- Windows MSVC: Uses
shimmy-llama-cpp-2packages for better compatibility - npm:
npm install -g shimmy-js(planned) - Python:
pip install shimmy(planned)
Direct Downloads
- GitHub Releases: Latest binaries
- Docker:
docker pull shimmy/shimmy:latest(coming soon)
๐ macOS Support
Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.
# Install dependencies
# Install shimmy
โ Verified working:
- Intel and Apple Silicon Macs
- Metal GPU acceleration (automatic)
- MLX native acceleration for Apple Silicon
- Xcode 17+ compatibility
- All LoRA adapter features
Integration Examples
VSCode Copilot
Continue.dev
Cursor IDE
Works out of the box - just point to http://localhost:11435/v1
Why Shimmy Will Always Be Free
I built Shimmy to retain privacy-first control on my AI development and keep things local and lean.
This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.
๐ก Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month โ less than your Netflix subscription, infinitely more useful for developers.
API Reference
Endpoints
GET /health- Health checkPOST /v1/chat/completions- OpenAI-compatible chatGET /v1/models- List available modelsPOST /api/generate- Shimmy native APIGET /ws/generate- WebSocket streaming
CLI Commands
Technical Architecture
- Rust + Tokio: Memory-safe, async performance
- llama.cpp backend: Industry-standard GGUF inference
- OpenAI API compatibility: Drop-in replacement
- Dynamic port management: Zero conflicts, auto-allocation
- Zero-config auto-discovery: Just worksโข
๐ Advanced Features
- ๐ง MOE CPU Offloading: Hybrid GPU/CPU processing for large models (70B+)
- ๐ฏ Smart Model Filtering: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
- ๐ก๏ธ 6-Gate Release Validation: Constitutional quality limits ensure reliability
- โก Smart Model Preloading: Background loading with usage tracking for instant model switching
- ๐พ Response Caching: LRU + TTL cache delivering 20-40% performance gains on repeat queries
- ๐ Integration Templates: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
- ๐ Request Routing: Multi-instance support with health checking and load balancing
- ๐ Advanced Observability: Real-time metrics with self-optimization and Prometheus integration
- ๐ RustChain Integration: Universal workflow transpilation with workflow orchestration
Community & Support
- ๐ Bug Reports: GitHub Issues
- ๐ฌ Discussions: GitHub Discussions
- ๐ Documentation: docs/ โข Engineering Methodology โข OpenAI Compatibility Matrix โข Benchmarks (Reproducible)
- ๐ Sponsorship: GitHub Sponsors
Star History
๐ Momentum Snapshot
๐ฆ Sub-5MB single binary (142x smaller than Ollama)
๐ stars and climbing fast
โฑ <1s startup
๐ฆ 100% Rust, no Python
๐ฐ As Featured On
๐ฅ Hacker News โข Front Page Again โข IPE Newsletter
Companies: Need invoicing? Email michaelallenkuykendall@gmail.com
โก Performance Comparison
| Tool | Binary Size | Startup Time | Memory Usage | OpenAI API |
|---|---|---|---|---|
| Shimmy | 4.8MB | <100ms | 50MB | 100% |
| Ollama | 680MB | 5-10s | 200MB+ | Partial |
| llama.cpp | 89MB | 1-2s | 100MB | Via llama-server |
Quality & Reliability
Shimmy maintains high code quality through comprehensive testing:
- Comprehensive test suite with property-based testing
- Automated CI/CD pipeline with quality gates
- Runtime invariant checking for critical operations
- Cross-platform compatibility testing
Development Testing
Run the complete test suite:
# Using cargo aliases
# Using Makefile
See our testing approach for technical details.
License & Philosophy
MIT License - forever and always.
Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.
Testing Philosophy: Reliability through comprehensive validation and property-based testing.
Forever maintainer: Michael A. Kuykendall Promise: This will never become a paid product Mission: Making local model inference simple and reliable