Shimmy - The 5MB Alternative to Ollama
Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.
Fast, reliable local AI inference. Shimmy provides OpenAI-compatible endpoints for GGUF models with comprehensive testing and automated quality assurance.
What is Shimmy?
Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It's designed to be the invisible infrastructure that just works.
| Metric | Shimmy | Ollama |
|---|---|---|
| Binary Size | 5.1MB 🏆 | 680MB |
| Startup Time | <100ms 🏆 | 5-10s |
| Memory Overhead | <50MB 🏆 | 200MB+ |
| OpenAI Compatibility | 100% 🏆 | Partial |
| Port Management | Auto 🏆 | Manual |
| Configuration | Zero 🏆 | Manual |
Why Choose Shimmy?
- Zero Configuration: Auto-discovers models and assigns ports
- Native SafeTensors: No Python dependencies, 2x faster loading
- OpenAI Compatible: Drop-in replacement for OpenAI API calls
- Cross-Platform: Windows, macOS, Linux (including ARM64)
- Integration: Works with VSCode, Cursor, Continue.dev out of the box
BONUS: First-class LoRA adapter support - from training to production API in 30 seconds.
Quick Start (30 seconds)
# Install via cargo
# Auto-discover models and start server
# 🚀 Server running at http://localhost:11435
# ✅ Found 3 models: llama-3.2-1b, phi-3-mini, mistral-7b
# 📡 OpenAI API compatible endpoints ready
Point your AI tools to the displayed port - VSCode Copilot, Cursor, Continue.dev all work instantly!
📦 Installation
Package Managers
- Rust:
cargo install shimmy - VS Code: Shimmy Extension
Direct Downloads
- GitHub Releases: Latest binaries
- Docker:
docker pull ghcr.io/michael-a-kuykendall/shimmy:latest
🐳 Docker Setup
Quick Start:
# Clone repo and run (builds locally)
# Start with docker-compose (builds locally)
# Or pull from GitHub Container Registry
🍎 macOS Support
Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.
# Install dependencies
# Install shimmy
✅ Verified working:
- Intel and Apple Silicon Macs
- Metal GPU acceleration (automatic)
- Xcode 17+ compatibility
- All LoRA adapter features
Integration Examples
VSCode Copilot
Continue.dev
Cursor
Direct API Usage
# List available models
# Chat completion
🚀 Features
- 🔍 Auto-Discovery: Finds GGUF models automatically in standard locations
- 🎯 Smart Port Management: Assigns unique ports per model (11435, 11436, ...)
- ⚡ Fast Loading: Native SafeTensors support, no Python overhead
- 🔧 Zero Config: Works out of the box with sensible defaults
- 🎨 LoRA Support: Load LoRA adapters with
--loraflag - 📊 Monitoring: Built-in metrics and health endpoints
- 🐳 Docker Ready: Full containerization support
- 🔌 Plugin System: Extensible architecture for custom features
Command Reference
# Serve all models (auto-discovery)
# Serve specific model
# With LoRA adapter
# Custom port and host
# Discover models without serving
# Show version and build info
Development
# Build from source
# Full build with all features
# Minimal build (SafeTensors only)
License
MIT License - see LICENSE for details.
Support
- ⭐ Star us on GitHub: github.com/Michael-A-Kuykendall/shimmy
- 💬 Discussions: GitHub Discussions
- 🐛 Issues: GitHub Issues
- ❤️ Sponsor: GitHub Sponsors
Made with ❤️ for the AI development community