The 5MB Alternative to Ollama

Shimmy will be free forever. No asterisks. No "free for now." No pivot to paid.

What is Shimmy?

Shimmy is a 5.1MB single-binary local inference server that provides OpenAI API-compatible endpoints for GGUF models. It's designed to be the invisible infrastructure that just works.

Metric	Shimmy	Ollama
Binary Size	5.1MB 🏆	680MB
Startup Time	<100ms 🏆	5-10s
Memory Overhead	<50MB 🏆	200MB+
OpenAI Compatibility	100% 🏆	Partial
Port Management	Auto 🏆	Manual
Configuration	Zero 🏆	Manual

🎯 Perfect for Developers

Privacy: Your code stays on your machine
Cost: No per-token pricing, unlimited queries
Speed: Local inference = sub-second responses
Integration: Works with VSCode, Cursor, Continue.dev out of the box

BONUS: First-class LoRA adapter support - from training to production API in 30 seconds.

Quick Start (30 seconds)

Installation

# Install from crates.io (Linux, macOS, Windows)

cargo install shimmy


# Or download pre-built binary (Windows only)

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy.exe

⚠️ Windows Security Notice: Windows Defender may flag the binary as a false positive. This is common with unsigned Rust executables. Recommended: Use cargo install shimmy instead, or add an exclusion for shimmy.exe in Windows Defender.

Get Models

Shimmy auto-discovers models from:

Hugging Face cache: ~/.cache/huggingface/hub/
Ollama models: ~/.ollama/models/
Local directory: ./models/
Environment: SHIMMY_BASE_GGUF=path/to/model.gguf

# Download models that work out of the box

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/

huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF --local-dir ./models/

Start Server

# Auto-allocates port to avoid conflicts

shimmy serve


# Or use manual port

shimmy serve --bind 127.0.0.1:11435

Point your AI tools to the displayed port - VSCode Copilot, Cursor, Continue.dev all work instantly!

📦 Download & Install

Package Managers

Rust: cargo install shimmy
VS Code: Shimmy Extension
npm: npm install -g shimmy-js (coming soon)
Python: pip install shimmy (coming soon)

Direct Downloads

GitHub Releases: Latest binaries
Docker: docker pull shimmy/shimmy:latest (coming soon)

🍎 macOS Support

Full compatibility confirmed! Shimmy works flawlessly on macOS with Metal GPU acceleration.

# Install dependencies

brew install cmake rust


# Install shimmy

cargo install shimmy

✅ Verified working:

Intel and Apple Silicon Macs
Metal GPU acceleration (automatic)
Xcode 17+ compatibility
All LoRA adapter features

Integration Examples

VSCode Copilot

{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}

Continue.dev

{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai", 
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}

Cursor IDE

Works out of the box - just point to http://localhost:11435/v1

Why Shimmy Will Always Be Free

I built Shimmy because I was tired of 680MB binaries to run a 4GB model.

This is my commitment: Shimmy stays MIT licensed, forever. If you want to support development, sponsor it. If you don't, just build something cool with it.

Shimmy saves you time and money. If it's useful, consider sponsoring for $5/month — less than your Netflix subscription, infinitely more useful.

Performance Comparison

Tool	Binary Size	Startup Time	Memory Usage	OpenAI API
Shimmy	5.1MB	<100ms	50MB	100%
Ollama	680MB	5-10s	200MB+	Partial
llama.cpp	89MB	1-2s	100MB	None

API Reference

Endpoints

GET /health - Health check
POST /v1/chat/completions - OpenAI-compatible chat
GET /v1/models - List available models
POST /api/generate - Shimmy native API
GET /ws/generate - WebSocket streaming

CLI Commands

shimmy serve                    # Start server (auto port allocation)

shimmy serve --bind 127.0.0.1:8080  # Manual port binding

shimmy list                     # Show available models  

shimmy discover                 # Refresh model discovery

shimmy generate --name X --prompt "Hi"  # Test generation

shimmy probe model-name         # Verify model loads

Technical Architecture

Rust + Tokio: Memory-safe, async performance
llama.cpp backend: Industry-standard GGUF inference
OpenAI API compatibility: Drop-in replacement
Dynamic port management: Zero conflicts, auto-allocation
Zero-config auto-discovery: Just works™

Community & Support

🐛 Bug Reports: GitHub Issues
💬 Discussions: GitHub Discussions
📖 Documentation: docs/
💝 Sponsorship: GitHub Sponsors

License & Philosophy

MIT License - forever and always.

Philosophy: Infrastructure should be invisible. Shimmy is infrastructure.

Forever maintainer: Michael A. Kuykendall
Promise: This will never become a paid product
Mission: Making local AI development frictionless

"The best code is code you don't have to think about."

shimmy 0.1.1