shimmy 2.0.0

Lightweight Ollama-compatible inference server with native SafeTensors support. No Python dependencies, cross-platform WebGPU acceleration via Airframe.
<div align="center">
  <img src="assets/shimmy-logo.png" alt="Shimmy Logo" width="300" height="auto" />

  # The Lightweight OpenAI API Server

  ### 🔒 Local Inference Without Dependencies 🚀

  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Security](https://img.shields.io/badge/Security-Audited-green)](https://github.com/Michael-A-Kuykendall/shimmy/security)
  [![Crates.io](https://img.shields.io/crates/v/shimmy.svg)](https://crates.io/crates/shimmy)
  [![Downloads](https://img.shields.io/crates/d/shimmy.svg)](https://crates.io/crates/shimmy)
  [![Rust](https://img.shields.io/badge/rust-stable-brightgreen.svg)](https://rustup.rs/)
  [![GitHub Stars](https://img.shields.io/github/stars/Michael-A-Kuykendall/shimmy?style=social)](https://github.com/Michael-A-Kuykendall/shimmy/stargazers)

  [![๐Ÿ’ Sponsor this project](https://img.shields.io/badge/๐Ÿ’_Sponsor_this_project-ea4aaa?style=for-the-badge&logo=github&logoColor=white)](https://github.com/sponsors/Michael-A-Kuykendall)

  **Languages:** [简体中文](docs/zh-CN/README.md) · [繁體中文](docs/zh-TW/README.md)
</div>

**Shimmy will be free forever.** No asterisks. No "free for now." No pivot to paid.

### ๐Ÿ’ Support Shimmy's Growth


๐Ÿš€ **If Shimmy helps you, consider [sponsoring](https://github.com/sponsors/Michael-A-Kuykendall) โ€” 100% of support goes to keeping it free forever.**

- **$5/month**: Coffee tier โ˜• - Eternal gratitude + sponsor badge
- **$25/month**: Bug prioritizer ๐Ÿ› - Priority support + name in [SPONSORS.md]SPONSORS.md
- **$100/month**: Corporate backer ๐Ÿข - Logo placement + monthly office hours
- **$500/month**: Infrastructure partner ๐Ÿš€ - Direct support + roadmap input

[**๐ŸŽฏ Become a Sponsor**]https://github.com/sponsors/Michael-A-Kuykendall | See our amazing [sponsors]SPONSORS.md ๐Ÿ™

---

## Table of Contents


- [What Is Shimmy?](#drop-in-openai-api-replacement-for-local-llms)
- [🔥 Airframe Engine (v2.0)](#-airframe-engine)
- [🎯 Supported Models](#-supported-models)
- [📦 Migrating from v1.x](#-migrating-from-v1x)
- [⚡ Quick Start (30 seconds)](#quick-start-30-seconds)
- [🚀 OpenAI SDK Compatibility](#-compatible-with-openai-sdks-and-tools)
- [🔧 Extended Context](#-extended-context)
- [📥 Download & Install](#-download--install)
- [🔗 Integration Examples](#integration-examples)
- [📖 API Reference](#api-reference)
- [❓ FAQ](#-faq)
- [🏛️ Technical Architecture](#technical-architecture)
- [📚 Documentation Hub](#-documentation-hub)
- [🌍 Community & Support](#community--support)
- [⚡ Performance](#-performance-comparison)
- [License](#license--philosophy)

---

## Drop-in OpenAI API Replacement for Local LLMs


Shimmy is a **single binary** that provides **100% OpenAI-compatible endpoints** for GGUF models. Point your existing AI tools at Shimmy and they just work: locally, privately, and free.

**🎉 NEW in v2.0.0**: Shimmy now runs on [Airframe](#-airframe-engine), a pure-Rust WGSL GPU engine. No C++ toolchain, no backend flags, no compilation required.

## 🔥 Airframe Engine

Starting in v2.0.0, Shimmy's default inference engine is **Airframe**, a pure-Rust WebGPU (WGSL) transformer runtime built from scratch.

**Why this matters:**
- No C++ toolchain required: Rust only, top to bottom
- F32 precision throughout for deterministic, high-quality output
- WGSL compute shaders work on any GPU via WebGPU (NVIDIA, AMD, Intel, integrated)
- Model spec auto-derived from GGUF metadata, with no hardcoded per-model constants
- YaRN RoPE scaling for extended context via `SHIMMY_MAX_CTX`: the engine allocates the KV cache and sets the RoPE scale automatically (see [Extended Context](#-extended-context) below)

**Quick start with Airframe (v2.0.0+):**
```bash
# Default: 2048-token context
SHIMMY_BASE_GGUF=/path/to/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf ./shimmy serve

# Extended context (4096 tokens; YaRN RoPE enabled automatically, KV cache resized)
SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 ./shimmy serve
```

## 🎯 Supported Models

Airframe v2.0 ships with GPU-verified support for **7 models** across **5 architectures** and **5 quantization types**, covering the models most commonly run on consumer hardware. The context window is read directly from each model's GGUF metadata, with no hardcoded limits.

### ✅ Locally Validated (GPU Math Verified)


| Model | Architecture | Quant | Size | Context | Min VRAM |
|---|---|---|---|---|---|
| [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) | Llama | Q4_0 | 638 MB | 2048 | ~800 MB |
| [Llama-3.2-1B-Instruct](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) | Llama | Q4_K_M | 770 MB | 131072* | ~1 GB |
| [Llama-3.2-3B-Instruct](https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF) | Llama | Q4_K_M | 1.9 GB | 131072* | ~2.5 GB |
| [phi-2](https://huggingface.co/TheBloke/phi-2-GGUF) | Phi-2 | Q4_K_M | 1.7 GB | 2048 | ~2.2 GB |
| [gemma-2-2b-it](https://huggingface.co/bartowski/gemma-2-2b-it-GGUF) | Gemma-2 | Q4_K_M | 1.6 GB | 8192 | ~2 GB |
| [starcoder2-3b](https://huggingface.co/second-state/StarCoder2-3B-GGUF) | StarCoder2 | Q4_K_M | 1.8 GB | 16384 | ~2.3 GB |
| [gpt2](https://huggingface.co/ggerganov/ggml/blob/main/gpt-2-117M-q4_0.bin) | GPT-2 | Q4_K_M | 107 MB | 1024 | ~200 MB |

> \* Llama-3.2's native context is 131072 tokens. Airframe reads this from GGUF and allocates KV cache accordingly. Use `SHIMMY_MAX_CTX=8192` for a practical 8K window on consumer hardware (~512 MB of F32 KV cache for the 1B model, four times the ~128 MB needed at 2048 tokens).

**GPU Math Verified** means the Airframe GPU dequantization shader produces results matching the CPU reference implementation, independently confirmed for every tensor type in each model. This is done via `quant_verify`, which tests 512 elements per quantization type per model.

### โณ Roadmap โ€” Larger Models (Require โ‰ฅ16 GB VRAM)


| Model | Architecture | Quant | Size | Status |
|---|---|---|---|---|
| deepseek-coder-6.7b-instruct | Llama | Q4_K_M | 3.9 GB | Pending remote GPU validation |
| deepseek-llm-7b-chat | Llama | Q4_K_M | 4.0 GB | Pending remote GPU validation |
| qwen2-7b-instruct | Qwen2 | Q4_K_M | 4.5 GB | Pending remote GPU validation |
| Phi-3.5-mini-instruct | Phi-3 | Q4_K_M | 2.3 GB | Requires fused QKV support (planned) |

### ✅ Supported Quantization Types

| Type | GGML ID | Notes |
|---|---|---|
| `F32` | 0 | Raw floats, maximum precision |
| `F16` | 1 | Half-precision floats |
| `Q4_0` | 2 | 4-bit, 32-element blocks |
| `Q8_0` | 8 | 8-bit, 32-element blocks |
| `Q4_K` | 12 | 4-bit K-quant superblocks (256 elements), used in Q4_K_M GGUFs |
| `Q5_K` | 13 | 5-bit K-quant superblocks, used alongside Q4_K in mixed-precision models |
| `Q6_K` | 14 | 6-bit K-quant superblocks, typically used for output/embedding layers |

All types are implemented in both the GPU inference shader and a CPU reference implementation. GPU vs CPU agreement is validated for every type.
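
To make the CPU-reference idea concrete, here is a minimal Python sketch of dequantizing a single `Q4_0` block. It assumes the GGML block layout (a little-endian half-precision scale followed by 16 bytes that pack 32 four-bit values, each stored with an offset of 8). The function name `dequantize_q4_0_block` is hypothetical, and this is an illustration rather than the code used by Airframe or `quant_verify`.

```python
import struct

Q4_0_BLOCK_ELEMS = 32   # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 18   # 2-byte f16 scale + 16 bytes of packed nibbles


def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one Q4_0 block into 32 floats (assumed GGML layout)."""
    if len(block) != Q4_0_BLOCK_BYTES:
        raise ValueError(f"expected {Q4_0_BLOCK_BYTES} bytes, got {len(block)}")
    (scale,) = struct.unpack_from("<e", block, 0)  # f16 scale factor
    packed = block[2:]
    # Low nibbles hold elements 0..15, high nibbles hold elements 16..31.
    low = [((b & 0x0F) - 8) * scale for b in packed]
    high = [((b >> 4) - 8) * scale for b in packed]
    return low + high


# Example block: scale 0.5, every byte 0x98
# (low nibble 8 -> 0.0, high nibble 9 -> 0.5).
block = struct.pack("<e", 0.5) + bytes([0x98] * 16)
values = dequantize_q4_0_block(block)
```

A GPU-versus-CPU agreement check would compare the shader's output for the same block against values like these, typically within a small tolerance.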

**Auto-discovery is enabled.** If Shimmy finds GGUF models in your HuggingFace cache, Ollama directory, LM Studio cache (`~/.cache/lm-studio/models`), or local `./models/` folder, it will register and serve them automatically. See [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md) for the full compatibility matrix.

## 📦 Migrating from v1.x


The llama.cpp backend is **removed in v2.0.0**. The Airframe engine is the only inference path.
See [docs/MIGRATION_v2.md](docs/MIGRATION_v2.md) for the step-by-step migration guide.

## Developer Tools


Whether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.

### Try it in 30 seconds


```bash
# 1) Download a pre-built binary and start the server on a fixed port

# Windows (Git Bash or another POSIX-style shell):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe
SHIMMY_BASE_GGUF="C:/path/to/model.gguf" ./shimmy.exe serve --bind 127.0.0.1:11435 &

# Linux x86_64 (on macOS, download shimmy-macos-arm64 or shimmy-macos-intel instead):
curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy
SHIMMY_BASE_GGUF=/path/to/model.gguf ./shimmy serve --bind 127.0.0.1:11435 &

# 2) See registered models
./shimmy list

# 3) Smoke test the OpenAI API (use a model name printed by `shimmy list`)
curl -s http://127.0.0.1:11435/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model":"tinyllama-1.1b",
        "messages":[{"role":"user","content":"Say hi in 5 words."}],
        "max_tokens":32
      }' | jq -r '.choices[0].message.content'
```

## 🚀 Compatible with OpenAI SDKs and Tools


**No code changes needed** - just change the API endpoint:

- **Any OpenAI client**: Python, Node.js, curl, etc.
- **Development applications**: Compatible with standard SDKs
- **VSCode Extensions**: Point to `http://localhost:11435`
- **Cursor Editor**: Built-in OpenAI compatibility
- **Continue.dev**: Drop-in model provider

### Use with OpenAI SDKs


- Node.js (openai v4)

```ts
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://127.0.0.1:11435/v1",
  apiKey: "sk-local", // placeholder, Shimmy ignores it
});

const resp = await openai.chat.completions.create({
  model: "REPLACE_WITH_MODEL",
  messages: [{ role: "user", content: "Say hi in 5 words." }],
  max_tokens: 32,
});

console.log(resp.choices[0].message?.content);
```

- Python (openai>=1.0.0)

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="REPLACE_WITH_MODEL",
    messages=[{"role": "user", "content": "Say hi in 5 words."}],
    max_tokens=32,
)

print(resp.choices[0].message.content)
```

## ⚡ Zero Configuration Required


- **Automatically finds models** from Hugging Face cache, Ollama, LM Studio (`~/.cache/lm-studio/models`), and local dirs
- **Auto-allocates ports** to avoid conflicts
- **Auto-detects LoRA adapters** for specialized models
- **Just works** - no config files, no setup wizards
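
One common way to auto-allocate a port is to bind to port 0 and let the operating system choose a free one. The Python sketch below shows that technique only; whether Shimmy allocates ports this way is not stated here, and `find_free_port` is a hypothetical helper.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))           # port 0 means "pick any free port"
        return sock.getsockname()[1]   # the port the OS actually assigned


port = find_free_port()
```

The port is released when the socket closes, so another process could claim it before a server binds to it. A real server would normally keep the bound socket open and serve on it directly.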

## 🧠 Advanced MOE (Mixture of Experts) Support

> **Note**: MoE (Mixture of Experts) CPU offloading is on the Airframe roadmap. See [docs/AIRFRAME_MOE_ROADMAP.md](docs/AIRFRAME_MOE_ROADMAP.md) for the implementation plan.

**Run 70B+ models on consumer hardware**: coming to the Airframe engine. Track progress on the [roadmap](docs/ROADMAP.md).

**Perfect for**: Large models (70B+), limited VRAM systems, cost-effective inference

## 🎯 Perfect for Local Development


- **Privacy**: Your code never leaves your machine
- **Cost**: No API keys, no per-token billing
- **Speed**: Local inference, sub-second responses
- **Reliability**: No rate limits, no downtime

## Quick Start (30 seconds)


### Installation


**v2.0.0**: Download pre-built binaries with Airframe WebGPU engine included!

#### **📥 Pre-Built Binaries (Recommended, Zero Dependencies)**

Pick your platform and download. No compilation is needed, and GPU acceleration is included:

```bash
# Windows x64 (Airframe WebGPU engine)

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe -o shimmy.exe

# Linux x86_64 (Airframe WebGPU engine)

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64 -o shimmy && chmod +x shimmy

# macOS ARM64 (Airframe with Metal backend via wgpu)

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy

# macOS Intel

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel -o shimmy && chmod +x shimmy

# Linux ARM64 (huggingface engine; Airframe cross-compilation not yet supported)

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64 -o shimmy && chmod +x shimmy
```

**That's it!** The Airframe WebGPU adapter is selected automatically at runtime.

#### **๐Ÿ› ๏ธ Build from Source / cargo install**


```bash
# Install from crates.io (huggingface engine; works without a GPU)

cargo install shimmy

# Build from source with Airframe GPU engine (requires airframe submodule)

git clone https://github.com/Michael-A-Kuykendall/shimmy --recurse-submodules
cd shimmy
cargo build --release --features airframe,huggingface
```

> **Note**: The GitHub Releases binaries already include the Airframe engine. Building from source with `--features airframe` is for contributors or custom builds.

### GPU Acceleration


**v2.0.0**: Airframe uses **WebGPU (wgpu)** for GPU acceleration. No backend flags, no driver installation beyond standard OS graphics drivers.

#### **📥 Download Pre-Built Binaries (Recommended)**


Release binaries include the Airframe engine with WebGPU support compiled in:

| Platform | Download | GPU Backend | Notes |
|----------|----------|-------------|-------|
| **Windows x64** | [shimmy-windows-x86_64.exe](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-windows-x86_64.exe) | WebGPU (wgpu) | NVIDIA, AMD, Intel |
| **Linux x86_64** | [shimmy-linux-x86_64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-x86_64) | WebGPU (wgpu) | NVIDIA, AMD, Intel |
| **macOS ARM64** | [shimmy-macos-arm64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64) | Metal (via wgpu) | Apple Silicon |
| **macOS Intel** | [shimmy-macos-intel](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-intel) | Metal (via wgpu) | Intel Mac |
| **Linux ARM64** | [shimmy-linux-aarch64](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-linux-aarch64) | huggingface only | ARM cross-build |

#### **🎯 How GPU Selection Works**

Airframe uses wgpu's adapter enumeration. On first launch it selects the best available GPU adapter for your system: a discrete GPU is preferred over an integrated one, and an integrated GPU over the CPU fallback. No configuration is needed.
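
The preference order described above can be sketched as a simple ranking. This Python example is illustrative only: the `Adapter` type, the `kind` labels, and `pick_adapter` are hypothetical names, and Airframe's real selection logic lives in Rust on top of wgpu.

```python
from dataclasses import dataclass

# Lower rank = more preferred: discrete, then integrated, then CPU fallback.
PREFERENCE = {"discrete": 0, "integrated": 1, "cpu": 2}


@dataclass(frozen=True)
class Adapter:
    name: str
    kind: str  # "discrete", "integrated", or "cpu" (hypothetical labels)


def pick_adapter(adapters: list[Adapter]) -> Adapter:
    """Return the most preferred adapter; ties keep enumeration order."""
    if not adapters:
        raise ValueError("no adapters available")
    return min(adapters, key=lambda a: PREFERENCE[a.kind])


adapters = [
    Adapter("Integrated Graphics", "integrated"),
    Adapter("Discrete GPU", "discrete"),
    Adapter("Software Rasterizer", "cpu"),
]
best = pick_adapter(adapters)
```

Here `best` is the discrete adapter even though it was not enumerated first.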

```bash
# Check selected adapter

shimmy gpu-info

# Start serving (GPU adapter auto-selected)

shimmy serve
```

#### **🔧 Extended Context**


`SHIMMY_MAX_CTX` overrides the context window at the engine level. When set above the model's native window, Airframe automatically engages YaRN RoPE scaling and resizes the KV cache accordingly.

```bash
# 4096-token context with YaRN (2x native window for TinyLlama)

SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=4096 shimmy serve

# 8192 tokens (4x native, higher RoPE compression)

SHIMMY_BASE_GGUF=/path/to/model.gguf SHIMMY_MAX_CTX=8192 shimmy serve
```

> **Note:** Extended context beyond 4096 is functional but not yet as deeply validated as the native 2048-token window. The accepted range is 512 to 131072. Values outside that range are silently ignored and 2048 is used.
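
Because out-of-range values are ignored silently, a launcher script may want to validate `SHIMMY_MAX_CTX` before starting the server. The following Python sketch checks the documented 512 to 131072 range; `validated_max_ctx` is a hypothetical helper, not part of Shimmy.

```python
MIN_CTX = 512
MAX_CTX = 131072


def validated_max_ctx(raw: str) -> int:
    """Parse a SHIMMY_MAX_CTX value, rejecting anything outside 512..131072."""
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"SHIMMY_MAX_CTX must be an integer, got {raw!r}") from None
    if not MIN_CTX <= value <= MAX_CTX:
        raise ValueError(f"SHIMMY_MAX_CTX={value} is outside {MIN_CTX}..{MAX_CTX}")
    return value


ctx = validated_max_ctx("8192")
```

Failing loudly in the launcher avoids the case where a typo quietly falls back to a 2048-token window.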

#### **💾 VRAM Sizing Reference**

Airframe allocates VRAM at load time: **weights** + **KV cache**. The KV cache is F32 and scales linearly with context length (`n_layers × n_kv_heads × head_dim × ctx × 2 × 4 bytes`).

**TinyLlama 1.1B Q4_0 (the v2.0 validated path):**

| Context (`SHIMMY_MAX_CTX`) | KV cache | Weights | Total | Min VRAM |
|---|---|---|---|---|
| 2048 (default) | ~88 MB | ~638 MB | ~726 MB | **~800 MB** |
| 4096 | ~176 MB | ~638 MB | ~814 MB | **~900 MB** |
| 8192 | ~352 MB | ~638 MB | ~990 MB | **~1.1 GB** |
| 16384 | ~704 MB | ~638 MB | ~1.3 GB | **~1.5 GB** |

> On integrated graphics (Intel Iris, Apple M-series unified memory, AMD Vega), running at 2048 context needs ~800 MB, comfortably inside the 2 GB allocation most integrated GPUs share with system RAM.

**Scaling up to larger models** (architecture and quant support required; see [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md)):

| Model | Quant | Weights | KV @ 2048 ctx | Min VRAM |
|---|---|---|---|---|
| Llama 3.2 1B | Q4_0 | ~636 MB | ~128 MB | ~900 MB |
| Llama 3.2 3B | Q4_0 | ~1.9 GB | ~448 MB | ~2.5 GB |
| Mistral 7B | Q4_K_M | ~4.1 GB | ~512 MB | ~5 GB |
| Llama 3 8B | Q4_K_M | ~4.7 GB | ~512 MB | ~5.5 GB |

The KV cache formula for any model: `n_layers × n_kv_heads × head_dim × ctx × 2 × 4 bytes`. Because the cache scales linearly with context, multiply the 2048-context figure by `SHIMMY_MAX_CTX / 2048` to get the extended-context allocation.
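
As a worked example, the formula can be evaluated directly. The Python sketch below reproduces the TinyLlama rows of the table above; the shape values (22 layers, 4 KV heads, head dimension 64) are assumptions taken from the model's published configuration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int) -> int:
    """F32 KV cache size: keys and values (x2), 4 bytes per element."""
    return n_layers * n_kv_heads * head_dim * ctx * 2 * 4


MIB = 1024 * 1024

# Assumed TinyLlama 1.1B shape: 22 layers, 4 KV heads, head_dim 64.
tinyllama_2048 = kv_cache_bytes(22, 4, 64, 2048)
tinyllama_8192 = kv_cache_bytes(22, 4, 64, 8192)

print(tinyllama_2048 // MIB)  # 88
print(tinyllama_8192 // MIB)  # 352
```

Adding the weight size to the result gives the total to compare against available VRAM.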

### Get Models


Shimmy auto-discovers models from:
- **Hugging Face cache**: `~/.cache/huggingface/hub/`
- **Ollama models**: `~/.ollama/models/`
- **Local directory**: `./models/`
- **Environment**: `SHIMMY_BASE_GGUF=path/to/model.gguf`
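
To preview what a directory scan might pick up, the Python sketch below walks the locations listed above and collects files ending in `.gguf`. Matching on the file extension is a simplifying assumption (real discovery may inspect file contents or store-specific layouts), and `find_gguf_models` is a hypothetical helper.

```python
from pathlib import Path


def find_gguf_models(search_dirs: list[Path]) -> list[Path]:
    """Recursively collect *.gguf files from the directories that exist."""
    found: list[Path] = []
    for directory in search_dirs:
        if directory.is_dir():
            found.extend(sorted(directory.rglob("*.gguf")))
    return found


# Directories named in the list above.
DEFAULT_DIRS = [
    Path.home() / ".cache" / "huggingface" / "hub",
    Path.home() / ".ollama" / "models",
    Path("models"),
]
```

Calling `find_gguf_models(DEFAULT_DIRS)` returns the matching paths, skipping any directory that does not exist.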

```bash
# Primary validated model for Airframe v2.0

huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF \
  --include "tinyllama-1.1b-chat-v1.0.Q4_0.gguf" --local-dir ./models/

# Alternative 1B model that also fits in the same hardware envelope

huggingface-cli download bartowski/Llama-3.2-1B-Instruct-GGUF \
  --include "*Q4_K_M*" --local-dir ./models/
```

### Start Server


```bash
# Auto-allocates port to avoid conflicts

shimmy serve

# Or use manual port

shimmy serve --bind 127.0.0.1:11435
```

Point your development tools to the displayed port: VSCode Copilot, Cursor, and Continue.dev all work instantly.

## 📦 Download & Install


### Package Managers

- **Rust**: [`cargo install shimmy`](https://crates.io/crates/shimmy) *(installs huggingface engine; for Airframe GPU, use GitHub Releases binaries)*
- **VS Code**: [Shimmy Extension](https://marketplace.visualstudio.com/items?itemName=targetedwebresults.shimmy-vscode)
- **npm**: `npm install -g shimmy-js` *(planned)*
- **Python**: `pip install shimmy` *(planned)*

### Direct Downloads

- **GitHub Releases**: [Latest binaries](https://github.com/Michael-A-Kuykendall/shimmy/releases/latest)
- **Docker**: `docker pull shimmy/shimmy:latest` *(coming soon)*

### ๐ŸŽ macOS Support


**Full compatibility confirmed!** Shimmy works on macOS with Metal GPU acceleration via wgpu.

```bash
# Install from crates.io (huggingface engine)

cargo install shimmy

# For Airframe GPU engine, download the macOS binary from GitHub Releases:

curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/latest/download/shimmy-macos-arm64 -o shimmy && chmod +x shimmy
```

**✅ Verified working:**
- Intel and Apple Silicon Macs
- Metal GPU acceleration via wgpu (automatic on Apple Silicon)
- Xcode 17+ compatibility

## Integration Examples


### VSCode Copilot

```json
{
  "github.copilot.advanced": {
    "serverUrl": "http://localhost:11435"
  }
}
```

### Continue.dev

```json
{
  "models": [{
    "title": "Local Shimmy",
    "provider": "openai",
    "model": "your-model-name",
    "apiBase": "http://localhost:11435/v1"
  }]
}
```

### Cursor IDE

Works out of the box - just point to `http://localhost:11435/v1`

## Why Shimmy Will Always Be Free


I built Shimmy to retain privacy-first control over my AI development and keep things local and lean.

**This is my commitment**: Shimmy stays MIT licensed, forever. If you want to support development, [sponsor it](https://github.com/sponsors/Michael-A-Kuykendall). If you don't, just build something cool with it.

> 💡 **Shimmy saves you time and money. If it's useful, consider [sponsoring for $5/month](https://github.com/sponsors/Michael-A-Kuykendall): less than your Netflix subscription, infinitely more useful for developers.**

## API Reference


### Endpoints

- `GET /health` - Health check
- `POST /v1/chat/completions` - OpenAI-compatible chat (streaming supported)
- `POST /v1/completions` - OpenAI-compatible text completions
- `GET /v1/models` - List available models
- `POST /api/generate` - Shimmy native API
- `GET /ws/generate` - WebSocket streaming
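
For clients that prefer not to depend on an SDK, a request to the chat endpoint can be assembled with the Python standard library. This sketch only builds the request object so it can be inspected without a running server; `build_chat_request` is a hypothetical helper, and the base URL and model name are placeholders.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 32) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://127.0.0.1:11435", "REPLACE_WITH_MODEL",
                             "Say hi in 5 words.")
# With a server running, send it with:
#   with urllib.request.urlopen(request) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```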

### Environment Variables


| Variable | Default | Description |
|---|---|---|
| `SHIMMY_BASE_GGUF` | *(auto-discover)* | Path to GGUF model file loaded as the default model |
| `SHIMMY_PORT` | `8080` | Port to listen on (Airframe server binary) |
| `SHIMMY_BIND_ADDRESS` | `0.0.0.0:8080` | Full bind address (overrides port) |
| `SHIMMY_MAX_CTX` | *(from GGUF)* | Override context window; activates YaRN RoPE scaling when above model native |
| `SHIMMY_MODEL_PATHS` | *(see Zero Config)* | Colon-separated extra model search paths |
| `SHIMMY_ENGINE_BACKEND` | `airframe` | `airframe` (default) or `llama` (legacy path) |
| `SHIMMY_ROPE_SCALE` | *(auto)* | Override computed YaRN scale factor |
| `RUST_BACKTRACE` | *(off)* | Set to `1` to print crash backtraces |

### CLI Commands

```bash
shimmy serve                              # Start server (auto port allocation)
shimmy serve --bind 127.0.0.1:8080        # Manual port binding
shimmy serve --gpu-backend auto           # WebGPU adapter auto-select (default)
shimmy serve --gpu-backend cpu            # Force CPU (disable GPU)
shimmy list                               # Show available models
shimmy discover                           # Refresh model discovery
shimmy generate --name X --prompt "Hi"   # Test generation
shimmy probe model-name                   # Verify model loads
shimmy gpu-info                           # Show selected WebGPU adapter
```

## Technical Architecture


- **Rust + Tokio**: Memory-safe, async performance
- **Airframe engine**: Pure-Rust WGSL GPU inference with no C++ toolchain, deterministic output, GGUF-native
- **OpenAI API compatibility**: Drop-in replacement
- **Dynamic port management**: Zero conflicts, auto-allocation
- **Zero-config auto-discovery**: Just works™

### 🚀 Advanced Features

- **🧠 MOE CPU Offloading**: Hybrid GPU/CPU processing for large models (70B+), planned for the Airframe engine
- **🎯 Smart Model Filtering**: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)
- **🛡️ 6-Gate Release Validation**: Constitutional quality limits ensure reliability
- **⚡ Smart Model Preloading**: Background loading with usage tracking for instant model switching
- **💾 Response Caching**: LRU + TTL cache delivering 20-40% performance gains on repeat queries
- **🚀 Integration Templates**: One-command deployment for Docker, Kubernetes, Railway, Fly.io, FastAPI, Express
- **🔄 Request Routing**: Multi-instance support with health checking and load balancing
- **📊 Advanced Observability**: Real-time metrics with self-optimization and Prometheus integration
- **🔗 RustChain Integration**: Universal workflow transpilation with workflow orchestration

---

## โ“ FAQ


**Does Shimmy work on my GPU?**
Shimmy uses WebGPU (via the Airframe engine) which runs on Vulkan, D3D12, and Metal โ€” covering NVIDIA, AMD, Intel, and Apple Silicon. No CUDA required. See [GPU requirements in TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) if you hit adapter errors.

**What's the difference between Shimmy and llama.cpp / Ollama?**
Shimmy is written in pure Rust with no C++ toolchain dependency. The Airframe engine runs WGSL compute shaders compiled at startup โ€” no pre-built binaries, no driver version pinning. The result is faster startup, lower memory overhead, and deterministic output. See the [GPU Pipeline doc](docs/GPU_PIPELINE.md) for internals.

**Why do I need `SHIMMY_BASE_GGUF` or `LIBSHIMMY_MODEL_PATH`?**
If you don't set these, Shimmy auto-discovers models in standard directories (`~/.cache/huggingface`, `~/.ollama`, `~/lm-studio/models`, `~/.cache/lm-studio/models`, `~/Library/Application Support/LMStudio`). Set `SHIMMY_BASE_GGUF` to override and point directly at a specific GGUF file.

**Can I run multiple models at once?**
Not currently. Shimmy loads one model per server instance. To serve multiple models, run multiple server instances on different ports. Hot-swapping models (reload without restart) is on the roadmap.
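
A small launcher can plan one instance per model, each on its own port. The Python sketch below only builds the launch specifications, using the `--bind` flag and `SHIMMY_BASE_GGUF` variable documented in this README; `plan_instances`, the model names, and the paths are hypothetical placeholders.

```python
import os


def plan_instances(models: dict[str, str], first_port: int = 11435,
                   host: str = "127.0.0.1") -> list[dict]:
    """Assign each GGUF path its own port and build a launch spec."""
    plans = []
    for offset, (name, gguf_path) in enumerate(models.items()):
        plans.append({
            "name": name,
            "argv": ["shimmy", "serve", "--bind", f"{host}:{first_port + offset}"],
            "env": {**os.environ, "SHIMMY_BASE_GGUF": gguf_path},
        })
    return plans


plans = plan_instances({
    "tinyllama": "/path/to/tinyllama.gguf",        # placeholder paths
    "llama-3.2-1b": "/path/to/llama-3.2-1b.gguf",
})
# Each plan can be started with subprocess.Popen(plan["argv"], env=plan["env"]).
```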

**Why does generation stop before `max_tokens`?**
The model reached a natural end-of-sequence token. For chat models this is expected behavior: the model signals it's done. If you want to force longer output, increase `max_tokens` and set `temperature > 0`. If generation stops on the wrong token, the model may be using the wrong chat template; see [CHAT_TEMPLATES.md](docs/CHAT_TEMPLATES.md).

**Is there streaming support?**
Set `"stream": true` in your request. Shimmy returns Server-Sent Events in the standard OpenAI streaming format.
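
On the client side, the stream can be reassembled by reading `data:` lines until the `[DONE]` sentinel. The Python sketch below assumes the standard OpenAI chat-completions chunk shape (`choices[].delta.content`) and runs against hard-coded sample lines rather than a live server; `collect_stream_text` is a hypothetical helper.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta content from OpenAI-style SSE lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":              # end-of-stream sentinel
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hi"}}]}',
    'data: {"choices":[{"delta":{"content":" there"}}]}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
```

The official OpenAI SDKs do this parsing for you when `stream=True` is passed, so a hand-rolled parser is mainly useful for minimal clients.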

**Q4_K_M vs Q4_0: which should I use?**
`Q4_K_M` (K-quant) is consistently better quality than `Q4_0` for the same file size. Use `Q4_0` only when you need maximum compatibility or the model isn't available in K-quant. See [QUANTIZATION.md](docs/QUANTIZATION.md) for the full analysis.

**Can I extend the context window beyond what the model was trained on?**
Yes. Set `SHIMMY_MAX_CTX` to any value in the accepted range (512 to 131072). Airframe applies YaRN scaling automatically when the requested context exceeds the model's native context. Quality degrades gradually beyond 2× the native context. See [EXTENDED_CONTEXT.md](docs/EXTENDED_CONTEXT.md).

---

## 📚 Documentation Hub


Full documentation lives in [docs/](docs/). Use this table to find what you need:

### Getting Started

| Document | Description |
|---|---|
| [docs/quickstart.md](docs/quickstart.md) | 5-minute getting started guide |
| [docs/MIGRATION_v2.md](docs/MIGRATION_v2.md) | Migrating from Shimmy v1.x |
| [docs/CONFIGURATION.md](docs/CONFIGURATION.md) | All environment variables and config options |
| [docs/WINDOWS_GPU_BUILD_GUIDE.md](docs/WINDOWS_GPU_BUILD_GUIDE.md) | Windows-specific build instructions |

### API & Integration

| Document | Description |
|---|---|
| [docs/API.md](docs/API.md) | Complete API endpoint reference |
| [docs/OPENAI_COMPAT.md](docs/OPENAI_COMPAT.md) | OpenAI compatibility matrix: what's supported |
| [docs/INTEGRATION.md](docs/INTEGRATION.md) | Integrating with LangChain, OpenAI SDKs, etc. |
| [docs/EXAMPLES.md](docs/EXAMPLES.md) | Runnable code examples |
| [docs/CROSS_COMPILATION.md](docs/CROSS_COMPILATION.md) | Building for other targets (ARM, Linux from Windows) |

### Engine Deep Dives

| Document | Description |
|---|---|
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | System-level architecture and component map |
| [docs/GPU_PIPELINE.md](docs/GPU_PIPELINE.md) | Bindless GPU architecture, WGSL shaders, dispatch patterns |
| [docs/QUANTIZATION.md](docs/QUANTIZATION.md) | Q4_0, Q8_0, K-quant formats: bit-level internals |
| [docs/EXTENDED_CONTEXT.md](docs/EXTENDED_CONTEXT.md) | YaRN RoPE scaling, VRAM math, context extension |
| [docs/CHAT_TEMPLATES.md](docs/CHAT_TEMPLATES.md) | Chat template auto-detection and format reference |
| [docs/MODEL_EXPANSION.md](docs/MODEL_EXPANSION.md) | Model onboarding protocol and acceptance gates |

### Troubleshooting & Reference

| Document | Description |
|---|---|
| [docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) | Diagnostic guide for GPU errors, model failures, port conflicts |
| [docs/PERFORMANCE.md](docs/PERFORMANCE.md) | Performance tuning and token/sec benchmarks |
| [docs/FEATURES.md](docs/FEATURES.md) | Complete feature list |

### Development & Methodology

| Document | Description |
|---|---|
| [docs/METHODOLOGY.md](docs/METHODOLOGY.md) | Engineering methodology and quality standards |
| [docs/REGRESSION_TESTING.md](docs/REGRESSION_TESTING.md) | Regression testing approach |
| [docs/ppt-invariant-testing.md](docs/ppt-invariant-testing.md) | Property-based and invariant testing details |
| [docs/METRICS.md](docs/METRICS.md) | Observability and metrics reference |

---

## Community & Support


- **๐Ÿ› Bug Reports**: [GitHub Issues]https://github.com/Michael-A-Kuykendall/shimmy/issues
- **๐Ÿ’ฌ Discussions**: [GitHub Discussions]https://github.com/Michael-A-Kuykendall/shimmy/discussions
- **๐Ÿ“– Documentation**: [Full Documentation Hub]#-documentation-hub โ€ข [Migration Guide v1โ†’v2]docs/MIGRATION_v2.md โ€ข [Engineering Methodology]docs/METHODOLOGY.md โ€ข [OpenAI Compatibility Matrix]docs/OPENAI_COMPAT.md โ€ข [Architecture]docs/ARCHITECTURE.md โ€ข [GPU Pipeline]docs/GPU_PIPELINE.md โ€ข [Troubleshooting]docs/TROUBLESHOOTING.md
- **๐Ÿ’ Sponsorship**: [GitHub Sponsors]https://github.com/sponsors/Michael-A-Kuykendall

### Star History


[![Star History Chart](https://api.star-history.com/svg?repos=Michael-A-Kuykendall/shimmy&type=Timeline)](https://www.star-history.com/#Michael-A-Kuykendall/shimmy&Timeline)

### 🚀 Momentum Snapshot

🌟 **![GitHub stars](https://img.shields.io/github/stars/Michael-A-Kuykendall/shimmy?style=flat&color=yellow) stars and climbing fast**
⏱ **<1s startup**
🦀 **100% Rust, no Python**

### 📰 As Featured On

🔥 [**Hacker News**](https://news.ycombinator.com/item?id=45130322) • [**Front Page Again**](https://news.ycombinator.com/item?id=45199898) • [**IPE Newsletter**](https://ipenewsletter.substack.com/p/the-strange-new-side-hustles-of-openai)

**Companies**: Need invoicing? Email [michaelallenkuykendall@gmail.com](mailto:michaelallenkuykendall@gmail.com)

## ⚡ Performance Comparison


| Tool | Startup Time | Memory Usage | OpenAI API |
|------|--------------|--------------|------------|
| **Shimmy** | **<100ms** | **50MB** | **100%** |
| Ollama | 5-10s | 200MB+ | Partial |

## Quality & Reliability


Shimmy maintains high code quality through comprehensive testing:

- **Comprehensive test suite** with property-based testing
- **Automated CI/CD pipeline** with quality gates
- **Runtime invariant checking** for critical operations
- **Cross-platform compatibility testing**

### Development Testing

Run the complete test suite:

```bash
# Using cargo aliases

cargo test-quick           # Quick development tests

# Using Makefile  

make test                  # Full test suite
make test-quick            # Quick development tests
```

See our [testing approach](docs/ppt-invariant-testing.md) for technical details.

---

## License & Philosophy


MIT License - forever and always.

**Philosophy**: Infrastructure should be invisible. Shimmy is infrastructure.

**Testing Philosophy**: Reliability through comprehensive validation and property-based testing.

---

**Forever maintainer**: Michael A. Kuykendall
**Promise**: This will never become a paid product
**Mission**: Making local model inference simple and reliable