shimmy 1.1.0

Lightweight 5MB Ollama alternative for local AI inference. Fast startup, reliable inference engine.
# API Reference


Shimmy exposes three interfaces for local LLM inference: an HTTP REST API, a WebSocket API, and a command-line interface.

## HTTP REST API


### Generate Text


**Endpoint:** `POST /api/generate`

**Request Body:**
```json
{
  "model": "string",      // Model name (required)
  "prompt": "string",     // Input prompt (required)
  "max_tokens": 100,      // Maximum tokens to generate (optional, default: 100)
  "temperature": 0.7,     // Sampling temperature (optional, default: 0.7)
  "stream": false         // Enable streaming response (optional, default: false)
}
```
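As an illustration, the request body above could be built and sent from Python using only the standard library. This is a minimal sketch, not an official client: the base URL `http://127.0.0.1:11435` is an assumption taken from the bind address used elsewhere in this reference, and the helper names `build_generate_request` and `post_generate` are hypothetical.

```python
import json
import urllib.request

# Assumed base URL; adjust to wherever your shimmy server is listening.
BASE_URL = "http://127.0.0.1:11435"


def build_generate_request(model, prompt, max_tokens=100, temperature=0.7, stream=False):
    """Return the JSON body for POST /api/generate as a dict.

    The defaults mirror the documented defaults of the optional fields.
    """
    if not model or not prompt:
        raise ValueError("'model' and 'prompt' are required")
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }


def post_generate(body, base_url=BASE_URL, timeout=60):
    """POST the body to /api/generate and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `post_generate(build_generate_request("default", "Hello"))` would return the non-streaming response as a dict.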

**Non-Streaming Response:**
```json
{
  "choices": [
    {
      "text": "Generated text response",
      "index": 0,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 20,
    "total_tokens": 30
  }
}
```
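A client would typically read the generated text from the first entry in `choices`. The sketch below shows one way to do that in Python. The helper names are hypothetical, and treating `finish_reason: "length"` as "generation stopped at `max_tokens`" is an assumption based on common convention, not something this reference states.

```python
def first_completion(response):
    """Return (text, finish_reason) for the first entry in 'choices'."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    first = choices[0]
    return first["text"], first.get("finish_reason")


def was_truncated(response):
    """Assumption: finish_reason == 'length' means the token limit was hit."""
    _, finish_reason = first_completion(response)
    return finish_reason == "length"


# Example using the response shape shown above.
sample = {
    "choices": [{"text": "Generated text response", "index": 0, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30},
}
text, reason = first_completion(sample)
```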

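**Streaming Response:**
When `stream` is `true`, the response is delivered as Server-Sent Events. Each `data:` line carries one chunk of generated text, and the stream ends with `data: [DONE]`: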
```
data: {"choices":[{"text":"Hello","index":0}]}

data: {"choices":[{"text":" world","index":0}]}

data: [DONE]
```
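The stream format above can be consumed line by line. The following Python sketch parses `data:` lines, stops at `[DONE]`, and yields each text fragment. The function name `iter_stream_text` is hypothetical, and the sketch assumes every chunk is a single-line JSON object, as in the example.

```python
import json


def iter_stream_text(lines):
    """Yield each text fragment from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            yield choice.get("text", "")


# Example using the chunks shown above.
sample_lines = [
    'data: {"choices":[{"text":"Hello","index":0}]}',
    "",
    'data: {"choices":[{"text":" world","index":0}]}',
    "",
    "data: [DONE]",
]
full_text = "".join(iter_stream_text(sample_lines))
```

Because the object returned by `urllib.request.urlopen` can be iterated line by line as bytes, it can be passed to `iter_stream_text` directly.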

### List Models


**Endpoint:** `GET /api/models`

**Response:**
```json
{
  "models": [
    {
      "id": "default",
      "name": "Default Model",
      "description": "Base GGUF model"
    }
  ]
}
```
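A client might use this listing to confirm that a model exists before sending a generate request. A minimal Python sketch, with hypothetical helper names, operating on the decoded response:

```python
def model_ids(listing):
    """Return the 'id' of every model in a GET /api/models response."""
    return [model["id"] for model in listing.get("models", [])]


def has_model(listing, model_id):
    """True if the listing contains a model with the given id."""
    return model_id in model_ids(listing)


# Example using the response shape shown above.
sample_listing = {
    "models": [
        {"id": "default", "name": "Default Model", "description": "Base GGUF model"}
    ]
}
available = model_ids(sample_listing)
```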

### Health Check


**Endpoint:** `GET /api/health`

**Response:**
```json
{
  "status": "healthy",
  "models_loaded": 1,
  "memory_usage": "2.1GB"
}
```
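The health endpoint is useful for waiting until the server is ready after startup. The Python sketch below polls a caller-supplied `fetch` function (hypothetical; it should return the decoded health response as a dict) until the status is `"healthy"`. Also requiring `models_loaded` to be at least 1 is an assumption about what "ready" means, not documented behavior.

```python
import time


def is_healthy(health):
    """Assumption: ready means status is 'healthy' and a model is loaded."""
    return health.get("status") == "healthy" and health.get("models_loaded", 0) >= 1


def wait_until_healthy(fetch, attempts=10, delay=0.5):
    """Call fetch() up to 'attempts' times; return True once it reports healthy."""
    for _ in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            pass  # connection refused or timed out; the server may still be starting
        time.sleep(delay)
    return False
```

`urllib.error.URLError` is a subclass of `OSError`, so a `fetch` built on `urllib.request.urlopen` is covered by the `except` clause.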

## WebSocket API


**Endpoint:** `ws://localhost:11435/ws/generate`

### Connect and Send

```json
{
  "model": "default",
  "prompt": "Hello world",
  "max_tokens": 50,
  "temperature": 0.7
}
```

### Receive Tokens

```json
{"token": "Hello"}
{"token": " world"}
{"done": true}
```
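The token messages can be assembled into the full output on the client side. The Python sketch below handles only the message handling and leaves the WebSocket transport to whichever library you use. It assumes each received frame is one JSON object, as in the example above. The function name `collect_tokens` is hypothetical.

```python
import json


def collect_tokens(messages):
    """Concatenate 'token' values from JSON messages until {"done": true}."""
    parts = []
    for message in messages:
        event = json.loads(message)
        if event.get("done"):
            break  # server signalled the end of generation
        if "token" in event:
            parts.append(event["token"])
    return "".join(parts)


# Example using the messages shown above.
received = ['{"token": "Hello"}', '{"token": " world"}', '{"done": true}']
output = collect_tokens(received)
```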

## CLI Interface


### Commands


```bash
# Start server
shimmy serve --bind 127.0.0.1:11435 --port 11435

# Generate text
shimmy generate --prompt "Hello" --max-tokens 50 --temperature 0.7

# List available models
shimmy list

# Probe model loading
shimmy probe [model-name]

# Show diagnostics
shimmy diag
```

### Global Options


- `--verbose, -v`: Enable verbose logging
- `--help, -h`: Show help information
- `--version, -V`: Show version information

## Error Responses


All endpoints report errors in the same format:

```json
{
  "error": {
    "code": "model_not_found",
    "message": "The specified model was not found",
    "details": "Model 'invalid-model' is not available"
  }
}
```

Common error codes:
- `model_not_found`: Requested model is not available
- `invalid_request`: Request format is invalid
- `generation_failed`: Text generation failed
- `server_error`: Internal server error
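Because the error format is uniform, a client can check every decoded response in one place. The Python sketch below converts an error payload into an exception. The `ShimmyError` class and `raise_for_error` function are hypothetical names introduced for illustration, and the `"unknown"` fallback code is not one of the documented error codes.

```python
class ShimmyError(Exception):
    """Carries the code, message, and details from an error response."""

    def __init__(self, code, message, details=None):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.details = details


def raise_for_error(payload):
    """Return the payload unchanged, or raise ShimmyError if it holds an 'error'."""
    error = payload.get("error") if isinstance(payload, dict) else None
    if error is None:
        return payload
    raise ShimmyError(
        error.get("code", "unknown"),  # fallback label, not a documented code
        error.get("message", ""),
        error.get("details"),
    )


# Example using the error shape shown above.
sample_error = {
    "error": {
        "code": "model_not_found",
        "message": "The specified model was not found",
        "details": "Model 'invalid-model' is not available",
    }
}
try:
    raise_for_error(sample_error)
    caught = None
except ShimmyError as exc:
    caught = exc
```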

## Rate Limiting


Shimmy does not currently implement rate limiting. For production use, consider placing it behind a reverse proxy that enforces rate limits.