# Inferno Examples

This directory contains comprehensive examples for using the Inferno AI/ML inference server.

## 📋 Table of Contents

- [Quick Start](#quick-start)
- [Client Examples](#client-examples)
- [Docker Deployment](#docker-deployment)
- [API Examples](#api-examples)
- [Production Setup](#production-setup)
- [Troubleshooting](#troubleshooting)

## 🚀 Quick Start

### 1. Start Inferno Server

```bash
# Build and run locally
cargo run -- serve --bind 0.0.0.0:8080

# Or using Docker
docker run -p 8080:8080 inferno:latest serve
```

### 2. Test Basic Functionality

```bash
# Health check
curl http://localhost:8080/health

# List models
curl -H "Authorization: Bearer YOUR_API_KEY" \
  http://localhost:8080/models

# Run inference
curl -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-2-7b", "prompt": "Hello!", "max_tokens": 50}' \
  http://localhost:8080/inference
```
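
The same smoke test from Python with `requests`, if you prefer scripting over curl (a minimal sketch; the endpoints and payload match the curl commands above, the printed response shapes are illustrative):

```python
import requests

BASE_URL = "http://localhost:8080"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Health check (no auth header, matching the curl example above)
print(requests.get(f"{BASE_URL}/health", timeout=10).status_code)

# List models
print(requests.get(f"{BASE_URL}/models", headers=HEADERS, timeout=10).json())

# Run inference
payload = {"model": "llama-2-7b", "prompt": "Hello!", "max_tokens": 50}
print(requests.post(f"{BASE_URL}/inference", json=payload,
                    headers=HEADERS, timeout=60).json())
```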

## 👨‍💻 Client Examples

### Python Client (`python_client.py`)

A comprehensive Python client with support for:
- Synchronous and streaming inference
- Batch processing
- WebSocket real-time communication
- OpenAI-compatible API
- Embeddings generation

```bash
# Install dependencies
pip install requests websocket-client sseclient-py

# Run example
python python_client.py
```

**Key Features:**
```python
client = InfernoClient(api_key="your_key")

# Simple inference
response = client.inference("llama-2-7b", "What is AI?", max_tokens=100)

# Streaming
for token in client.stream_inference("llama-2-7b", "Tell a story"):
    print(token, end="")

# Chat completion (OpenAI compatible)
response = client.chat_completion("llama-2-7b", [
    {"role": "user", "content": "Hello!"}
])
```
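
The batch and embeddings features from the list above follow the same pattern; a rough sketch (the method names here are illustrative, not a verbatim excerpt from `python_client.py`):

```python
# Batch processing (method name illustrative)
results = client.batch_inference(
    "llama-2-7b",
    ["What is Python?", "Explain machine learning"],
    max_tokens=100,
)

# Embeddings generation (method name illustrative)
vectors = client.embeddings("llama-2-7b", ["Hello world", "How are you?"])
```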

### JavaScript/Node.js Client (`javascript_client.js`)

Full-featured JavaScript client for both Node.js and browser environments:

```bash
# Install dependencies
npm install node-fetch ws eventsource

# Run example
node javascript_client.js
```

**Key Features:**
```javascript
const client = new InfernoClient('http://localhost:8080', 'your_key');

// Async/await inference
const response = await client.inference('llama-2-7b', 'Hello world');

// WebSocket streaming
const wsClient = new InfernoWebSocketClient();
await wsClient.connect();
wsClient.sendInference('llama-2-7b', 'Tell me a joke');
```

### Rust Client (`rust_client.rs`)

High-performance Rust client with full async support:

```bash
# Install cargo-script
cargo install cargo-script

# Run example
cargo script rust_client.rs
```

**Key Features:**
```rust
let client = InfernoClient::new("http://localhost:8080", Some("your_key".to_string()));

// Async inference
let response = client.inference("llama-2-7b", "Hello", 100, 0.7).await?;

// WebSocket streaming
let ws_client = InfernoWebSocketClient::new("ws://localhost:8080/ws", Some("your_key".to_string()));
ws_client.run_streaming("llama-2-7b", "Tell a story").await?;
```

### Go Client (`go_client.go`)

Efficient Go client with comprehensive API coverage:

```bash
# Initialize module
go mod init inferno-example
go get github.com/gorilla/websocket

# Run example
go run go_client.go
```

**Key Features:**
```go
client := NewClient("http://localhost:8080", "your_key")

// Simple inference
response, err := client.Inference("llama-2-7b", "Hello", 100, 0.7)

// WebSocket
wsClient := NewWebSocketClient("ws://localhost:8080/ws", "your_key")
wsClient.Connect()
wsClient.SendInference("llama-2-7b", "Tell a joke", 50)
```

## 🐳 Docker Deployment

### Complete Stack (`docker-compose.yml`)

Deploy Inferno with a full observability stack:

```bash
# Start complete stack
docker-compose up -d

# View logs
docker-compose logs -f inferno

# Scale Inferno instances
docker-compose up -d --scale inferno=3
```

**Included Services:**
- **Inferno**: Main inference server (port 8080)
- **Prometheus**: Metrics collection (port 9091)
- **Grafana**: Visualization dashboards (port 3000)
- **Jaeger**: Distributed tracing (port 16686)
- **Redis**: Caching layer (port 6379)
- **Nginx**: Load balancer (port 80/443)

### Access Services

```bash
# Inferno API
curl http://localhost:8080/health

# Prometheus metrics
open http://localhost:9091

# Grafana dashboards (admin/admin)
open http://localhost:3000

# Jaeger tracing
open http://localhost:16686
```

## 📡 API Examples

### Basic Inference

```bash
curl -X POST http://localhost:8080/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Explain quantum computing in simple terms",
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

### Streaming Inference

```bash
curl -X POST http://localhost:8080/inference/stream \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "Write a short poem about AI",
    "max_tokens": 100,
    "stream": true
  }'
```
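
One way to consume that stream from Python (a sketch: the `data:` framing follows standard SSE, but the per-event payload, a JSON object with a `token` field plus a final `[DONE]` sentinel, is an assumption to adapt to the server's actual framing):

```python
import json
import requests

resp = requests.post(
    "http://localhost:8080/inference/stream",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Accept": "text/event-stream",
    },
    json={"model": "llama-2-7b", "prompt": "Write a short poem about AI",
          "max_tokens": 100, "stream": True},
    stream=True,
    timeout=300,
)

# Parse SSE lines; the `token` field and `[DONE]` sentinel are assumed
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    data = line[len(b"data: "):]
    if data == b"[DONE]":
        break
    print(json.loads(data).get("token", ""), end="", flush=True)
```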

### OpenAI-Compatible Chat

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
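
Because the endpoint mirrors the OpenAI API, the official `openai` Python SDK can be pointed at Inferno by overriding `base_url`; a sketch (how much of the OpenAI surface is implemented beyond basic chat completions is worth verifying):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local Inferno server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="llama-2-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(resp.choices[0].message.content)
```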

### Batch Processing

```bash
# Submit batch
curl -X POST http://localhost:8080/batch \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "requests": [
      {"id": "1", "prompt": "What is Python?"},
      {"id": "2", "prompt": "Explain machine learning"}
    ],
    "max_tokens": 100
  }'

# Check status
curl http://localhost:8080/batch/BATCH_ID \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get results
curl http://localhost:8080/batch/BATCH_ID/results \
  -H "Authorization: Bearer YOUR_API_KEY"
```

### Embeddings

```bash
curl -X POST http://localhost:8080/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "input": ["Hello world", "How are you?"],
    "encoding_format": "float"
  }'
```
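
To do something with the returned vectors, here is a sketch that compares the two inputs by cosine similarity (the OpenAI-style response shape, `data[i]["embedding"]`, is an assumption):

```python
import math
import requests

resp = requests.post(
    "http://localhost:8080/embeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"model": "llama-2-7b", "input": ["Hello world", "How are you?"],
          "encoding_format": "float"},
    timeout=60,
).json()

# Assumed response shape: {"data": [{"embedding": [...]}, ...]}
a, b = (item["embedding"] for item in resp["data"])

dot = sum(x * y for x, y in zip(a, b))
norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
print(f"cosine similarity: {dot / norm:.3f}")
```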

## 🏭 Production Setup

### Environment Configuration

```bash
# Create environment file
cat > .env << EOF
INFERNO_LOG_LEVEL=info
INFERNO_LOG_FORMAT=json
INFERNO_MODELS_DIR=/data/models
INFERNO_CACHE_DIR=/data/cache
INFERNO_BIND_ADDRESS=0.0.0.0
INFERNO_PORT=8080
INFERNO_MAX_CONCURRENT_REQUESTS=100
INFERNO_REQUEST_TIMEOUT_SECONDS=300

# Security
INFERNO_AUTH_ENABLED=true
INFERNO_JWT_SECRET=your-secret-key
INFERNO_API_KEY_ENABLED=true
INFERNO_RATE_LIMITING_ENABLED=true
INFERNO_MAX_REQUESTS_PER_MINUTE=1000

# Observability
INFERNO_PROMETHEUS_ENABLED=true
INFERNO_PROMETHEUS_PORT=9090
INFERNO_OTEL_ENABLED=true
INFERNO_OTEL_ENDPOINT=http://jaeger:14268/api/traces
INFERNO_GRAFANA_ENABLED=true
INFERNO_GRAFANA_ENDPOINT=http://grafana:3000
EOF
```

### SSL/TLS Configuration

```bash
# Generate self-signed certificates (for testing)
mkdir -p ssl
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout ssl/inferno.key \
  -out ssl/inferno.crt \
  -subj "/C=US/ST=CA/L=SF/O=Inferno/CN=localhost"
```

### Load Balancing

```nginx
# nginx.conf
upstream inferno_backend {
    least_conn;
    server inferno:8080 max_fails=3 fail_timeout=30s;
    server inferno_2:8080 max_fails=3 fail_timeout=30s;
    server inferno_3:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    location / {
        proxy_pass http://inferno_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_connect_timeout 30s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }

    # WebSocket streaming (ws://.../ws) needs HTTP/1.1 and Upgrade headers
    location /ws {
        proxy_pass http://inferno_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 300s;
    }
}
```

### Monitoring Setup

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'inferno'
    static_configs:
      - targets: ['inferno:9090']
    scrape_interval: 15s
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
```

### Health Checks

```bash
#!/bin/bash
# health-check.sh

INFERNO_URL="http://localhost:8080"
API_KEY="your_api_key"

# Health endpoint
health_status=$(curl -s -o /dev/null -w "%{http_code}" "$INFERNO_URL/health")
if [ "$health_status" != "200" ]; then
    echo "Health check failed: $health_status"
    exit 1
fi

# Model availability
models_count=$(curl -s -H "Authorization: Bearer $API_KEY" \
  "$INFERNO_URL/models" | jq '.models | length')
if [ "$models_count" -eq 0 ]; then
    echo "No models available"
    exit 1
fi

echo "Health check passed"
```

## 🔧 Troubleshooting

### Common Issues

#### 1. Model Not Loading

```bash
# Check model file
ls -la /path/to/models/

# Check logs
docker logs inferno-server

# Validate model
inferno validate /path/to/model.gguf
```

#### 2. Memory Issues

```bash
# Check memory usage
docker stats inferno-server

# Reduce model size or use GPU offloading
curl -X POST http://localhost:8080/models/llama-2-7b/load \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"gpu_layers": 32, "context_size": 2048}'
```

#### 3. Performance Issues

```bash
# Check metrics
curl http://localhost:8080/metrics

# Monitor in Grafana
open http://localhost:3000

# Enable GPU acceleration
inferno gpu status
inferno gpu enable
```

#### 4. API Authentication

```bash
# Generate new API key
inferno security api-key create --user admin --name production

# Test authentication
curl -H "Authorization: Bearer NEW_API_KEY" \
  http://localhost:8080/models
```

#### 5. Network Issues

```bash
# Check Docker networks
docker network ls
docker network inspect inferno_inferno-network

# Test internal connectivity
docker exec inferno-server curl http://prometheus:9090/-/healthy
```

### Debug Mode

```bash
# Run with debug logging
INFERNO_LOG_LEVEL=debug cargo run -- serve

# Enable trace logging
INFERNO_LOG_LEVEL=trace RUST_BACKTRACE=1 cargo run -- serve

# Monitor WebSocket connections
wscat -c ws://localhost:8080/ws
```

### Performance Tuning

```bash
# CPU optimization
export RAYON_NUM_THREADS=8

# Memory optimization
export INFERNO_CACHE_SIZE=4GB
export INFERNO_MODEL_MEMORY_POOL=8GB

# GPU optimization
export CUDA_VISIBLE_DEVICES=0,1
export INFERNO_GPU_MEMORY_FRACTION=0.8
```

## 📚 Additional Resources

- [Full API Documentation](../API.md)
- [Configuration Guide](../CLAUDE.md)
- [Security Best Practices](../SECURITY.md)
- [Performance Optimization](../PERFORMANCE.md)
- [Contributing Guide](../CONTRIBUTING.md)

## 💬 Support

- **GitHub Issues**: [Report bugs or request features](https://github.com/inferno-ai/inferno/issues)
- **Discussions**: [Community discussions](https://github.com/inferno-ai/inferno/discussions)
- **Enterprise**: Contact the maintainers for enterprise installation assistance; information and pricing are available on request

## 📄 License

This project is licensed under the MIT OR Apache-2.0 license. See the [LICENSE](../LICENSE) file for details.