🔥 Inferno - Your Personal AI Infrastructure
Run any AI model locally with enterprise-grade performance and privacy
Inferno is a production-ready AI inference server that runs entirely on your hardware. Think of it as your private ChatGPT that works offline, supports any model format, and gives you complete control over your AI infrastructure.
🎯 Why Inferno?
🔒 Privacy First
- 100% Local: All processing happens on your hardware
- No Cloud Dependency: Works completely offline
- Your Data Stays Yours: Zero telemetry or external data transmission
🚀 Universal Model Support
- GGUF Models: Native support for Llama, Mistral, CodeLlama, and more
- ONNX Models: Run models from PyTorch, TensorFlow, scikit-learn
- Format Conversion: Convert between GGUF ↔ ONNX ↔ PyTorch ↔ SafeTensors
- Auto-Optimization: Automatic quantization and hardware optimization
⚡ Enterprise Performance
- GPU Acceleration: Metal (Apple Silicon, up to 13x speedup), with NVIDIA, AMD, and Intel support
- Smart Caching: Remember previous responses for instant results
- Batch Processing: Handle thousands of requests efficiently
- Load Balancing: Distribute work across multiple models/GPUs
🔧 Developer Friendly
- OpenAI-Compatible API: Drop-in replacement for the ChatGPT API (see the example after this list)
- REST & WebSocket: Standard APIs plus real-time streaming
- Multiple Languages: Python, JavaScript, Rust, cURL examples
- Docker Ready: One-command deployment
- Smart CLI: Typo detection, helpful error messages, setup guidance
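Because the API is OpenAI-compatible, existing ChatGPT client libraries can simply point at a local Inferno server. A minimal sketch, assuming the server listens on port 8080 (as in the Docker examples below) and serves the standard `/v1/chat/completions` route; the model name is a placeholder:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3-8b",
        "messages": [{"role": "user", "content": "Hello, Inferno!"}]
      }'
```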
📦 Installation
Choose your preferred installation method:
🍎 macOS
Desktop App (NEW in v0.5.0) - Recommended for macOS users
A native macOS application with Metal GPU capability detection, optimized for Apple Silicon (M1/M2/M3/M4).
- Visit Releases
- Download `Inferno.dmg` (universal binary for Intel & Apple Silicon)
- Open the DMG and drag Inferno to Applications
- Launch from the Applications folder
Features:
- 🎨 Native macOS UI with vibrancy effects
- 🔔 System tray integration with live metrics
- ⚡ Metal GPU acceleration with up to 13x speedup
- 🍎 Apple Silicon optimization (M1/M2/M3/M4 detection)
- 🔄 Automatic model downloads and updates
- 📊 Real-time performance monitoring with GPU metrics
- 🔐 Built-in security and API key management
- 🧠 Neural Engine detection for AI workloads
Build from source:

```bash
# Clone and build
git clone https://github.com/ringo380/inferno.git
cd inferno
cargo build --release

# Development mode with hot reload (Tauri dev command assumed; adjust to the repo's setup)
npm install && npm run tauri dev
```
CLI Tools (for automation and scripting)
Homebrew
```bash
# Add tap and install (tap name assumed; check the repo for the exact tap)
brew tap ringo380/inferno && brew install inferno
# Or directly
brew install ringo380/inferno/inferno
# Start as service
brew services start inferno
```
Quick Install Script

```bash
# Pipe the install script into your shell (replace <install-script-url> with the URL from the Releases page)
curl -fsSL <install-script-url> | bash
```
🐳 Docker
GitHub Container Registry
```bash
# Pull the latest image
docker pull ghcr.io/ringo380/inferno:latest
# Run with GPU support (NVIDIA container runtime assumed)
docker run --gpus all -p 8080:8080 ghcr.io/ringo380/inferno:latest
# With a custom models directory
docker run --gpus all -p 8080:8080 -v ./models:/home/inferno/.inferno/models ghcr.io/ringo380/inferno:latest
```
Docker Compose
```yaml
version: '3.8'
services:
  inferno:
    image: ghcr.io/ringo380/inferno:latest
    ports:
      - "8080:8080"
    volumes:
      - ./models:/home/inferno/.inferno/models
      - ./config:/home/inferno/.inferno/config
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
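With this file saved as `docker-compose.yml`, the service comes up in one command:

```bash
# Start Inferno in the background; models and config persist in ./models and ./config
docker compose up -d
```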
📦 Package Managers
Cargo (Rust)
```bash
# From crates.io
cargo install inferno
# From GitHub Packages (registry setup not shown; installing from the Git repository is one alternative)
cargo install --git https://github.com/ringo380/inferno inferno
```
NPM (Desktop App)
```bash
# From GitHub Packages (scoped package name assumed)
npm install -g @ringo380/inferno
# From the npm registry (package name assumed)
npm install -g inferno
```
🐧 Linux
Binary Download
```bash
# Download for your architecture (asset names assumed; check the Releases page)
curl -LO https://github.com/ringo380/inferno/releases/latest/download/inferno-linux-x86_64
# or
curl -LO https://github.com/ringo380/inferno/releases/latest/download/inferno-linux-aarch64
# Make executable and move to PATH
chmod +x inferno-linux-x86_64 && sudo mv inferno-linux-x86_64 /usr/local/bin/inferno
```
🪟 Windows
Binary Download
- Download `inferno-windows-x86_64.exe` from Releases
- Add it to your PATH or run it directly
Via Cargo
```bash
cargo install inferno
```
🔨 Build from Source
```bash
# Clone the repository
git clone https://github.com/ringo380/inferno.git && cd inferno
# Build release binary
cargo build --release
# Install globally (optional)
cargo install --path .
# Build desktop app (optional; directory and Tauri commands assumed)
cd desktop && npm install && npm run tauri build
```
⬆️ Upgrading
Automatic Updates (Built-in)
The desktop app checks for and installs updates on its own (see the feature list above). CLI installs upgrade through your package manager:
Package Managers
```bash
# Homebrew
brew upgrade inferno
# Docker
docker pull ghcr.io/ringo380/inferno:latest
# Cargo
cargo install inferno --force
# NPM (package name assumed)
npm update -g inferno
```
Note: DMG and installer packages automatically detect existing installations and preserve your settings during upgrade.
🔐 Verify Installation
```bash
# Check version
inferno --version
# Verify GPU support (subcommand assumed; run `inferno --help` for the exact name)
inferno gpu info
# Run health check (subcommand assumed)
inferno health
```
🚀 Quick Start
```bash
# List available models (subcommand names assumed; run `inferno --help` to confirm)
inferno models list
# Run inference
inferno run --model <model-name> --prompt "Hello, Inferno!"
# Start HTTP API server
inferno serve
# Launch terminal UI
inferno tui
# Launch desktop app (if installed from DMG)
open -a Inferno
```
✨ Key Features
🧠 AI Backends
- ✅ Real GGUF Support: Full llama.cpp integration
- ✅ Real ONNX Support: Production ONNX Runtime with GPU acceleration
- ✅ Model Conversion: Real-time format conversion with optimization
- ✅ Quantization: Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, F16, F32 support (see the sketch after this list)
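As referenced above, a conversion plus quantization run might look like the following sketch. The `convert` subcommand and its flags are illustrative assumptions; run `inferno --help` for the actual interface:

```bash
# Hypothetical invocation: convert a model and quantize it to 4-bit in one pass
inferno convert --input llama.gguf --output llama-q4.gguf --quantization Q4_0
```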
🏢 Enterprise Features
- ✅ Authentication: JWT tokens, API keys, role-based access
- ✅ Monitoring: Prometheus metrics, OpenTelemetry tracing
- ✅ Audit Logging: Encrypted logs with multi-channel alerting
- ✅ Batch Processing: Cron scheduling, retry logic, job dependencies
- ✅ Caching: Multi-tier caching with compression and persistence
- ✅ Load Balancing: Distribute inference across multiple backends
🔌 APIs & Integration
- ✅ OpenAI Compatible: Use existing ChatGPT client libraries
- ✅ REST API: Standard HTTP endpoints for all operations (streaming example after this list)
- ✅ WebSocket: Real-time streaming and bidirectional communication
- ✅ CLI Interface: 40+ commands for all AI/ML operations
- ✅ Desktop App: Cross-platform Tauri application
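Streaming responses are available through the same OpenAI-style HTTP API. A sketch assuming the standard `stream` flag on `/v1/chat/completions`:

```bash
# -N turns off curl's buffering so tokens print as the server emits them
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "messages": [{"role": "user", "content": "Write a haiku"}], "stream": true}'
```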
🏗️ Architecture
Built with a modular, trait-based architecture supporting pluggable backends:
```text
src/
├── main.rs        # CLI entry point
├── lib.rs         # Library exports
├── config.rs      # Configuration management
├── backends/      # AI model execution backends
├── cli/           # 40+ CLI command modules
├── api/           # HTTP/WebSocket APIs
├── batch/         # Batch processing system
├── models/        # Model discovery and metadata
└── [Enterprise]   # Advanced production features
```
🔧 Configuration
Create inferno.toml:
```toml
# Basic settings
# NOTE: key names below are illustrative assumptions; check the project docs for the exact schema
models_dir = "/path/to/models"
log_level = "info"

[server]
bind_address = "0.0.0.0"
port = 8080

[backend_config.gguf]
gpu_enabled = true
context_size = 4096
batch_size = 64

[cache]
enabled = true
compression = "zstd"
max_size_gb = 10
```
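Once the file exists, point the server at it on startup. The `--config` flag is an assumption here; confirm the exact flag with `inferno --help`:

```bash
inferno --config inferno.toml serve
```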
🛠️ Development
See CLAUDE.md for comprehensive development documentation.
```bash
# Run tests
cargo test
# Format code
cargo fmt
# Run linter
cargo clippy
# Full verification
cargo fmt --check && cargo clippy && cargo test
```
📄 License
Licensed under either of:
- Apache License, Version 2.0
- MIT License
🔥 Ready to take control of your AI infrastructure? 🔥
Built with ❤️ by the open source community