# Nexus

One API endpoint. Any backend. Zero configuration.

Nexus is a distributed LLM serving orchestrator that unifies heterogeneous inference backends behind a single, intelligent API gateway.
## Features

- Auto-Discovery: automatically finds LLM backends on your network via mDNS
- Intelligent Routing: routes requests based on model capabilities and load
- Transparent Failover: automatically retries with fallback backends
- OpenAI-Compatible: works with any OpenAI API client
- Zero Config: just run it; works out of the box with Ollama
- Structured Logging: queryable JSON logs for every request, with correlation IDs
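The transparent failover above can be sketched as a simple retry loop. This is an illustrative Python sketch, not Nexus's actual implementation; the `FlakyBackend` stub and the error-handling details are assumptions:

```python
class FlakyBackend:
    """Stand-in for an inference backend that may be unreachable."""

    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail

    def complete(self, request):
        if self.fail:
            raise ConnectionError(f"{self.name} unreachable")
        return f"{self.name}: response to {request}"


def complete_with_failover(backends, request):
    """Try each backend in order, falling back to the next on failure."""
    errors = []
    for backend in backends:
        try:
            return backend.complete(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure, try the next backend
    raise RuntimeError(f"all backends failed: {errors}")


print(complete_with_failover(
    [FlakyBackend("ollama", fail=True), FlakyBackend("vllm")], "hello"))
# vllm: response to hello
```

The request succeeds as long as at least one backend in the chain is reachable; only when every backend fails does the caller see an error.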
## Supported Backends

| Backend | Status | Notes |
|---|---|---|
| Ollama | ✅ Supported | Auto-discovery via mDNS |
| LM Studio | ✅ Supported | OpenAI-compatible API |
| vLLM | ✅ Supported | Static configuration |
| llama.cpp server | ✅ Supported | Static configuration |
| exo | ✅ Supported | Auto-discovery via mDNS |
| OpenAI | ✅ Supported | Cloud fallback |
| LocalAI | 🚧 Planned | |
## Quick Start

### From Source

```sh
# Subcommand and flag names below are illustrative; run `nexus --help` for the real ones.

# Install
cargo install --path .

# Generate a configuration file
nexus init > nexus.toml

# Run with auto-discovery
nexus

# Or with a custom config file
nexus --config nexus.toml
```
### Docker

```sh
# <image> and container paths are placeholders; substitute the published ones.

# Run with default settings
docker run -p 8000:8000 <image>

# Run with custom config
docker run -p 8000:8000 -v $(pwd)/nexus.toml:/nexus.toml <image>

# Run with host network (for mDNS discovery)
docker run --network host <image>
```
### From GitHub Releases

Download pre-built binaries from Releases.
## CLI Commands

```sh
# Subcommand names below are illustrative; run `nexus --help` for the real ones.

# Start the server
nexus serve

# List backends
nexus backend list

# Add a backend manually (auto-detects type)
nexus backend add http://localhost:11434

# Remove a backend
nexus backend remove local-ollama

# List available models
nexus models

# Show system health
nexus health

# Generate config file
nexus init > nexus.toml

# Generate shell completions
nexus completions bash
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `NEXUS_CONFIG` | Config file path | `nexus.toml` |
| `NEXUS_PORT` | Listen port | `8000` |
| `NEXUS_HOST` | Listen address | `0.0.0.0` |
| `NEXUS_LOG_LEVEL` | Log level (`trace`/`debug`/`info`/`warn`/`error`) | `info` |
| `NEXUS_LOG_FORMAT` | Log format (`pretty`/`json`) | `pretty` |
| `NEXUS_DISCOVERY` | Enable mDNS discovery | `true` |
| `NEXUS_HEALTH_CHECK` | Enable health checking | `true` |
Precedence: CLI args > Environment variables > Config file > Defaults
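That precedence chain amounts to a layered merge where later layers override earlier ones. A minimal Python sketch (key names and the flat-dict model are assumptions for illustration, not Nexus's internals):

```python
def resolve_settings(defaults, config_file, env, cli_args):
    """Merge settings layers; later (higher-precedence) layers win."""
    merged = {}
    # Apply lowest precedence first so each later layer overwrites it.
    for layer in (defaults, config_file, env, cli_args):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


settings = resolve_settings(
    defaults={"port": 8000, "host": "0.0.0.0", "log_level": "info"},
    config_file={"port": 9000},      # nexus.toml sets a custom port
    env={"log_level": "debug"},      # NEXUS_LOG_LEVEL=debug
    cli_args={"port": 8080},         # --port 8080 wins over everything
)
print(settings)  # {'port': 8080, 'host': '0.0.0.0', 'log_level': 'debug'}
```

Each source only overrides the keys it actually sets, so the CLI's `port` wins while the environment's `log_level` still applies.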
## API Usage

Once running, Nexus exposes an OpenAI-compatible API:

```sh
# Health check (path assumed)
curl http://localhost:8000/health

# List available models
curl http://localhost:8000/v1/models

# Chat completion (non-streaming; model name depends on your backends)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'

# Chat completion (streaming)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```
## Web Dashboard

Nexus includes a web dashboard for real-time monitoring and observability. Access it at http://localhost:8000/ in your browser.

Features:

- Real-time backend health monitoring with status indicators
- Model availability matrix showing which models are available on which backends
- Request history with the last 100 requests, durations, and error details
- WebSocket-based live updates (with HTTP polling fallback)
- Fully responsive: works on desktop, tablet, and mobile
- Dark mode support (follows system preference)
- Works without JavaScript (graceful degradation with auto-refresh)

The dashboard provides a visual overview of your Nexus cluster, making it easy to monitor backend health, track model availability, and debug request issues in real time.
### With Claude Code / Continue.dev

Point your AI coding assistant to http://localhost:8000 as the API endpoint.
### With OpenAI SDK

```python
from openai import OpenAI

# Point the client at Nexus; the API key is unused but must be non-empty.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

# Model name depends on what your backends serve.
response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```
## Observability

Nexus exposes metrics for monitoring and debugging:

```sh
# Prometheus metrics (for Grafana, Prometheus, etc.)
curl http://localhost:8000/metrics

# JSON stats (for dashboards and debugging; path assumed)
curl http://localhost:8000/stats
```

Prometheus metrics include request counters, duration histograms, error rates, backend latency, token usage, and fleet state gauges. Configure your Prometheus scraper to target http://<nexus-host>:8000/metrics.

JSON stats provide an at-a-glance view with uptime, per-backend request counts, latency, and pending request depth.
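To scrape the metrics endpoint, a minimal Prometheus job might look like this (the job name and target are placeholders for your deployment):

```yaml
scrape_configs:
  - job_name: "nexus"
    static_configs:
      - targets: ["localhost:8000"]
    # metrics_path defaults to /metrics, which is what Nexus serves
```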
## Configuration

```toml
# nexus.toml
# Section and key names below are reconstructed for illustration;
# check a generated config file for the authoritative names.

[server]
host = "0.0.0.0"
port = 8000

[discovery]
enabled = true

[[backends]]
name = "local-ollama"
url = "http://localhost:11434"
type = "ollama"
priority = 1

[[backends]]
name = "gpu-server"
url = "http://192.168.1.100:8000"
type = "vllm"
priority = 2
```
## Architecture

```
┌─────────────────────────────────────────────┐
│             Nexus Orchestrator              │
│  - Discovers backends via mDNS              │
│  - Tracks model capabilities                │
│  - Routes to best available backend        │
│  - OpenAI-compatible API                    │
└─────────────────────────────────────────────┘
        │             │             │
        ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │ Ollama  │   │  vLLM   │   │   exo   │
   │   7B    │   │   70B   │   │   32B   │
   └─────────┘   └─────────┘   └─────────┘
```
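The routing step in the diagram can be sketched as picking the healthiest, best-priority backend that serves the requested model. This is an illustrative Python sketch; the field names and the lower-number-wins priority rule are assumptions based on the configuration example, not Nexus's actual algorithm:

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    priority: int                  # lower value = preferred (assumed)
    healthy: bool = True
    models: set = field(default_factory=set)


def route(backends, model):
    """Pick the healthy backend serving `model` with the best priority."""
    candidates = [b for b in backends if b.healthy and model in b.models]
    if not candidates:
        raise LookupError(f"no healthy backend serves {model!r}")
    return min(candidates, key=lambda b: b.priority)


fleet = [
    Backend("local-ollama", priority=1, models={"llama3:7b"}),
    Backend("gpu-server", priority=2, models={"llama3:7b", "llama3:70b"}),
]
print(route(fleet, "llama3:70b").name)  # gpu-server
print(route(fleet, "llama3:7b").name)   # local-ollama
```

Only the GPU server can serve the 70B model, so it is chosen despite its lower priority; for the 7B model both qualify and the preferred backend wins.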
## Development

```sh
# Build
cargo build

# Run tests
cargo test

# Run with logging
RUST_LOG=debug cargo run

# Check formatting
cargo fmt --check

# Lint
cargo clippy
```
## License

Apache License 2.0 - see LICENSE for details.