# hoosh

AI inference gateway for Rust.
Multi-provider LLM routing, local model serving, speech-to-text, and token budget management — in a single crate. OpenAI-compatible HTTP API. Built on ai-hwaccel for hardware-aware model placement.
**Name:** Hoosh (Persian: هوش) means "intelligence" and is the everyday Persian word for AI. Extracted from the AGNOS LLM gateway as a standalone, reusable engine.
## What it does
hoosh is the inference backend — it routes, caches, rate-limits, and budget-tracks LLM requests across providers. It is not a model trainer (that's Synapse) or a model file manager. Applications build their AI features on top of hoosh.
| Capability | Details |
|---|---|
| 14 LLM providers | Ollama, llama.cpp, Synapse, LM Studio, LocalAI, OpenAI, Anthropic, DeepSeek, Mistral, Google, Groq, Grok, OpenRouter, Whisper |
| OpenAI-compatible API | `/v1/chat/completions`, `/v1/models`, `/v1/embeddings` — streaming SSE |
| Provider routing | Priority, round-robin, lowest-latency (EMA), direct — with model pattern matching |
| Authentication | Bearer token auth middleware with constant-time comparison |
| Rate limiting | Per-provider sliding window RPM limits |
| Token budgets | Per-agent named pools with reserve/commit/release lifecycle |
| Cost tracking | Per-provider/model cost accumulation with static pricing table |
| Observability | Prometheus `/metrics`, OpenTelemetry (feature-gated), cryptographic audit log |
| Health checks | Background periodic checks, automatic failover, heartbeat tracking (majra) |
| Response caching | Thread-safe DashMap cache with TTL eviction |
| Request queuing | Priority queue for inference requests (majra) |
| Event bus | Pub/sub for provider health changes, inference events (majra) |
| Hot-reload | SIGHUP or `POST /v1/admin/reload` — no restart required |
| TLS security | Certificate pinning for remote providers, mTLS for local |
| Speech | whisper.cpp STT + TTS via HTTP backend (feature-gated) |
| Hardware-aware | ai-hwaccel detects GPUs/TPUs/NPUs for model placement |
| Local-first | Prefers on-device inference; remote APIs as fallback |
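The per-provider sliding-window RPM limiting listed above can be sketched in a few lines. This is a self-contained illustration of the technique, not hoosh's actual implementation:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window rate limiter: allow at most `max_rpm` requests
/// in any trailing 60-second window.
struct SlidingWindow {
    max_rpm: usize,
    window: Duration,
    timestamps: VecDeque<Instant>,
}

impl SlidingWindow {
    fn new(max_rpm: usize) -> Self {
        Self { max_rpm, window: Duration::from_secs(60), timestamps: VecDeque::new() }
    }

    /// Returns true (and records the request) if it fits in the window.
    fn try_acquire(&mut self, now: Instant) -> bool {
        // Drop timestamps that have fallen out of the trailing window.
        while let Some(&t) = self.timestamps.front() {
            if now.duration_since(t) >= self.window {
                self.timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.timestamps.len() < self.max_rpm {
            self.timestamps.push_back(now);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limiter = SlidingWindow::new(2);
    let t0 = Instant::now();
    assert!(limiter.try_acquire(t0));
    assert!(limiter.try_acquire(t0));
    assert!(!limiter.try_acquire(t0)); // third request in the window is rejected
    // 61 seconds later the window has rolled over.
    assert!(limiter.try_acquire(t0 + Duration::from_secs(61)));
    println!("ok");
}
```

A timestamp deque like this is exact but holds one entry per request; real gateways often trade that for fixed-size counters per time bucket when limits are large.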
## Architecture
```
Clients (tarang, daimon, agnoshi, consumer apps)
        │
        ▼
Auth ──▶ Rate Limiter ──▶ Router (priority, round-robin, lowest-latency)
                              │
        ┌─────────────────────┤
        │                     │
        ▼                     ▼
Local backends           Remote APIs (TLS pinned / mTLS)
(Ollama, llama.cpp, …)   (OpenAI, Anthropic, DeepSeek, …)
        │                     │
        └──────────┬──────────┘
                   ▼
     Cache ◀── Budget ◀── Cost Tracker
                   │
   Metrics ◀── Audit Log ◀── Event Bus (majra)
```
See `docs/architecture/overview.md` for the full architecture document.
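The lowest-latency routing strategy relies on an exponential moving average (EMA) of observed per-provider latency. A minimal self-contained sketch of the idea — smoothing factor, names, and tie-breaking are illustrative, not hoosh's internals:

```rust
/// Exponential moving average of request latencies (milliseconds):
/// new = alpha * sample + (1 - alpha) * old
struct LatencyEma {
    alpha: f64,
    value: Option<f64>,
}

impl LatencyEma {
    fn new(alpha: f64) -> Self {
        Self { alpha, value: None }
    }

    fn record(&mut self, sample_ms: f64) {
        self.value = Some(match self.value {
            None => sample_ms, // first sample seeds the average
            Some(v) => self.alpha * sample_ms + (1.0 - self.alpha) * v,
        });
    }

    fn get(&self) -> Option<f64> {
        self.value
    }
}

/// Pick the provider index with the lowest EMA; unmeasured providers
/// are treated as latency 0 here, so they get tried first.
fn lowest_latency(emas: &[LatencyEma]) -> Option<usize> {
    emas.iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| a.get().unwrap_or(0.0).total_cmp(&b.get().unwrap_or(0.0)))
        .map(|(i, _)| i)
}

fn main() {
    let mut a = LatencyEma::new(0.3);
    let mut b = LatencyEma::new(0.3);
    a.record(100.0);
    b.record(40.0);
    b.record(50.0); // EMA moves toward 50: 0.3 * 50 + 0.7 * 40 ≈ 43
    assert!((b.get().unwrap() - 43.0).abs() < 1e-9);
    assert_eq!(lowest_latency(&[a, b]), Some(1));
    println!("ok");
}
```

The EMA smooths out one-off slow responses while still reacting to sustained degradation, which is why it is a common choice for latency-based routing.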
## Quick start

### As a library
```toml
[dependencies]
hoosh = "0.21"
```
A minimal sketch (the exact constructor and method names are assumptions; see the crate docs):

```rust
use hoosh::HooshClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // URL and API surface illustrative.
    let client = HooshClient::new("http://localhost:8087");
    let models = client.list_models().await?;
    println!("available models: {models:?}");
    Ok(())
}
```
### As a server
```sh
# Commands below are illustrative — the binary name and subcommands
# are assumptions; check `hoosh --help` for the actual CLI.

# Start the gateway
hoosh serve

# One-shot inference
hoosh infer "Hello"

# List models across all providers
hoosh models

# System info (hardware, providers)
hoosh info
```
## OpenAI-compatible API

The gateway exposes the standard OpenAI endpoints: `POST /v1/chat/completions` (with SSE streaming), `GET /v1/models`, and `POST /v1/embeddings`. Existing OpenAI clients should work by changing only the base URL.
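A representative chat-completions request body in the standard OpenAI wire format (model name illustrative):

```json
{
  "model": "llama3",
  "stream": true,
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Summarize the release notes." }
  ]
}
```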
## Features
| Feature | Backend | Default |
|---|---|---|
| `ollama` | Ollama REST API | yes |
| `llamacpp` | llama.cpp server | yes |
| `synapse` | Synapse server | yes |
| `lmstudio` | LM Studio API | yes |
| `localai` | LocalAI API | yes |
| `openai` | OpenAI API | yes |
| `anthropic` | Anthropic Messages API | yes |
| `deepseek` | DeepSeek API | yes |
| `mistral` | Mistral API | yes |
| `groq` | Groq API | yes |
| `openrouter` | OpenRouter API | yes |
| `grok` | xAI Grok API | yes |
| `whisper` | whisper.cpp STT | no |
| `piper` | Piper TTS | no |
| `hwaccel` | ai-hwaccel hardware detection | yes |
| `otel` | OpenTelemetry tracing | no |
| `all-providers` | All LLM providers | yes |
```toml
# Minimal: just Ollama + llama.cpp for local inference
hoosh = { version = "0.20", default-features = false, features = ["ollama", "llamacpp"] }

# With speech-to-text
hoosh = { version = "0.20", features = ["whisper"] }
```
## Key types
### `HooshClient`

HTTP client for downstream consumers. Speaks the OpenAI-compatible API.
```rust
// Illustrative — constructor and method signatures are assumptions.
let client = HooshClient::new("http://localhost:8087");
let healthy = client.health().await?;
let models = client.list_models().await?;
```
### `InferenceRequest` / `InferenceResponse`
```rust
use hoosh::{InferenceRequest, InferenceResponse};

// Field names are illustrative assumptions; see the crate docs.
let req = InferenceRequest {
    model: "llama3".into(),
    prompt: "Summarize this document.".into(),
    ..Default::default()
};
```
### `Router`
Provider selection with model pattern matching:
```rust
use hoosh::{Route, Router};
use hoosh::ProviderType;

// Illustrative — exact constructors and signatures are assumptions.
let routes = vec![
    Route::new(ProviderType::Ollama, "llama*"),
    Route::new(ProviderType::OpenAI, "gpt-*"),
];
let router = Router::new(routes);
let selected = router.select("llama3"); // → Ollama
```
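Model pattern matching of this kind can be illustrated with a simple trailing-glob matcher over a first-match-wins route table. A self-contained sketch of the idea, not hoosh's routing code:

```rust
/// Match a model name against a pattern where a trailing '*' is a wildcard,
/// e.g. "llama*" matches "llama3" and "llama3.1:8b".
fn matches(pattern: &str, model: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => model.starts_with(prefix),
        None => model == pattern,
    }
}

/// First-match-wins route table: (pattern, provider name).
fn select<'a>(routes: &[(&str, &'a str)], model: &str) -> Option<&'a str> {
    routes
        .iter()
        .find(|(pat, _)| matches(pat, model))
        .map(|(_, provider)| *provider)
}

fn main() {
    let routes = [("llama*", "ollama"), ("gpt-*", "openai"), ("*", "openrouter")];
    assert_eq!(select(&routes, "llama3"), Some("ollama"));
    assert_eq!(select(&routes, "gpt-4o"), Some("openai"));
    assert_eq!(select(&routes, "claude-3-5-sonnet"), Some("openrouter")); // catch-all
    println!("ok");
}
```

Ordering the table from most to least specific (with `"*"` last) gives predictable fallback behavior.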
### `TokenBudget`
Per-agent token accounting:
```rust
use hoosh::TokenBudget;

// Illustrative — method signatures are assumptions.
let mut budget = TokenBudget::new();
budget.add_pool("agent-1", 100_000);

// Before inference: reserve estimated tokens
budget.reserve("agent-1", 2_048)?;

// After inference: report actual usage
budget.report("agent-1", 1_312)?;
```
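The reserve/commit/release lifecycle can be sketched with a tiny pool type. Names and exact semantics here are illustrative assumptions, shown only to make the lifecycle concrete:

```rust
/// Minimal token pool with a reserve → commit/release lifecycle:
/// `reserve` holds an estimate, `commit` charges actual usage and frees
/// the hold, `release` frees the hold without charging (e.g. on error).
struct TokenPool {
    capacity: u64,
    used: u64,
    reserved: u64,
}

impl TokenPool {
    fn new(capacity: u64) -> Self {
        Self { capacity, used: 0, reserved: 0 }
    }

    fn available(&self) -> u64 {
        self.capacity - self.used - self.reserved
    }

    /// Hold `estimate` tokens before sending the request.
    fn reserve(&mut self, estimate: u64) -> Result<(), &'static str> {
        if estimate > self.available() {
            return Err("budget exhausted");
        }
        self.reserved += estimate;
        Ok(())
    }

    /// After the response: charge the actual token count, drop the hold.
    fn commit(&mut self, estimate: u64, actual: u64) {
        self.reserved -= estimate.min(self.reserved);
        self.used = (self.used + actual).min(self.capacity);
    }

    /// Request failed: drop the hold without charging.
    fn release(&mut self, estimate: u64) {
        self.reserved -= estimate.min(self.reserved);
    }
}

fn main() {
    let mut pool = TokenPool::new(10_000);
    pool.reserve(2_000).unwrap();
    assert_eq!(pool.available(), 8_000);
    pool.commit(2_000, 1_312); // actual usage was lower than the estimate
    assert_eq!(pool.available(), 8_688);
    pool.reserve(1_000).unwrap();
    pool.release(1_000); // failed request: nothing charged
    assert_eq!(pool.available(), 8_688);
    println!("ok");
}
```

Reserving before the call prevents concurrent requests from collectively overshooting the pool, since the estimate is held until the true cost is known.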
## Dependencies
| Crate | Role |
|---|---|
| `ai-hwaccel` | Hardware detection for model placement |
| `majra` | Priority queues, pub/sub events, heartbeat tracking |
| `axum` | HTTP server |
| `reqwest` | HTTP client for remote providers (rustls-tls) |
| `prometheus` | Metrics endpoint |
| `dashmap` | Thread-safe caches and registries |
| `hmac` + `sha2` | Audit chain cryptography |
| `whisper-rs` | whisper.cpp Rust bindings (optional) |
| `tokio` | Async runtime |
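The response cache pairs each entry with an insertion time and evicts once the TTL has elapsed. A self-contained sketch of TTL eviction using std collections (hoosh itself uses `DashMap` for thread-safe concurrent access; lazy eviction on read is one common strategy, shown here as an assumption):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Response cache with lazy TTL eviction: expired entries are
/// dropped when they are next looked up.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    fn insert(&mut self, key: &str, value: V, now: Instant) {
        self.entries.insert(key.to_string(), (now, value));
    }

    fn get(&mut self, key: &str, now: Instant) -> Option<&V> {
        let expired = match self.entries.get(key) {
            Some((inserted, _)) => now.duration_since(*inserted) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(key); // evict past-TTL entry
            return None;
        }
        self.entries.get(key).map(|(_, v)| v)
    }
}

fn main() {
    let t0 = Instant::now();
    let mut cache = TtlCache::new(Duration::from_secs(300));
    cache.insert("req-hash", "cached response", t0);
    assert!(cache.get("req-hash", t0 + Duration::from_secs(10)).is_some());
    assert!(cache.get("req-hash", t0 + Duration::from_secs(301)).is_none());
    println!("ok");
}
```

Passing `now` explicitly keeps the sketch deterministic and testable; production code would call `Instant::now()` internally.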
## Who uses this
| Project | Usage |
|---|---|
| AGNOS (llm-gateway) | Wraps hoosh as the system-wide inference gateway |
| tarang | Transcription, content description, AI media analysis |
| aethersafta | Real-time transcription/captioning for streams |
| AgnosAI | Agent crew LLM routing |
| Synapse | Inference backend + model management |
| All AGNOS consumer apps | Via daimon or direct HTTP |
## Roadmap
| Version | Milestone | Status |
|---|---|---|
| 0.20.3 | Core gateway + providers | Done |
| 0.21.5 | Auth, observability, messaging | Done |
| 0.22.3 | Tool use, context management, privacy routing | Next |
| 0.23.3 | Speech & audio improvements | Planned |
| 1.0.0 | Stable API, 90%+ coverage | Target |
Full details: `docs/development/roadmap.md`
## Building from source
```sh
# Build (all default providers, no whisper)
cargo build --release

# Build with whisper support (requires the whisper.cpp system lib)
cargo build --release --features whisper

# Run tests
cargo test

# Run all CI checks locally
make check
```
## Versioning

Pre-1.0 releases use `0.D.M` (day.month) SemVer — e.g. `0.20.3` = March 20th.
Post-1.0 follows standard SemVer.

The `VERSION` file is the single source of truth. Use `./scripts/version-bump.sh <version>` to update.
## License

AGPL-3.0-only. See `LICENSE` for details.
## Contributing

- Fork and create a feature branch
- Run `make check` (fmt + clippy + test + audit)
- Open a PR against `main`