oxillama-server
OpenAI-compatible HTTP API server for OxiLLaMa — drop-in replacement for llama-server.
Part of the OxiLLaMa workspace — a Pure Rust LLM inference engine.
Status
Version: 0.1.2 — Tests: 165 passing, 1 skipped — Status: Alpha (~98% complete)
What It Provides
Inference Endpoints
POST /v1/chat/completions— OpenAI chat completions (streaming via SSE + non-streaming)GET /v1/chat/ws— WebSocket streaming transport (alternative to SSE)POST /v1/completions— Legacy text completionsPOST /v1/embeddings— Text embedding extractionGET /v1/models— List available loaded modelsGET /health— Liveness probe
Batch API
POST /v1/batches— Submit a batch of inference requestsGET /v1/batches— List all batchesGET /v1/batches/:id— Get status and results for a batchPOST /v1/batches/:id/cancel— Cancel a pending or in-progress batch
Admin API (loopback-bound, bearer auth)
POST /admin/models/load— Load a model into the warm poolPOST /admin/models/unload— Unload a model from the warm poolGET /admin/models— List currently loaded models and pool stateGET /admin/stats— Runtime statistics and memory usage
Features
- Server-Sent Events (SSE) streaming with
deltachunked responses - WebSocket streaming as an alternative low-latency transport
- JSON request/response fully compatible with OpenAI SDK clients
- Tool/function calling — JSON Schema to GBNF grammar conversion,
tool_callsin streaming and non-streaming responses - Multi-model LRU warm-pool router (
router/pool.rs) — supports K simultaneously loaded models with LRU eviction - Batch disk-spool backend (
batch_spool/) — batch jobs persist across server restarts
Usage
Start the server from the CLI:
# Via the oxillama binary
# Or with extra options
Query it with curl:
|
Or use the official OpenAI Python SDK:
=
=
License
Apache-2.0 — COOLJAPAN OU (Team Kitasan)