Expand description
§oxillama-server
OpenAI-compatible HTTP API server for OxiLLaMa.
§Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completion |
| POST | /v1/completions | Text completion |
| POST | /v1/embeddings | Text embeddings |
| GET | /v1/models | List loaded models |
| GET | /health | Health check |
| POST | /v1/batches | Create batch job (disk-spooled) |
| GET | /v1/batches/:id | Retrieve batch job |
| GET | /v1/batches/:id/output | Stream batch output JSONL |
| POST | /v1/batches/:id/cancel | Cancel batch job |
| GET | /v1/batches | List batch jobs |
| POST | /v1/threads | Create Assistants API thread |
| GET | /v1/threads/:thread_id | Retrieve thread |
| POST | /v1/threads/:thread_id/messages | Append message to thread |
| GET | /v1/threads/:thread_id/messages | List thread messages |
| POST | /v1/threads/:thread_id/runs | Create and enqueue a run |
| GET | /v1/threads/:thread_id/runs/:run_id | Get run status |
| POST | /v1/threads/:thread_id/runs/:run_id/cancel | Cancel a run |
| POST | /admin/models/load | Background-load model (admin) |
| POST | /admin/models/unload | Unload model (admin) |
| GET | /admin/models | List model pool (admin) |
| GET | /admin/stats | Server stats (admin) |
| GET | /admin/health | Extended health (admin) |
| POST | /admin/loras | Register a LoRA adapter (admin) |
| DELETE | /admin/loras/{name} | Unregister a LoRA adapter (admin) |
| GET | /admin/loras | List registered LoRA adapters (admin) |
Re-exports§
pub use app::build_app;pub use auth::ApiKeys;pub use config::ServerConfig;pub use error::ServerError;pub use error::ServerResult;pub use metrics::Metrics;pub use queue::BatchRequest;pub use queue::LoraSelection;pub use queue::VocabBytes;pub use rate_limit::PerKeyRateLimiter;pub use rate_limit::RateLimiter;pub use responses_store::ResponseStore;pub use router::ModelLoader;pub use router::ModelPool;pub use router::ModelSpec;pub use shutdown::shutdown_signal;pub use shutdown::ShutdownSignal;pub use shutdown::ShutdownTrigger;pub use state::AppState;pub use threads::new_run_queue;pub use threads::RunQueueSender;pub use threads::ThreadStore;pub use worker::spawn_inference_worker;
Modules§
- admin
- Admin API — fleet management endpoints under
/admin/*. - app
- Application builder — constructs the axum router with all routes.
- auth
- Bearer-token authentication middleware.
- batch
- OpenAI-compatible Batch API (
/v1/batches). - batch_
spool - Disk-spooled OpenAI Batch API backend.
- body_
limit - Request body-size limit configuration.
- config
- Server configuration.
- error
- Error types for the HTTP API server.
- files_
store - Disk-backed persistent store for the Files API.
- metrics
- Pure-Rust Prometheus-compatible metrics using lock-free atomics.
- queue
- Request queue types for the continuous-batching inference worker.
- rate_
limit - Token-bucket rate limiter middleware.
- responses_
store - In-memory store for Responses API objects.
- router
- Multi-model LRU warm-pool router.
- routes
- API route handlers for the OpenAI-compatible server.
- shutdown
- Graceful shutdown handler.
- sse
- Server-Sent Events (SSE) streaming support.
- state
- Shared application state for the API server.
- threads
- OpenAI Assistants v2 API — threads, messages, runs, steps, and SSE streaming.
- tracing_
layer - Structured tracing middleware.
- worker
- Inference worker — drains the request queue on a dedicated blocking thread.
- ws
- WebSocket streaming endpoint for
/v1/chat/ws.