Crate oxillama_server

oxillama-server

OpenAI-compatible HTTP API server for OxiLLaMa.

| Method | Path | Description |
|--------|------|-------------|
| POST | `/v1/chat/completions` | Chat completion |
| POST | `/v1/completions` | Text completion |
| POST | `/v1/embeddings` | Text embeddings |
| GET | `/v1/models` | List loaded models |
| GET | `/health` | Health check |
| POST | `/v1/batches` | Create batch job (disk-spooled) |
| GET | `/v1/batches/:id` | Retrieve batch job |
| GET | `/v1/batches/:id/output` | Stream batch output JSONL |
| POST | `/v1/batches/:id/cancel` | Cancel batch job |
| GET | `/v1/batches` | List batch jobs |
| POST | `/v1/threads` | Create Assistants API thread |
| GET | `/v1/threads/:thread_id` | Retrieve thread |
| POST | `/v1/threads/:thread_id/messages` | Append message to thread |
| GET | `/v1/threads/:thread_id/messages` | List thread messages |
| POST | `/v1/threads/:thread_id/runs` | Create and enqueue a run |
| GET | `/v1/threads/:thread_id/runs/:run_id` | Get run status |
| POST | `/v1/threads/:thread_id/runs/:run_id/cancel` | Cancel a run |
| POST | `/admin/models/load` | Background-load model (admin) |
| POST | `/admin/models/unload` | Unload model (admin) |
| GET | `/admin/models` | List model pool (admin) |
| GET | `/admin/stats` | Server stats (admin) |
| GET | `/admin/health` | Extended health (admin) |
| POST | `/admin/loras` | Register a LoRA adapter (admin) |
| DELETE | `/admin/loras/:name` | Unregister a LoRA adapter (admin) |
| GET | `/admin/loras` | List registered LoRA adapters (admin) |

Modules

admin: Admin API, fleet management endpoints under /admin/*.
app: Application builder — constructs the axum router with all routes.
auth: Bearer-token authentication middleware.
batch: OpenAI-compatible Batch API (/v1/batches).
batch_spool: Disk-spooled OpenAI Batch API backend.
body_limit: Request body-size limit configuration.
config: Server configuration.
error: Error types for the HTTP API server.
files_store: Disk-backed persistent store for the Files API.
metrics: Pure-Rust Prometheus-compatible metrics using lock-free atomics.
queue: Request queue types for the continuous-batching inference worker.
rate_limit: Token-bucket rate limiter middleware.
responses_store: In-memory store for Responses API objects.
router: Multi-model LRU warm-pool router.
routes: API route handlers for the OpenAI-compatible server.
shutdown: Graceful shutdown handler.
sse: Server-Sent Events (SSE) streaming support.
state: Shared application state for the API server.
threads: OpenAI Assistants v2 API — threads, messages, runs, steps, and SSE streaming.
tracing_layer: Structured tracing middleware.
worker: Inference worker — drains the request queue on a dedicated blocking thread.
ws: WebSocket streaming endpoint for /v1/chat/ws.
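
The `rate_limit` module is described above as a token-bucket limiter. For readers unfamiliar with the technique, here is a minimal, std-only sketch of a token bucket. The `TokenBucket` type, its fields, and its method names are hypothetical and are not taken from this crate. The current time is passed in explicitly so the behavior is deterministic and easy to test.

```rust
use std::time::Instant;

/// Minimal token bucket: holds up to `capacity` tokens and refills at
/// `refill_per_sec` tokens per second. Hypothetical type for illustration.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Creates a bucket that starts full at time `now`.
    pub fn new(capacity: f64, refill_per_sec: f64, now: Instant) -> Self {
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec,
            last_refill: now,
        }
    }

    /// Tries to take one token at time `now`.
    /// Returns `true` if the request may proceed, `false` if it should be rejected.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Credit tokens for the time elapsed since the last call, capped at capacity.
        let elapsed = now
            .saturating_duration_since(self.last_refill)
            .as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With a capacity of 2 and a refill rate of 1 token per second, two back-to-back calls succeed, a third is rejected, and one more call succeeds a second later. A middleware built on this idea would typically keep one bucket per client key behind a lock and answer rejected requests with HTTP 429, though how this crate does so is not stated here.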