Skip to main content

Crate oxillama_server

Crate oxillama_server 

Source
Expand description

§oxillama-server

OpenAI-compatible HTTP API server for OxiLLaMa.

§Endpoints

MethodPathDescription
POST/v1/chat/completionsChat completion
POST/v1/completionsText completion
POST/v1/embeddingsText embeddings
GET/v1/modelsList loaded models
GET/healthHealth check
POST/v1/batchesCreate batch job (disk-spooled)
GET/v1/batches/:idRetrieve batch job
GET/v1/batches/:id/outputStream batch output JSONL
POST/v1/batches/:id/cancelCancel batch job
GET/v1/batchesList batch jobs
POST/v1/threadsCreate Assistants API thread
GET/v1/threads/:thread_idRetrieve thread
POST/v1/threads/:thread_id/messagesAppend message to thread
GET/v1/threads/:thread_id/messagesList thread messages
POST/v1/threads/:thread_id/runsCreate and enqueue a run
GET/v1/threads/:thread_id/runs/:run_idGet run status
POST/v1/threads/:thread_id/runs/:run_id/cancelCancel a run
POST/admin/models/loadBackground-load model (admin)
POST/admin/models/unloadUnload model (admin)
GET/admin/modelsList model pool (admin)
GET/admin/statsServer stats (admin)
GET/admin/healthExtended health (admin)
POST/admin/lorasRegister a LoRA adapter (admin)
DELETE/admin/loras/{name}Unregister a LoRA adapter (admin)
GET/admin/lorasList registered LoRA adapters (admin)

Re-exports§

pub use app::build_app;
pub use auth::ApiKeys;
pub use config::ServerConfig;
pub use error::ServerError;
pub use error::ServerResult;
pub use metrics::Metrics;
pub use queue::BatchRequest;
pub use queue::LoraSelection;
pub use queue::VocabBytes;
pub use rate_limit::PerKeyRateLimiter;
pub use rate_limit::RateLimiter;
pub use responses_store::ResponseStore;
pub use router::ModelLoader;
pub use router::ModelPool;
pub use router::ModelSpec;
pub use shutdown::shutdown_signal;
pub use shutdown::ShutdownSignal;
pub use shutdown::ShutdownTrigger;
pub use state::AppState;
pub use threads::new_run_queue;
pub use threads::RunQueueSender;
pub use threads::ThreadStore;
pub use worker::spawn_inference_worker;

Modules§

admin
Admin API — fleet management endpoints under /admin/*.
app
Application builder — constructs the axum router with all routes.
auth
Bearer-token authentication middleware.
batch
OpenAI-compatible Batch API (/v1/batches).
batch_spool
Disk-spooled OpenAI Batch API backend.
body_limit
Request body-size limit configuration.
config
Server configuration.
error
Error types for the HTTP API server.
files_store
Disk-backed persistent store for the Files API.
metrics
Pure-Rust Prometheus-compatible metrics using lock-free atomics.
queue
Request queue types for the continuous-batching inference worker.
rate_limit
Token-bucket rate limiter middleware.
responses_store
In-memory store for Responses API objects.
router
Multi-model LRU warm-pool router.
routes
API route handlers for the OpenAI-compatible server.
shutdown
Graceful shutdown handler.
sse
Server-Sent Events (SSE) streaming support.
state
Shared application state for the API server.
threads
OpenAI Assistants v2 API — threads, messages, runs, steps, and SSE streaming.
tracing_layer
Structured tracing middleware.
worker
Inference worker — drains the request queue on a dedicated blocking thread.
ws
WebSocket streaming endpoint for /v1/chat/ws.