Skip to main content

Module queue

oxillama_server

Module queue

Expand description

Request queue types for the continuous-batching inference worker.

Instead of each HTTP handler holding the engine mutex directly, every handler constructs a BatchRequest and sends it through a tokio::sync::mpsc::Sender. A single background worker receives these requests one at a time and drives the InferenceEngine, eliminating mutex contention across concurrent requests.

Structs§

ModelMeta: Metadata about the loaded model, cached at startup so route handlers do not need to hold a reference to the (now moved) engine.
UsageStats: Token usage statistics for a generation request.

Enums§

BatchRequest: A single inference request dispatched to the worker task.

Type Aliases§

LoraSelection: LoRA adapter selection for a single request.
StreamCallback: Callback invoked for each generated token during streaming.
VocabBytes: Vocabulary byte table: maps token ID to its UTF-8 byte sequence.