# oxibonsai-runtime
Inference runtime, sampling, and OpenAI-compatible server for OxiBonsai.
Ties together the core, kernels, model, and tokenizer crates into a production-ready inference stack with advanced sampling, SSE streaming, OpenAI API compatibility, Prometheus metrics, a circuit breaker, and comprehensive configuration.
Part of the OxiBonsai project.
## Features
- `InferenceEngine` — prefill + autoregressive decode loop
- `EngineBuilder`/`ConfigBuilder`/`SamplerBuilder` — ergonomic builder API
- Sampling presets: Greedy, Balanced, Creative, Code
- Advanced samplers: Mirostat v1/v2, Locally Typical, Eta, Min-P, adaptive
- `SamplerChain` — composable sampling pipeline (see the sketch after this list)
- Speculative decoding with draft/verify loop
- Beam search with configurable width, length penalty, n-gram blocking
- Token healing and context window management
- `InferencePipeline` — high-level generation API with stop reasons
- OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`
- SSE streaming for real-time token output
- Rate limiting, circuit breaker, CORS, tower middleware
- Admin API: `/admin/status`, `/admin/config`, `/admin/cache-stats`
- Prometheus metrics (`/metrics`): tokens/s, latency, request counts
- Health endpoint (`/health`) with readiness probes
- TOML configuration with layered loading (defaults → file → CLI)
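
The `SamplerChain` item above refers to composing independent sampling stages into one pipeline. This README does not spell out the crate's actual types, so the sketch below is a self-contained illustration of the technique only; `SamplerStage`, `Temperature`, `MinP`, and `run_chain` are hypothetical names, not oxibonsai-runtime API.

```rust
// A minimal, self-contained sketch of a composable sampling pipeline:
// each stage rewrites the logits in place, and the chain applies the
// stages in order before the final draw.
trait SamplerStage {
    fn apply(&self, logits: &mut [f32]);
}

/// Temperature scaling: divides every logit by `t` (t > 0).
struct Temperature(f32);
impl SamplerStage for Temperature {
    fn apply(&self, logits: &mut [f32]) {
        for l in logits.iter_mut() {
            *l /= self.0;
        }
    }
}

/// Min-P: masks tokens whose probability falls below `p * max_prob`.
struct MinP(f32);
impl SamplerStage for MinP {
    fn apply(&self, logits: &mut [f32]) {
        // Softmax-free shortcut: in logit space, prob(i) < p * prob(max)
        // is equivalent to logit(i) < logit(max) + ln(p).
        let max = logits.iter().fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        let cutoff = max + self.0.ln();
        for l in logits.iter_mut() {
            if *l < cutoff {
                *l = f32::NEG_INFINITY;
            }
        }
    }
}

/// Applies each stage in order, then picks the argmax (greedy for brevity;
/// a real chain would end in a weighted random draw).
fn run_chain(stages: &[Box<dyn SamplerStage>], mut logits: Vec<f32>) -> usize {
    for stage in stages {
        stage.apply(&mut logits);
    }
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let chain: Vec<Box<dyn SamplerStage>> =
        vec![Box::new(Temperature(0.8)), Box::new(MinP(0.05))];
    let token = run_chain(&chain, vec![1.0, 3.0, 2.5, -1.0]);
    println!("sampled token id: {token}");
}
```

Note that stage order matters: applying temperature first changes which tokens clear the Min-P cutoff, which is exactly why a composable chain is useful.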
## Feature Flags
| Flag | Description | Default |
|---|---|---|
| `server` | Axum HTTP server | ✅ enabled |
| `rag` | RAG server endpoints | disabled |
| `wasm` | WASM-safe build | disabled |
| `metal` | Metal GPU backend | disabled |
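
Feature selection uses standard Cargo syntax; for example, a downstream crate targeting WASM could drop the default Axum server and enable the `wasm` flag like this (crate name and version taken from the Usage section below):

```toml
[dependencies]
oxibonsai-runtime = { version = "0.1.0", default-features = false, features = ["wasm"] }
```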
## Usage
```toml
[dependencies]
oxibonsai-runtime = "0.1.0"
```
```rust
use oxibonsai_runtime::{EngineBuilder, SamplingPreset};

// Illustrative values; the original snippet's arguments did not survive
// extraction, and `SamplingPreset` is assumed from the presets listed
// under Features (Greedy, Balanced, Creative, Code).
let engine = EngineBuilder::new()
    .model_path("path/to/model.gguf")
    .preset(SamplingPreset::Balanced)
    .max_seq_len(4096)
    .build()?;
```
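
From there, generation goes through the high-level pipeline. A rough sketch of the shape follows; `generate` and the `text`/`stop_reason` output fields are illustrative names only, since this README does not document `InferencePipeline`'s exact methods:

```rust
// Hypothetical follow-on to the builder example above.
let output = engine.generate("Write a haiku about Rust.")?;
println!("{}", output.text);
// Stop reasons let callers distinguish EOS, max-length, and stop-string hits.
println!("stop reason: {:?}", output.stop_reason);
```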
## License
Apache-2.0 — COOLJAPAN OU