oxillama-py
Python bindings for OxiLLaMa — high-performance LLM inference from Python.
Part of the OxiLLaMa workspace — a Pure Rust LLM inference engine.
What It Provides
- Engine: load a GGUF model and generate text; releases the GIL during inference
- SpeculativeEngine: draft + target model pair for faster generation
- LoadedLora: load a LoRA adapter and hot-swap it onto an Engine
- Full Python type annotations and docstrings
- Wheels built with maturin
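
Because the GIL is released during inference, other Python threads can keep running while a generation call is in progress. The sketch below illustrates that effect only; it does not use the oxillama API. `fake_generate` is a hypothetical stand-in that blocks in `time.sleep`, which releases the GIL in the same way a native call can.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a native inference call that releases the GIL.
    # time.sleep also releases the GIL, so other threads run while it blocks.
    time.sleep(0.2)
    return prompt.upper()


def generate_concurrently(prompts: List[str], workers: int = 4) -> List[str]:
    # pool.map returns results in the same order as the input prompts.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_generate, prompts))


if __name__ == "__main__":
    start = time.perf_counter()
    results = generate_concurrently(["first", "second", "third"])
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")
```

With a GIL-releasing call, the three requests should overlap and finish in roughly the time of one call rather than the sum of all three. A call that held the GIL would serialize them.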
Installation
# or
Usage
# Load model
=
# Basic generation (GIL is released during the Rust inference call)
=
# Speculative decoding: 3-8x faster on large models
=
=
=
=
# LoRA adapter
=
=
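
For readers unfamiliar with the draft + target idea behind speculative decoding, here is a minimal pure-Python sketch of the greedy variant. It is not the SpeculativeEngine implementation or its API: `speculative_generate`, `target`, `draft`, and `k` are illustrative names, and the models are plain functions that map a token list to the next token.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_generate(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Greedy speculative decoding sketch (illustrative, not the library API)."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        ctx = list(out)
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target model checks each proposed token in order.
        #    A real engine would score all k positions in one batched pass;
        #    here the target is simply called once per position.
        for token in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own choice
            if expected != token:
                break  # first mismatch: discard the rest of the proposal
    return out[len(prompt):]


if __name__ == "__main__":
    # Toy models: the draft agrees with the target except at some positions.
    def toy_target(ctx: List[Token]) -> Token:
        return (ctx[-1] * 3 + 1) % 11

    def toy_draft(ctx: List[Token]) -> Token:
        return toy_target(ctx) if len(ctx) % 3 else 0

    print(speculative_generate(toy_target, toy_draft, [5], 10))
```

In this greedy form the output is identical to decoding with the target model alone. Any speedup comes from the target verifying several draft tokens per step, so it depends on how often the draft agrees with the target.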
License
Apache-2.0 — COOLJAPAN OU (Team Kitasan)