oxillama-py
Python bindings for OxiLLaMa — high-performance LLM inference from Python.
Part of the OxiLLaMa workspace — a Pure Rust LLM inference engine.
What It Provides
- `EngineConfig`: configuration dataclass for thread count, context size, tokenizer path, and sampler defaults
- `Engine`: load a GGUF model and generate text; releases the GIL during inference
- `AsyncEngine`: async/await interface; streams tokens to Python coroutines without blocking the event loop
- `SamplerConfig`: all ten sampler knobs, with `greedy()` and `mirostat_v2()` static constructors
- `SpeculativeConfig` / `SpeculativeEngine`: draft + target model pair for faster generation
- `Lora`: load a LoRA adapter and hot-swap it onto an `Engine`
- `Tokenizer`: first-class tokenizer object with `encode`, `decode`, `encode_batch`, `apply_chat_template`
- `CancellationToken`: cooperative cancellation handle accepted by `generate()` and `generate_streaming()`
- Structured exception hierarchy: `OxiLlamaError` → `LoadError`, `GenerateError`, `TokenizerError`, `GrammarError`, `QuantError`, `KvCacheFullError`
- Full Python type annotations (`.pyi` stubs) and docstrings
- Wheels built with maturin (ABI3, Python 3.8+)
- Optional numpy interop (`embed_numpy()`, `embed_batch_numpy()`, `forward_logits_numpy()`) via the `numpy` feature
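
The exception hierarchy listed above suggests layered error handling: catch a specific subclass where recovery is possible, and fall back to the base class for everything else. The sketch below uses stand-in classes that only mirror a few of the listed names, so it runs without the package installed. In real code these would presumably be imported from the bindings. The `run_generation` helper and its return values are hypothetical, not part of the library.

```python
# Stand-in classes that mirror names from the feature list.
# They are NOT the real bindings, only an illustration of the hierarchy.
class OxiLlamaError(Exception):
    """Base class for all library errors (stand-in)."""


class LoadError(OxiLlamaError):
    """Model could not be loaded (stand-in)."""


class GenerateError(OxiLlamaError):
    """Generation failed (stand-in)."""


class KvCacheFullError(OxiLlamaError):
    """KV cache ran out of space (stand-in)."""


def run_generation(generate, prompt):
    """Call a hypothetical `generate` callable with layered error handling."""
    try:
        return ("ok", generate(prompt))
    except KvCacheFullError:
        # Specific case listed first: a caller might retry with a shorter prompt.
        return ("retry_with_shorter_prompt", None)
    except OxiLlamaError as exc:
        # Catch-all for every other error derived from the base class.
        return ("failed", str(exc))
```

The subclass handler has to come before the base-class handler, because Python picks the first matching `except` clause.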
Status
Version: 0.1.2 — Tests: 81 passing
Installation
# or
Usage
- Load model
- Basic generation (the GIL is released during the Rust inference call)
- Streaming generation with a callback
- Async engine (non-blocking, event-loop friendly)
- Cooperative cancellation (stop from another thread)
- Speculative decoding: 3-8x faster on large models
- LoRA adapter
- Tokenizer
- HuggingFace Hub loader
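
To illustrate what cooperative cancellation from another thread means in practice, here is a minimal sketch built only on the standard library. `StandInToken` and `fake_generate` are hypothetical stand-ins, not the `CancellationToken` or `generate()` exposed by the bindings, whose exact signatures are not shown here. The point is the pattern: the generation loop polls the token between steps and stops on its own once another thread has flipped it.

```python
import threading
import time


class StandInToken:
    """Hypothetical stand-in for a cancellation handle, built on threading.Event."""

    def __init__(self):
        self._event = threading.Event()

    def cancel(self):
        # Safe to call from any thread.
        self._event.set()

    def is_cancelled(self):
        return self._event.is_set()


def fake_generate(token, max_tokens=1000, step_seconds=0.001):
    """Toy generation loop that checks the token before producing each token."""
    produced = []
    for i in range(max_tokens):
        if token.is_cancelled():
            break  # cooperative: the loop itself decides to stop
        produced.append(f"tok{i}")
        time.sleep(step_seconds)  # simulate per-token work
    return produced


token = StandInToken()
# Request cancellation from another thread after a short delay.
timer = threading.Timer(0.05, token.cancel)
timer.start()
tokens = fake_generate(token)
timer.join()
print(f"stopped after {len(tokens)} tokens")
```

Because cancellation is cooperative, the loop only stops at the next check, so the tokens produced before that point are still returned.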
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `numpy` | no | numpy interop for `embed_numpy()`, `embed_batch_numpy()`, `forward_logits_numpy()` |
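
Since the `numpy` feature is off by default, calling code may want to check for it before relying on the interop methods. The sketch below assumes that a build without the feature simply does not expose the three methods, which is not confirmed here. `has_numpy_interop` and the two dummy classes are hypothetical helpers for illustration only.

```python
# Method names taken from the feature table above.
NUMPY_METHODS = ("embed_numpy", "embed_batch_numpy", "forward_logits_numpy")


def has_numpy_interop(engine):
    """Return True if every numpy interop method is present and callable."""
    return all(callable(getattr(engine, name, None)) for name in NUMPY_METHODS)


class DummyEngineWithNumpy:
    """Dummy object standing in for an engine built with the feature enabled."""

    def embed_numpy(self): ...
    def embed_batch_numpy(self): ...
    def forward_logits_numpy(self): ...


class DummyEngineWithoutNumpy:
    """Dummy object standing in for an engine built without the feature."""


print(has_numpy_interop(DummyEngineWithNumpy()))
print(has_numpy_interop(DummyEngineWithoutNumpy()))
```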
License
Apache-2.0 — COOLJAPAN OU (Team Kitasan)