§oxillama-py
PyO3 Python bindings for the OxiLLaMa Pure-Rust LLM inference engine.
§Quick start
import oxillama_py
config = oxillama_py.EngineConfig(model_path="model.gguf", context_size=4096)
engine = oxillama_py.Engine(config)
engine.load_model()
text = engine.generate("Hello", max_tokens=128)
emb = engine.embed("Hello world") # List[float]
toks = engine.tokenize("Hello") # List[int]
engine.generate_streaming(
    "Hello",
    max_tokens=128,
    callback=lambda tok: print(tok, end="", flush=True),
)

§Module structure
| Python class | Rust source |
|---|---|
| `EngineConfig` | `engine.rs` |
| `Engine` | `engine.rs` |
| `SamplerConfig` | `sampler.rs` |
| `SpeculativeConfig` | `speculative.rs` |
| `SpeculativeEngine` | `speculative.rs` |
| `Lora` | `lora.rs` |
§Modules
- `async_support`: Async Python support for OxiLLaMa.
- `callback`: Python-callable streaming bridge utilities.
- `cancel`: Python-accessible `CancellationToken` for cooperative cancellation of generation.
- `chat_template`: Pure-Rust chat template engine for common HuggingFace prompt formats.
- `dlpack`: DLPack v0.8 capsule producer/consumer for f32 CPU tensors.
- `engine`: Python wrappers for `InferenceEngine` and `EngineConfig`.
- `error`: Error conversion from Rust error types to `pyo3::PyErr`.
- `lora`: Python wrapper for `LoadedLora`.
- `sampler`: Python wrapper for `SamplerConfig`.
- `snapshot`
- `speculative`: Python wrappers for `SpeculativeEngine` and `SpeculativeConfig`.
- `tokenizer`: Python wrapper for `TokenizerBridge` (standalone tokenizer access).
- `torch_interop`: Torch interop registration hook.
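
The Quick start passes a `callback` to `generate_streaming` that prints each token as it arrives. If the full text is also needed afterwards, a callable object can accumulate the pieces. The following is a minimal sketch, not part of the crate: `TokenCollector` and `fake_generate_streaming` are hypothetical names introduced for illustration, and the sketch assumes the callback receives one `str` piece per call, as the Quick start's `print(tok, ...)` suggests but does not confirm. The stand-in function exists only so the example runs without a model file.

```python
from typing import Callable, Iterable, List


class TokenCollector:
    """Hypothetical helper: a callable that records each streamed piece."""

    def __init__(self, echo: bool = False) -> None:
        self.pieces: List[str] = []
        self.echo = echo

    def __call__(self, tok: str) -> None:
        # Assumption: the engine invokes the callback with one text piece per call.
        self.pieces.append(tok)
        if self.echo:
            print(tok, end="", flush=True)

    @property
    def text(self) -> str:
        # Join the recorded pieces in arrival order.
        return "".join(self.pieces)


def fake_generate_streaming(
    pieces: Iterable[str], callback: Callable[[str], None]
) -> None:
    """Hypothetical stand-in for engine.generate_streaming, for demonstration only."""
    for piece in pieces:
        callback(piece)


collector = TokenCollector()
fake_generate_streaming(["Hel", "lo", ", ", "world"], collector)
result = collector.text
```

With a real engine, the same object would presumably be passed as `callback=collector` in place of the lambda shown in the Quick start, and `collector.text` read once the call returns.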