oxillama-py 0.1.3

Python bindings for the OxiLLaMa LLM inference engine.

Quickstart
==========

Installation
------------

.. code-block:: bash

   pip install oxillama-py

Or build from source with `maturin <https://github.com/PyO3/maturin>`_:

.. code-block:: bash

   cd crates/oxillama-py
   maturin develop --features pyo3/extension-module

Basic Usage
-----------

.. code-block:: python

   from oxillama_py import Engine, EngineConfig, SamplerConfig

   config = EngineConfig("model.gguf", num_threads=4)
   engine = Engine(config)

   # Generate text
   result = engine.generate("What is Rust?", max_tokens=100)
   print(result)

   # Streaming generation
   def on_token(tok: str, token_id: int, is_final: bool) -> None:
       print(tok, end="", flush=True)

   engine.generate_streaming("Tell me about COOLJAPAN.", callback=on_token)
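
The streaming callback above only prints each token. When the full text is also needed afterwards, a callable object with the same ``(tok, token_id, is_final)`` signature can accumulate the pieces. The sketch below is illustrative only: ``TokenCollector`` is a hypothetical helper, not part of ``oxillama_py``. It assumes the callback is invoked once per token with positional arguments and that ``generate_streaming`` accepts any callable. The calls at the end simulate a stream with hand-written values, so the block runs without a model.

.. code-block:: python

   from typing import List


   class TokenCollector:
       """Hypothetical helper: accumulates streamed tokens for later use."""

       def __init__(self) -> None:
           self.pieces: List[str] = []
           self.token_ids: List[int] = []
           self.finished = False

       def __call__(self, tok: str, token_id: int, is_final: bool) -> None:
           # Same signature as the on_token callback shown above.
           self.pieces.append(tok)
           self.token_ids.append(token_id)
           if is_final:
               self.finished = True

       @property
       def text(self) -> str:
           # Join the token strings in arrival order.
           return "".join(self.pieces)


   # Simulated stream; real values would come from the engine.
   collector = TokenCollector()
   collector("Hello", 1, False)
   collector(" world", 2, True)
   print(collector.text)

Under the assumptions above, the same object could be passed as ``callback=collector`` to ``engine.generate_streaming``.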

Loading from the Hugging Face Hub
---------------------------------

.. code-block:: python

   from oxillama_py import Engine

   engine = Engine.from_hub(
       "TheBloke/Llama-2-7B-GGUF",
       filename="llama-2-7b.Q4_K_M.gguf",
   )
   print(engine.generate("Hello!", max_tokens=50))